Explore vs Exploit in the Age of AI

Balancing exploration vs exploitation by tuning $\epsilon$ so $\mathbb{E}[R \mid \pi_\epsilon]$ stays above ship-it.

9/24/2025


When a team works on a greenfield project, every hour can go to one of two buckets:

  • Exploit: double down on what we already know, refine execution, push features out the door.
  • Explore: read broadly, try new tools, sketch architectures that might pay off later.

This is the classic explore–exploit tradeoff: spend too much time exploiting and you risk missing the higher hill next door; spend too much time exploring and nothing ships.


Building a simple model

Let’s build a toy model from first principles so we can reason about the policy knob $u_t$ (time share on exploration).

1) States and choices

At time $t$ a developer has:

  • $S_t$: execution skill (how fast and clean they ship).
  • $B_t$: breadth (how many patterns, tools, and mental models they can reach for).

They split the next unit of time:

  • $u_t \in [0,1]$ on exploration (reading, prototyping).
  • $1 - u_t$ on exploitation (building).

2) How skill and breadth evolve

Practice helps with diminishing returns; knowledge decays without reinforcement:

$$S_{t+1} = S_t + a\,(1 - u_t)\,g(S_t) + c\,u_t\,h(B_t) - \phi_S S_t$$
$$B_{t+1} = B_t + b\,u_t - \phi_B B_t$$

with $g(S) = 1/(1+S)$ and $h(B) = 1 + k_B B$ as simple monotone choices. Parameters $a, b, c, \phi_S, \phi_B, k_B > 0$.
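A minimal sketch of these dynamics in Python; the constants are illustrative assumptions, not calibrated values:

```python
# Illustrative constants for the dynamics above (assumptions, not estimates).
A, B_RATE, C = 0.5, 0.15, 0.2     # a, b, c: learning rates
PHI_S, PHI_B = 0.02, 0.05         # decay of skill and breadth
K_B = 0.5                         # breadth boost inside h(B)

def g(S):
    """Diminishing returns to practice: g(S) = 1 / (1 + S)."""
    return 1.0 / (1.0 + S)

def h(B):
    """Breadth makes exploration time pay more: h(B) = 1 + k_B * B."""
    return 1.0 + K_B * B

def step_state(S, B, u):
    """One step of S_{t+1}, B_{t+1} given exploration share u in [0, 1]."""
    S_next = S + A * (1 - u) * g(S) + C * u * h(B) - PHI_S * S
    B_next = B + B_RATE * u - PHI_B * B
    return S_next, B_next
```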

3) Where output comes from

Two channels:

  • Core execution (planned work), complementarity between skill and breadth:
$$P_{\text{core}}(S_t, B_t) = \alpha\, S_t^{\eta}\, (1 + k\, B_t)$$
  • Opportunistic wins (serendipity), where idea arrivals rise with breadth and payoff grows with both:
$$P_{\text{opp}}(S_t, B_t) = \lambda(B_t)\,\bar{V}(S_t, B_t)$$

with $\lambda(B_t) = \lambda_0 + \lambda_1 B_t$ and $\bar{V}(S_t, B_t) = v_0 + v_S S_t + v_B B_t + v_{SB} S_t B_t$.

Total output per step:

$$R_t = P_{\text{core}}(S_t, B_t) + P_{\text{opp}}(S_t, B_t)$$

We track discounted payoff $PV_t = \sum_{s \le t} \beta^s R_s$ for $\beta \in (0,1)$.
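As a sketch, the two channels and the discounted payoff can be computed like this; the parameters live in a dict so that swapping in an "AI world" later is just a second dict (all values are illustrative assumptions):

```python
def output(S, B, p):
    """Per-step output R_t = P_core + P_opp under parameter dict p."""
    core = p["alpha"] * S ** p["eta"] * (1 + p["k"] * B)
    opp = (p["lam0"] + p["lam1"] * B) * (
        p["v0"] + p["v_S"] * S + p["v_B"] * B + p["v_SB"] * S * B)
    return core + opp

def discounted_payoff(rewards, beta=0.95):
    """PV_t = sum over s <= t of beta^s * R_s for a sequence of rewards."""
    return sum(beta ** s * r for s, r in enumerate(rewards))
```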


How AI changes the picture

AI does not change the structure; it changes the slopes.

  • Execution gets cheaper. Autocomplete, tests, scaffolding make routine work less differentiating. Parameters $\alpha, \eta, k$ go down.
  • Exploration pays faster. Prototypes get cheaper, so breadth turns into wins more directly. Parameters $\lambda_1, v_{SB}$ go up.

In symbols, the same $R_t$ form with primed parameters:

$$R^{\text{AI}}_t = \alpha' S_t^{\eta'} (1 + k' B_t) + (\lambda_0 + \lambda_1' B_t)\bigl(v_0 + v_S' S_t + v_B B_t + v_{SB}' S_t B_t \bigr)$$
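Continuing the sketch, the same output function can be run under two parameter sets: a pre-AI baseline and a with-AI variant. The direction of each shift follows the bullets above; the magnitudes are assumptions chosen only for illustration:

```python
PRE_AI = dict(alpha=1.0, eta=0.6, k=0.4,
              lam0=0.05, lam1=0.3,
              v0=0.5, v_S=0.2, v_B=0.3, v_SB=0.4)

# With AI: execution slopes (alpha, eta, k) go down,
# exploration slopes (lambda_1, v_SB) go up.
WITH_AI = dict(PRE_AI, alpha=0.7, eta=0.5, k=0.3, lam1=0.5, v_SB=0.7)

def simulate(params, u=0.2, T=50, S0=1.0, B0=0.1):
    """Run T steps at a fixed exploration share u, using step_state() and
    output() from the earlier sketches; returns the list of R_t."""
    S, B, rewards = S0, B0, []
    for _ in range(T):
        rewards.append(output(S, B, params))
        S, B = step_state(S, B, u)
    return rewards
```

Comparing `simulate(WITH_AI, u=0.2)` against `simulate(WITH_AI, u=0.0)` is the "policy vs no reading" comparison in the charts below; swapping in `PRE_AI` gives the world overlay.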

Try it yourself


Output over time

Compare your chosen policy against a no-reading baseline. Toggle "Compare worlds" to overlay Pre‑AI and With‑AI.

[Figure: Output $R_t$, policy vs no reading, Pre-AI vs With-AI. Time ($t$) on the x-axis, $R_t$ on the y-axis.]

State trajectories

Skill $S_t$ in the first chart, breadth $B_t$ in the second. Notice how a small floor on $u_t$ prevents $B_t$ from eroding.

[Figure: Exploit trajectory, skill $S_t$, policy vs no reading, Pre-AI vs With-AI. Time ($t$) on the x-axis, $S_t$ on the y-axis.]
[Figure: Explore trajectory, breadth $B_t$, policy vs no reading, Pre-AI vs With-AI. Time ($t$) on the x-axis, $B_t$ on the y-axis.]

Discounted payoff

Cumulative $PV_t = \sum_{s \le t} \beta^s R_s$. This makes the long-horizon effect visible: with AI, exploration compounding shows up sooner.

[Figure: Discounted payoff $PV_t$ over time, policy vs no reading, Pre-AI vs With-AI. Time ($t$) on the x-axis, $PV_t$ on the y-axis.]

Policy in plain English

  • Front-load exploration early in greenfield (20 to 30 percent), then taper.
  • Maintain a floor (for example, 10 percent) so breadth does not decay under deadline pressure; a schedule with both properties is sketched after this list.
  • Demand artifacts from exploration (memos, experiments, small frameworks).
  • Measure step-change wins: deletions that simplify code, cost cliffs, 2x speedups.
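One concrete schedule with the front-load and floor properties, sketched as code; the shape and constants are assumptions, not something derived from the model:

```python
def exploration_share(t, start=0.30, floor=0.10, half_life=10.0):
    """Exploration time share u_t: front-loaded near `start`, decaying toward
    `floor` with the given half-life, never dropping below the floor."""
    return floor + (start - floor) * 0.5 ** (t / half_life)
```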

Why breadth multiplies execution

If breadth raises the number of viable options from $N$ to $N'$, the expected best choice improves roughly like the expected maximum of $N$ draws:

$$\mathbb{E}[\max X_{1..N}] \approx \mu + \sigma \sqrt{2\ln N}.$$

That “better choice” bump is what $(1 + k B)$ is standing in for.
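A quick numeric check, assuming the options are i.i.d. normal draws: the $\sqrt{2\ln N}$ form is the standard Gaussian asymptotic, so it overstates the level at small $N$ but captures the slow, logarithmic growth that matters here:

```python
import math
import random

def expected_max_mc(n, mu=0.0, sigma=1.0, trials=20_000, seed=0):
    """Monte Carlo estimate of E[max of n i.i.d. normal draws]."""
    rng = random.Random(seed)
    return sum(max(rng.gauss(mu, sigma) for _ in range(n))
               for _ in range(trials)) / trials

def expected_max_approx(n, mu=0.0, sigma=1.0):
    """The mu + sigma * sqrt(2 ln n) approximation from the text."""
    return mu + sigma * math.sqrt(2 * math.log(n))

for n in (5, 20, 100):
    print(n, round(expected_max_mc(n), 2), round(expected_max_approx(n), 2))
```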


Closing thought

Pre‑AI, explore vs exploit was a knife edge. With AI, execution is cheaper and exploration is more valuable. The trick is not to abandon building, but to raise the floor on reading and thinking so today’s breadth becomes tomorrow’s speed.