Explore vs Exploit in the Age of AI

Balancing exploration vs exploitation by tuning $\epsilon$ so $\mathbb{E}[R \mid \pi_\epsilon]$ stays above ship-it.

9/24/2025


When a team works on a greenfield project, every hour can go to one of two buckets:

  • Exploit: double down on what we already know, refine execution, push features out the door.
  • Explore: read broadly, try new tools, sketch architectures that might pay off later.

This is the classic explore–exploit tradeoff: spend too much time exploiting and you risk missing the higher hill next door; spend too much time exploring and nothing ships.


Building a simple model

Let’s build a toy model from first principles so we can reason about the policy knob $u_t$ (time share on exploration).

1) States and choices

At time $t$ a developer has:

  • $S_t$: execution skill (how fast and clean they ship).
  • $B_t$: breadth (how many patterns, tools, and mental models they can reach for).

They split the next unit of time:

  • $u_t \in [0,1]$ on exploration (reading, prototyping).
  • $1 - u_t$ on exploitation (building).

2) How skill and breadth evolve

Practice helps with diminishing returns; knowledge decays without reinforcement:

$$S_{t+1} = S_t + a\,(1 - u_t)\,g(S_t) + c\,u_t\,h(B_t) - \phi_S S_t$$
$$B_{t+1} = B_t + b\,u_t - \phi_B B_t$$

with $g(S) = 1/(1+S)$ and $h(B) = 1 + k_B B$ as simple monotone choices. Parameters $a, b, c, \phi_S, \phi_B, k_B > 0$.
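A minimal sketch of these dynamics in Python; the constants are illustrative assumptions, not calibrated values:

```python
# Illustrative constants for the dynamics above (assumptions, not estimates).
A, B_RATE, C = 0.5, 0.15, 0.2     # a, b, c: learning rates
PHI_S, PHI_B = 0.02, 0.05         # decay of skill and breadth
K_B = 0.5                         # breadth boost inside h(B)

def g(S):
    """Diminishing returns to practice: g(S) = 1 / (1 + S)."""
    return 1.0 / (1.0 + S)

def h(B):
    """Breadth makes exploration time pay more: h(B) = 1 + k_B * B."""
    return 1.0 + K_B * B

def step_state(S, B, u):
    """One step of S_{t+1}, B_{t+1} given exploration share u in [0, 1]."""
    S_next = S + A * (1 - u) * g(S) + C * u * h(B) - PHI_S * S
    B_next = B + B_RATE * u - PHI_B * B
    return S_next, B_next
```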

3) Where output comes from

Two channels:

  • Core execution (planned work), complementarity between skill and breadth:
$$P_{\text{core}}(S_t, B_t) = \alpha\, S_t^{\eta}\, (1 + k\, B_t)$$
  • Opportunistic wins (serendipity), where idea arrivals rise with breadth and payoff grows with both:
$$P_{\text{opp}}(S_t, B_t) = \lambda(B_t)\,\bar{V}(S_t, B_t)$$

with $\lambda(B_t) = \lambda_0 + \lambda_1 B_t$ and $\bar{V}(S_t, B_t) = v_0 + v_S S_t + v_B B_t + v_{SB} S_t B_t$.

Total output per step:

$$R_t = P_{\text{core}}(S_t, B_t) + P_{\text{opp}}(S_t, B_t)$$

We track discounted payoff $PV_t = \sum_{s \le t} \beta^s R_s$ for $\beta \in (0,1)$.
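As a sketch, the two channels and the discounted payoff can be computed like this; the parameters live in a dict so that swapping in an "AI world" later is just a second dict (all values are illustrative assumptions):

```python
def output(S, B, p):
    """Per-step output R_t = P_core + P_opp under parameter dict p."""
    core = p["alpha"] * S ** p["eta"] * (1 + p["k"] * B)
    opp = (p["lam0"] + p["lam1"] * B) * (
        p["v0"] + p["v_S"] * S + p["v_B"] * B + p["v_SB"] * S * B)
    return core + opp

def discounted_payoff(rewards, beta=0.95):
    """PV_t = sum over s <= t of beta^s * R_s for a sequence of rewards."""
    return sum(beta ** s * r for s, r in enumerate(rewards))
```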


How AI changes the picture

AI does not change the structure; it changes the slopes.

  • Execution gets cheaper. Autocomplete, tests, scaffolding make routine work less differentiating. Parameters $\alpha, \eta, k$ go down.
  • Exploration pays faster. Prototypes get cheaper, so breadth turns into wins more directly. Parameters $\lambda_1, v_{SB}$ go up.

In symbols, the same $R_t$ form with primed parameters:

$$R^{\text{AI}}_t = \alpha' S_t^{\eta'} (1 + k' B_t) + (\lambda_0 + \lambda_1' B_t)\bigl(v_0 + v_S' S_t + v_B B_t + v_{SB}' S_t B_t \bigr)$$
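Continuing the sketch, the same output function can be run under two parameter sets: a pre-AI baseline and a with-AI variant. The direction of each shift follows the bullets above; the magnitudes are assumptions chosen only for illustration:

```python
PRE_AI = dict(alpha=1.0, eta=0.6, k=0.4,
              lam0=0.05, lam1=0.3,
              v0=0.5, v_S=0.2, v_B=0.3, v_SB=0.4)

# With AI: execution slopes (alpha, eta, k) go down,
# exploration slopes (lambda_1, v_SB) go up.
WITH_AI = dict(PRE_AI, alpha=0.7, eta=0.5, k=0.3, lam1=0.5, v_SB=0.7)

def simulate(params, u=0.2, T=50, S0=1.0, B0=0.1):
    """Run T steps at a fixed exploration share u, using step_state() and
    output() from the earlier sketches; returns the list of R_t."""
    S, B, rewards = S0, B0, []
    for _ in range(T):
        rewards.append(output(S, B, params))
        S, B = step_state(S, B, u)
    return rewards
```

Comparing `simulate(WITH_AI, u=0.2)` against `simulate(WITH_AI, u=0.0)` is the "policy vs no reading" comparison in the charts below; swapping in `PRE_AI` gives the world overlay.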

Try it yourself


Output over time

Compare your chosen policy against a no-reading baseline. Toggle "Compare worlds" to overlay Pre‑AI and With‑AI.

[Figure: Output $R_t$, policy vs no reading, Pre-AI vs With-AI. Time ($t$) on the x-axis, $R_t$ on the y-axis.]

State trajectories

Skill $S_t$ in the first chart, breadth $B_t$ in the second. Notice how a small floor on $u_t$ prevents $B_t$ from eroding.

[Figure: Exploit trajectory, skill $S_t$, policy vs no reading, Pre-AI vs With-AI. Time ($t$) on the x-axis, $S_t$ on the y-axis.]
[Figure: Explore trajectory, breadth $B_t$, policy vs no reading, Pre-AI vs With-AI. Time ($t$) on the x-axis, $B_t$ on the y-axis.]

Discounted payoff

Cumulative $PV_t = \sum_{s \le t} \beta^s R_s$. This makes the long-horizon effect visible: with AI, exploration compounding shows up sooner.

[Figure: Discounted payoff $PV_t$ over time, policy vs no reading, Pre-AI vs With-AI. Time ($t$) on the x-axis, $PV_t$ on the y-axis.]

Policy in plain English

  • Front-load exploration early in greenfield (20 to 30 percent), then taper.
  • Maintain a floor (for example, 10 percent) so breadth does not decay under deadline pressure; a schedule with both properties is sketched after this list.
  • Demand artifacts from exploration (memos, experiments, small frameworks).
  • Measure step-change wins: deletions that simplify code, cost cliffs, 2x speedups.
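One concrete schedule with the front-load and floor properties, sketched as code; the shape and constants are assumptions, not something derived from the model:

```python
def exploration_share(t, start=0.30, floor=0.10, half_life=10.0):
    """Exploration time share u_t: front-loaded near `start`, decaying toward
    `floor` with the given half-life, never dropping below the floor."""
    return floor + (start - floor) * 0.5 ** (t / half_life)
```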

Why breadth multiplies execution

If breadth raises the number of viable options from $N$ to $N'$, the expected best choice improves roughly like the expected maximum of $N$ draws:

$$\mathbb{E}[\max X_{1..N}] \approx \mu + \sigma \sqrt{2\ln N}.$$

That “better choice” bump is what $(1 + k B)$ is standing in for.
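A quick numeric check, assuming the options are i.i.d. normal draws: the $\sqrt{2\ln N}$ form is the standard Gaussian asymptotic, so it overstates the level at small $N$ but captures the slow, logarithmic growth that matters here:

```python
import math
import random

def expected_max_mc(n, mu=0.0, sigma=1.0, trials=20_000, seed=0):
    """Monte Carlo estimate of E[max of n i.i.d. normal draws]."""
    rng = random.Random(seed)
    return sum(max(rng.gauss(mu, sigma) for _ in range(n))
               for _ in range(trials)) / trials

def expected_max_approx(n, mu=0.0, sigma=1.0):
    """The mu + sigma * sqrt(2 ln n) approximation from the text."""
    return mu + sigma * math.sqrt(2 * math.log(n))

for n in (5, 20, 100):
    print(n, round(expected_max_mc(n), 2), round(expected_max_approx(n), 2))
```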


Closing thought

Pre‑AI, explore vs exploit was a knife edge. With AI, execution is cheaper and exploration is more valuable. The trick is not to abandon building, but to raise the floor on reading and thinking so today’s breadth becomes tomorrow’s speed.