November 2, 2025

Deal Sourcing as a Contextual Bandit

Budgeted Thompson Sampling with $a_t = \arg\max_x \theta_t^\top x$ under $\sum_t c_t \le B$.
private equity · origination · bayesian · bandits · thompson-sampling · strategy · decision-theory

Private-market origination is the definition of explore–exploit. You can pound the same broker list and milk familiar sectors—exploit—or scout new channels, geographies, and founder communities—explore. The more you over-mine a vein, the worse its marginal returns. The more you explore, the more you risk wasting cycles.

This post turns that trade-off into a contextual bandit with a budget. The policy is Thompson Sampling: sample plausible worlds from your posterior, choose the best channels in that world subject to cost, then learn from what happened. The outcome is a weekly, auditable allocation that lifts IOI/LOI yield and surfaces new seams.

IOI stands for Indication of Interest and LOI stands for Letter of Intent. When we say “IOI/LOI yield,” we mean the proportion of outreach attempts that advance to those milestones. Higher yield means more real deals per unit of origination effort.


Motivation in one picture

Diminishing returns are real: spam a broker and your hit-rate falls; work one sector and you saturate. Meanwhile, new channels (founder communities, niche events, newsletter ads, cold outreach) are uncertain but sometimes rich.

Feel how quickly returns decay as spend climbs. Each ribbon shows the posterior mean meetings per hour and its 80% credible band; dollar-for-dollar comparisons jump out visually once channels start to flatten.

[Figure: Diminishing returns by channel — meetings per unit spend vs. weekly spend (hours or $k), with 80% credible bands, for Broker Network, Founder Communities, Sector Events, Content & Inbound, and Cold Outbound.]

Data schema (keep it thin, keep it clean)

Each candidate outreach (an “arm pull”) has:

  • Context features $x \in \mathbb{R}^p$:
    Channel (broker, inbound, events, cold, content), Sector, Region, Stage of target (seed, growth, mature), and Cost $c$ (cash + time).
  • Funnel outcomes recorded as Bernoulli stages:
    • $Y^{(1)}$ = Meeting set (0/1)
    • $Y^{(2)}$ = IOI (0/1)
    • $Y^{(3)}$ = LOI (0/1)
    • $Y^{(4)}$ = Closed (0/1)
  • Deal value proxy $V$ (optional): expected NPV of a close (can be a fixed target value or a model of quality).

We will treat each stage’s probability as a function of $x$, and define a value per pull.

Value per pull: if $S$ stages must all succeed to “close,” an actionable expectation is

$$\mathbb{E}[R \mid x] \;=\; \mathbb{E}[V \mid x]\;\prod_{s=1}^{S} p_s(x),$$

with $p_s(x) = \Pr\!\big(Y^{(s)}=1 \mid x\big)$ and $S \in \{3,4\}$. If $V$ is unknown, set $\mathbb{E}[V \mid x] = V_0$ as a policy target and refine later.
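As a quick check of the arithmetic with hypothetical numbers: with stage probabilities $p_1 = 0.3$ (meeting), $p_2 = 0.4$ (IOI), and $p_3 = 0.5$ (LOI), and a policy target $V_0 = 2.0$ (millions of expected NPV), one pull is worth $2.0 \times 0.3 \times 0.4 \times 0.5 = 0.12$, i.e. about 120k of expected value before subtracting its cost.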


A Bayesian model you can actually run

Stage models

Start simple: Beta–Bernoulli per channel for the top stage (meeting) and graduate to contextual logistic as data thicken.

  • Non-contextual (cold-start): for channel $a$ and stage $s$,

    $$p_{s,a} \sim \mathrm{Beta}(\alpha_{s,a}, \beta_{s,a}), \qquad Y^{(s)} \mid p_{s,a} \sim \mathrm{Bernoulli}(p_{s,a}).$$

    Posterior updates are additive: $\alpha \leftarrow \alpha + y$, $\beta \leftarrow \beta + (1-y)$. (A minimal cold-start sampling sketch follows this list.)

  • Contextual (scales with features): for stage $s$,

    $$\Pr\!\big(Y^{(s)}=1 \mid x, \beta_s\big) = \sigma\!\big(x^\top \beta_s\big), \qquad \beta_s \sim \mathcal{N}(0, \Sigma_{0,s}),$$

    with $\sigma(z) = (1+e^{-z})^{-1}$.

    Practicalities: fit by MCMC (e.g., Pólya–Gamma augmentation) or Laplace/variational approximations. For thin data, make it hierarchical so sectors/regions share strength: partial pooling on intercepts or per-channel slopes.
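Here is a minimal cold-start sketch of the Beta–Bernoulli arm in plain NumPy. The channel names, the starting counts, and the single-stage focus on “meeting set” are illustrative assumptions, not a prescription.

# Cold-start Thompson step with per-channel Beta–Bernoulli posteriors (illustrative).
import numpy as np

rng = np.random.default_rng(7)

# Posterior counts per channel: {channel: [alpha, beta]} (hypothetical starting values)
posterior = {
    "broker":  [12.0, 48.0],   # informative start from historical hit-rates
    "events":  [3.0, 17.0],
    "founder": [1.0, 1.0],     # uniform prior: genuinely unknown channel
}

def thompson_draw(posterior, rng):
    """Sample one plausible meeting-rate per channel from its Beta posterior."""
    return {ch: rng.beta(a, b) for ch, (a, b) in posterior.items()}

def update(posterior, channel, y):
    """Conjugate update after one pull: alpha += y, beta += 1 - y."""
    posterior[channel][0] += y
    posterior[channel][1] += 1 - y

# One round: sample a world, act on the best channel in that world, then learn.
sampled = thompson_draw(posterior, rng)
best = max(sampled, key=sampled.get)
y_observed = 1  # placeholder: did the outreach convert to a meeting?
update(posterior, best, y_observed)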

Reward and cost

Let net value of one pull be

$$\mathrm{EV}(x) \;=\; \underbrace{\mathbb{E}[V \mid x]}_{\text{target value}}\;\prod_{s=1}^{S} p_s(x) \;-\; c(x).$$
  • $c(x)$ should include time in hours × cost/hour, fees, and broker dilution.
  • If you only care about probability of LOI, set $\mathbb{E}[V \mid x] = 1$ and treat EV as a yield measure.

Thompson Sampling with a budget

At decision time $t$ you have a weekly budget $B_t$ (dollars or hours) and a catalog of candidate pulls $\mathcal{A}_t$ with contexts $\{x_a\}$ and costs $\{c_a\}$.

Thompson Sampling step:

  1. Sample parameters from the posterior:

    $$\tilde{\beta}_s \sim p(\beta_s \mid \text{data}), \qquad \tilde{V}_a \sim p(V \mid x_a, \text{data}).$$
  2. Score each candidate:

    $$\widetilde{\mathrm{EV}}_a \;=\; \tilde{V}_a \prod_{s=1}^{S} \sigma(x_a^\top \tilde{\beta}_s) \;-\; c_a, \qquad \widetilde{\mathrm{ROI}}_a \;=\; \frac{\tilde{V}_a \prod_s \sigma(x_a^\top \tilde{\beta}_s)}{c_a}.$$
  3. Select under budget (a knapsack): choose a set StAtS_t \subseteq \mathcal{A}_t solving

    $$\max_{S_t} \sum_{a \in S_t} \widetilde{\mathrm{EV}}_a \quad \text{s.t.} \quad \sum_{a \in S_t} c_a \le B_t,\ \text{and coverage constraints if any.}$$

    Use a greedy ranking by $\widetilde{\mathrm{ROI}}$; when individual costs are small relative to $B_t$, it is a strong approximation to the knapsack optimum.

  4. Execute, observe stage outcomes, and update posteriors.
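A compact sketch of steps 2 and 3, given one posterior draw per stage from step 1 and a fixed policy target $V_0$. The candidate schema and the positive-EV filter are illustrative choices, not part of the formal knapsack.

# One budgeted Thompson step: score candidates under a sampled world, fill greedily.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def thompson_allocate(candidates, beta_draws, budget, V0=1.0):
    """
    candidates : list of dicts with keys "x" (feature vector) and "cost"
    beta_draws : one posterior draw per stage, each an array of shape (p,)
    budget     : total spend available this round (same units as "cost")
    """
    scored = []
    for cand in candidates:
        p_stages = [sigmoid(cand["x"] @ beta) for beta in beta_draws]
        value = V0 * np.prod(p_stages)     # sampled value of one pull
        ev = value - cand["cost"]          # sampled net value
        roi = value / cand["cost"]         # sampled value per unit cost
        scored.append((roi, ev, cand))
    scored.sort(key=lambda t: t[0], reverse=True)   # greedy knapsack by ROI

    chosen, spent = [], 0.0
    for roi, ev, cand in scored:
        if ev > 0 and spent + cand["cost"] <= budget:
            chosen.append(cand)
            spent += cand["cost"]
    return chosen, spent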

This chart turns a single posterior draw into an actionable weekly plan: bars encode sampled ROI, the whiskers show how fragile that bet still is, and the hover points call out the hours the knapsack filled for each channel.

[Figure: Weekly budgeted allocation (sampled ROI view) — sampled ROI (value ÷ cost) by channel; weekly budget: 50h, exploration floor: 11.0h.]

An explicit “exploration budget”

Guarantee learning by reserving a floor $u_t \in [0,1]$ of $B_t$ for exploration: pick some arms by information gain rather than ROI.

A simple proxy is posterior entropy on stage-1 success:

$$H_a = -\int_0^1 \mathrm{Beta}(p \mid \alpha_{1,a}, \beta_{1,a}) \,\log \mathrm{Beta}(p \mid \alpha_{1,a}, \beta_{1,a}) \, dp,$$

or use the delta between the current entropy and the expected entropy after one more pull. Allocate the exploration slice to the highest $H_a / c_a$ candidates.
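A sketch of the exploration slice under these assumptions: stage-1 Beta posteriors per arm and SciPy’s differential entropy as the uncertainty score (it can be negative for concentrated posteriors, which is fine for ranking). The arm schema is illustrative.

# Fill the exploration floor by posterior entropy per unit cost (illustrative schema).
from scipy.stats import beta as beta_dist

def exploration_picks(arms, floor_budget):
    """
    arms         : list of dicts with "alpha", "beta" (stage-1 posterior) and "cost"
    floor_budget : the reserved exploration spend u_t * B_t
    """
    ranked = sorted(
        arms,
        key=lambda a: beta_dist(a["alpha"], a["beta"]).entropy() / a["cost"],
        reverse=True,                      # most uncertain per dollar/hour first
    )
    chosen, spent = [], 0.0
    for arm in ranked:
        if spent + arm["cost"] <= floor_budget:
            chosen.append(arm)
            spent += arm["cost"]
    return chosen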

How the allocation evolves

The mix never stays static: as posteriors tighten, high-uncertainty channels earn less exploration budget while proven ROI channels soak up more of the weekly hours. The simulated views below walk through 16 planning cycles to show how a Thompson Sampling desk gradually tilts toward the channels it has validated, while still defending an exploration floor. The first chart tracks total hours by channel, the second shows where exploitation time lands, and the third isolates the exploration slice.

Example — "Team Atlas" weekly stand-up. Week 1 the model leans exploratory: 42% of the 50-hour budget (≈21 hours) is earmarked for experiments, letting founder communities run a 24-hour sprint and events hold 10 hours despite thin data. By week 6 the newfound IOI data for founders kicks in; exploration falls to 12 hours, brokers climb to 15½, and events slip under 9 hours as their sampled ROI softens. Week 9 is the inflection point: events have been demoted to ~5 hours, content falls near 1 hour, and founders jump above 26 hours as the desk re-allocates proven payoff. Week 16 keeps only ~4 hours of exploration alive—almost all directed at cold outbound—while founders and brokers dominate exploitation. The exploit panel makes the rebalancing obvious, and the exploration panel tells the team where next week’s learning budget is going to land.

[Figure: Total sourcing hours by channel (16-week Thompson plan) — allocated hours per planning week, stacked by channel.]

Zoomed out, this stacked area immediately surfaces the policy’s story: who owns the bulk of execution time in any given week, how quickly brokers and founders take over, and when the exploitation budget stabilizes.

[Figure: Exploit hours by channel per planning week.]

The exploit view strips exploration away so you can reason about pure production capacity—helpful when coaching individual owners or checking load against downstream bandwidth.

[Figure: Exploration hours by channel per planning week, with the exploration floor shown.]

Finally, the exploration panel shows whether the information budget is diversified or dominated by one uncertain channel, and how fast the floor shrinks as posteriors firm up.


Offline policy evaluation (before you ship)

You already have logs: tuples $(x_i, a_i, y_i, c_i, p_b(a_i \mid x_i))$, where $p_b$ is the historical (behavior) policy that chose $a_i$.

We want $\mathbb{E}[R(\pi)]$ for a new policy $\pi$ without re-running the past.

Inverse Propensity Scoring (IPS)

Single-pull case:

$$\widehat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} \frac{\mathbf{1}\{\pi(x_i)=a_i\}\, r_i}{p_b(a_i \mid x_i)}, \qquad r_i = \text{reward}(y_i, c_i).$$

Self-normalized IPS:

$$\widehat{V}_{\mathrm{SNIPS}}(\pi) = \frac{\sum_i w_i r_i}{\sum_i w_i}, \qquad w_i = \frac{\mathbf{1}\{\pi(x_i)=a_i\}}{p_b(a_i \mid x_i)}.$$

Doubly robust (safer under model or logging noise)

With a reward model $\hat r(x,a)$,

$$\widehat{V}_{\mathrm{DR}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} \left[ \hat r(x_i, \pi(x_i)) + \frac{\mathbf{1}\{\pi(x_i)=a_i\}}{p_b(a_i \mid x_i)} \big(r_i - \hat r(x_i, a_i)\big) \right].$$
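The three estimators reduce to a few lines. A minimal sketch, assuming logs are dicts with keys `x`, `a`, `r`, `p_b`, a deterministic target policy `pi(x)`, and a fitted reward model `r_hat(x, a)` supplied from elsewhere; the log schema is an assumption for illustration.

# Off-policy value estimates from logged tuples (IPS, SNIPS, doubly robust).
import numpy as np

def ips(logs, pi):
    w = np.array([float(pi(l["x"]) == l["a"]) / l["p_b"] for l in logs])
    r = np.array([l["r"] for l in logs])
    return float(np.mean(w * r))

def snips(logs, pi):
    w = np.array([float(pi(l["x"]) == l["a"]) / l["p_b"] for l in logs])
    r = np.array([l["r"] for l in logs])
    return float(np.sum(w * r) / np.sum(w))

def doubly_robust(logs, pi, r_hat):
    vals = []
    for l in logs:
        direct = r_hat(l["x"], pi(l["x"]))              # model term
        w = float(pi(l["x"]) == l["a"]) / l["p_b"]      # importance weight
        vals.append(direct + w * (l["r"] - r_hat(l["x"], l["a"])))
    return float(np.mean(vals))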

Replay with budgets

For a knapsack $\sum_a c_a \le B$ per round, simulate rounds in order, let $\pi$ pick feasible sets under the logged contexts, accept only the items actually taken in history ($a_i$), accrue their rewards with IPS/DR weights, and enforce the budget. This yields a conservative lower-bound estimate.
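A minimal replay sketch under those rules, IPS-weighted for brevity; the id-keyed `taken` map and the `pi_select` signature are assumptions made for illustration, and the doubly-robust correction above slots in where the weight is applied.

# Budget-constrained replay for one round (conservative: only historical pulls accrue).
def replay_round(candidates, taken, pi_select, budget):
    """
    candidates : logged candidate pulls for the round, e.g. [{"id", "x", "c"}, ...]
    taken      : {candidate id: (reward, logged propensity)} for pulls history made
    pi_select  : function(candidates, budget) -> list of candidate ids chosen by pi
    """
    chosen = pi_select(candidates, budget)
    cost_by_id = {c["id"]: c["c"] for c in candidates}
    value, spent = 0.0, 0.0
    for cid in chosen:
        if cid in taken and spent + cost_by_id[cid] <= budget:
            r, p_b = taken[cid]
            value += r / p_b               # IPS weight on the realized reward
            spent += cost_by_id[cid]
    return value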

Replay comparisons matter before you ship. The DR curve lets you read, at a glance, how quickly Thompson Sampling walks away from baseline policies and how much variance to expect round by round.

[Figure: Offline replay — cumulative doubly-robust reward estimate by replay round for Baseline, ε-greedy, UCB, and Thompson Sampling.]

Pitfalls called out on the figure: variance blow-ups when $p_b$ is tiny, covariate shift between history and now, and violations of SUTVA (channels interact).


What to ship each week (turn the math into a report)

“Channel Allocation” one-pager:

  1. Budget & split. $B_t$ with an exploration floor $u_t B_t$.

  2. Allocation table. For each channel × sector × region: planned pulls, cost, expected meetings/IOIs/LOIs, and 95% credible bands.

  3. Top-K ROI picks. The list that maximizes sampled net value per cost.

  4. Exploration slot. Arms chosen by information gain with a sentence on why they are uncertain.

  5. Calibration tiles.

    • Avg log predictive density:

      $$\frac{1}{K}\sum_{i=1}^{K}\log p_i(y_i).$$
    • Brier for “meeting set” and for “LOI” (a minimal sketch of both calibration metrics follows this list):

      $$\frac{1}{K}\sum_{i=1}^{K} (p_i - y_i)^2.$$
  6. Guardrails. Coverage by sector/region, broker-fatigue limit, email volume cap.
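Both calibration tiles are one-liners once the posterior-predictive probabilities and realized outcomes are in hand; this sketch assumes Bernoulli outcomes, so the log predictive density expands to $y \log p + (1-y)\log(1-p)$.

# Calibration tiles: average log predictive density and Brier score (Bernoulli outcomes).
import numpy as np

def avg_log_predictive_density(p, y, eps=1e-12):
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    return float(np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def brier(p, y):
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean((p - y) ** 2))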

Stage-by-stage visibility exposes where the learning budget is buying clarity: steep drop-offs or wide bands on IOI/LOI keep the experimentation flame lit, while tight funnels invite heavier exploitation.

[Figure: Funnel conversion with credible bands — Meeting, IOI, LOI, and Close probability by channel.]

Risks and how to monitor them

  • Concept drift (seasonality, macro, broker behavior). Use a discounted posterior or dynamic coefficients:

    $$\beta_{s,t} = \beta_{s,t-1} + \eta_{s,t}, \qquad \eta_{s,t} \sim \mathcal{N}(0, Q_s).$$

    Monitor the rolling log score; when it drops, increase $u_t$ temporarily. (A minimal discounted-posterior sketch follows this list.)

  • Strategic response. Brokers change behavior when they sense selection. Rate-limit per counterparty and include features that capture “recency of ask.”

  • Selection bias & missing counterfactuals. DR estimators and randomized exploration slots protect against path dependence.

  • Fairness / coverage. Add soft constraints: each week, require minimal mass in emerging sectors or regions.

  • Overfitting thin features. Hierarchical priors on per-channel intercepts and sector pools curb variance.
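One lightweight way to implement the discounted posterior from the drift bullet is exponential forgetting on the Beta counts; the weekly cadence and the forgetting factor below are assumptions, and the random-walk-on-coefficients version belongs in the state-space setup of Appendix A.

# Discounted Beta posterior: decay old evidence toward a weak base prior, then add
# this period's outcome. gamma close to 1 forgets slowly; lower it when drift bites.
def discounted_update(alpha, beta, y, gamma=0.98, alpha0=1.0, beta0=1.0):
    alpha = gamma * alpha + (1 - gamma) * alpha0 + y
    beta = gamma * beta + (1 - gamma) * beta0 + (1 - y)
    return alpha, beta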

Posterior predictive checks keep the model honest. Matching histograms mean your simulated pipeline resembles reality; divergence flags the features or stages that need richer priors or fresh data.



Appendix A — Minimal PyMC sketch (contextual logistic, one stage)

This toy fits meeting probability with a hierarchical intercept by channel and a shared slope on features. Extend the same pattern to IOI/LOI/Close or fit them jointly.

# PyMC 5+ sketch (illustrative; adapt to your data schema)
import pymc as pm
import pytensor.tensor as pt  # PyMC 5 uses PyTensor (the successor to Aesara)
import numpy as np

# Inputs
# X: (n_samples, p) contextual features (standardized)
# ch_idx: (n_samples,) integer channel index in [0, n_channels)
# y: (n_samples,) 0/1 for "meeting set"

n_channels = int(ch_idx.max() + 1)
p = X.shape[1]

with pm.Model() as m:
    # Hyperpriors
    mu_alpha = pm.Normal("mu_alpha", 0.0, 1.0)
    sigma_alpha = pm.HalfNormal("sigma_alpha", 1.0)

    # Channel intercepts (partial pooling)
    alpha_raw = pm.Normal("alpha_raw", 0.0, 1.0, shape=n_channels)
    alpha = pm.Deterministic("alpha", mu_alpha + alpha_raw * sigma_alpha)

    # Shared slopes on features
    beta = pm.Normal("beta", 0.0, 1.0, shape=p)

    # Linear predictor and logistic link
    eta = alpha[ch_idx] + pt.dot(X, beta)
    p_meet = pm.Deterministic("p_meet", pm.math.sigmoid(eta))

    # Likelihood
    y_obs = pm.Bernoulli("y_obs", p=p_meet, observed=y)

    # Sample posterior
    idata = pm.sample(2000, tune=1000, target_accept=0.9, chains=4, random_seed=42)

# Thompson Sampling step (per decision round):
# 1) draw one posterior sample (alpha*, beta*)
# 2) compute p_meet for candidate contexts
# 3) if modeling deeper stages, draw their betas too and multiply stage probabilities
# 4) compute sampled EVs = V * product(p_stage) - cost
# 5) greedily fill the budget by EV/cost ratio

Notes.

  • Add region/sector intercepts with another level of pooling if helpful.
  • For deeper stages, either (i) fit separate models and multiply predictions, or (ii) model stages jointly with a multinomial/sequential likelihood.
  • To handle drift, let $(\mu_\alpha, \beta)$ follow a random walk in time and use a state-space model (e.g., pm.AR or a custom evolution).

Appendix B — KPI dashboard sketch (what tiles to show)

  1. Allocation vs Budget. Stacked bars for chosen channels; target $B_t$ line.

    • Why: the policy is visible and auditable.
  2. Funnel by Channel. Meeting/IOI/LOI means with 80% bands; dots for last week’s realized rates.

    • Why: uncertainty is part of planning, not an afterthought.
  3. Off-Policy Score. DR estimate of cumulative reward vs Baseline with a confidence ribbon.

    • Why: proof the policy wins in expectation before a full go-live.
  4. Exploration Spend. Line for $u_t B_t$ and realized exploration cost, with a 4-week moving average.

    • Why: assures learning does not get starved.
  5. Drift Monitor. Rolling log score for stage-1; alert when it falls below a threshold.

    • Why: when the world shifts, the policy should react.
  6. Coverage Heatmap. Channel × Sector × Region counts vs targets.

    • Why: prevents blind spots and broker fatigue.

Closing

Treat origination as a budgeted Bayesian bandit:

  • Model stage probabilities with partial pooling.
  • Sample a plausible world; rank by sampled ROI; fill the budget.
  • Keep an exploration floor tied to information gain.
  • Score offline with DR estimators; monitor drift online.
  • Ship a short, credible weekly allocation that compounds learning.

That is how IOIs and LOIs climb while your network expands: explore on purpose, exploit with proof.