Sat, Jan 10, 2026

Tool Calling Is Just Function Composition

With uncertainty: (g \circ f)(x) = g(f(x)) becomes \mathbb{E}[g(Y) \mid Y \sim f(x)]
Tags: agents, functional programming, monads, mcp, ooda, uncertainty, composition, unix, rust

In my previous post on agents, I framed intelligence as control under uncertainty: an agent maximizing expected return over a POMDP. That framing is correct, but it treats tool calling as a black box — "the action space gets bigger."

This post zooms in on the structure of tool calling itself. The punchline:

Tool calling is function composition. The "with uncertainty" part is what makes it a monad.

If you've written Haskell or Rust, this will feel familiar. If you haven't, don't worry — we'll build up from pure functions and see why the monadic structure emerges naturally when things can fail.


Pure composition: the happy path

In functional programming, composition is the bread and butter:

(g \circ f)(x) = g(f(x))

You pipe outputs to inputs. Types line up. The world is clean.

A three-tool pipeline looks like:

h = T_3 \circ T_2 \circ T_1

If each T_i: X_i \to X_{i+1}, then h: X_1 \to X_4. Simple.

[Diagram: x → T₁ → y₁ → T₂ → y₂ → T₃ → y₃]
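In Rust (used throughout this post), the happy path really is just nested calls. A minimal sketch — t1, t2, t3 are made-up stand-ins for tools with compatible types:

```rust
// Pure composition: outputs feed inputs, no failure modes yet.
fn t1(x: i64) -> i64 { x + 1 }            // T1: X1 -> X2
fn t2(y: i64) -> i64 { y * 2 }            // T2: X2 -> X3
fn t3(z: i64) -> String { z.to_string() } // T3: X3 -> X4

// h = T3 ∘ T2 ∘ T1
fn h(x: i64) -> String {
    t3(t2(t1(x)))
}

fn main() {
    // (4 + 1) * 2 = 10, then stringified
    println!("h(4) = {}", h(4)); // prints "h(4) = 10"
}
```

The types line up by construction; the compiler rejects any pipeline where they don't.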

This is the mental model most people have of agent pipelines. "Call tool A, get result, call tool B, get result, ..."

But real tools don't work this way. They fail. They time out. They return garbage. They cost money. The structure of composition survives, but wrapped in uncertainty.


The Unix philosophy: small tools, big pipes

Before we get to monads, let's look at an older precedent. The Unix philosophy says: write programs that do one thing well, and connect them via pipes.

cat server.log | grep "ERROR" | cut -d' ' -f3 | sort | uniq -c | sort -rn | head -10

This pipeline:

  1. cat — reads the file (one thing)
  2. grep — filters lines (one thing)
  3. cut — extracts fields (one thing)
  4. sort — sorts lines (one thing)
  5. uniq -c — counts duplicates (one thing)
  6. sort -rn — sorts numerically descending (one thing)
  7. head — takes first N (one thing)

Each tool is tiny and composable. The pipe | is the composition operator. The shell handles the plumbing — buffering, process management, signal handling.

[Diagram: server.log → cat → grep ERROR → cut -d' ' -f3 → sort → uniq -c → sort -rn → head -10 → top 10 errors, each stage piping stdout into the next]

Sound familiar? MCP tools are the same pattern:

  • Each tool does one thing well (query database, send email, execute code)
  • The agent orchestrator is the shell
  • Tool calls are connected via the agent's reasoning

The difference: Unix pipes are (mostly) deterministic. Tool calls can fail, time out, or return unexpected results. We need machinery to handle that.


When tools fail: Enter the monad

Real tool calling is:

T_i: X_i \to \text{Either}(\text{Error}, Y_i)

The tool might succeed and return Y_i, or fail and return an error. Now composition breaks — you can't just feed an Either into a function expecting a raw value.

The fix is monadic bind (>>= in Haskell, and_then in Rust, flatMap in Scala, .then() in JS promises):

(\gg\!\!=) : \text{Either}(E, A) \to (A \to \text{Either}(E, B)) \to \text{Either}(E, B)

In words: "If the first computation succeeded, unwrap the value and feed it to the next computation. If it failed, propagate the error."

Haskell

pipeline :: Input -> Either Error Output
pipeline x = tool1 x >>= tool2 >>= tool3

-- The >>= ("bind") operator handles the plumbing:
-- if tool1 fails, short-circuit
-- if tool1 succeeds, feed result to tool2

Rust

Rust makes this beautiful with the ? operator. Result<T, E> is Rust's Either:

fn pipeline(input: Input) -> Result<Output, Error> {
    let a = tool1(input)?;   // Early return on Err
    let b = tool2(a)?;       // Early return on Err
    let c = tool3(b)?;       // Early return on Err
    Ok(c)
}

The ? is syntactic sugar for "if this is Err, return early; if it's Ok, unwrap and continue." It's monadic bind with better ergonomics.

Rust also has Option<T> for "might not exist" (like Haskell's Maybe):

fn find_tool_config(name: &str) -> Option<Config> {
    let registry = load_registry()?;      // None propagates
    let entry = registry.get(name)?;      // None propagates
    let config = entry.parse_config()?;   // None propagates
    Some(config)
}

The ? works on both Result and Option. Same pattern, different error types.

The pattern

[Diagram: x → T₁ → T₂ → T₃ → Success along the Ok/Right track; each Tᵢ can branch to Error e on the Err/Left track, short-circuiting the rest of the pipeline]

This is exactly what MCP tool calling does under the hood. Each tool call returns a result or an error. The agent (or framework) decides whether to continue, retry, or bail.

The monad laws (left identity, right identity, associativity) guarantee that composition is well-behaved — you can refactor pipelines without changing semantics. This is why functional programmers care about monads: they're a disciplined way to handle effects.
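The associativity law can be checked directly on Result's and_then. A small sketch — f, g, and the error type here are illustrative, not from any real tool:

```rust
// Associativity for Result's and_then (monadic bind):
//   (m.and_then(f)).and_then(g)  ==  m.and_then(|x| f(x).and_then(g))
// This is what lets you re-group a pipeline without changing behavior.

fn f(x: i32) -> Result<i32, String> {
    if x >= 0 { Ok(x + 1) } else { Err("negative".into()) }
}

fn g(x: i32) -> Result<i32, String> {
    if x < 100 { Ok(x * 2) } else { Err("too big".into()) }
}

fn assoc_holds(m: Result<i32, String>) -> bool {
    let left = m.clone().and_then(f).and_then(g);
    let right = m.and_then(|x| f(x).and_then(g));
    left == right
}

fn main() {
    // Holds for success, failure-producing, and already-failed inputs.
    assert!(assoc_holds(Ok(5)));
    assert!(assoc_holds(Ok(-3)));
    assert!(assoc_holds(Err("upstream".into())));
    println!("associativity holds on sampled inputs");
}
```

This is only a spot check on a few inputs, of course — the law itself holds for all of them.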


Reliability compounds multiplicatively

Here's where it gets quantitative. If each tool T_i succeeds with probability p_i, and failures are independent, the pipeline success probability is:

P(\text{pipeline success}) = \prod_{i=1}^{n} p_i

This is brutal. Five tools at 95% reliability each:

0.95^5 \approx 0.77

You've lost 23% of your runs to failures somewhere in the chain.
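Both the multiplicative product and the retry correction 1 - (1 - p)^{k+1} are one-liners. A quick sketch:

```rust
// Pipeline reliability is the product of per-tool success probabilities.
fn pipeline_reliability(ps: &[f64]) -> f64 {
    ps.iter().product()
}

// With k retries per tool, a stage succeeds unless all k+1 attempts fail.
fn with_retries(p: f64, k: u32) -> f64 {
    1.0 - (1.0 - p).powi(k as i32 + 1)
}

fn main() {
    let ps = [0.95; 5];
    println!("5 tools at 95%: {:.3}", pipeline_reliability(&ps)); // ≈ 0.774

    // Two retries lift each stage to 1 - 0.05^3 = 0.999875,
    // pulling the pipeline back above 99%.
    let boosted: Vec<f64> = ps.iter().map(|&p| with_retries(p, 2)).collect();
    println!("with 2 retries each: {:.4}", pipeline_reliability(&boosted));
}
```

Retries buy back reliability, but each failed attempt still costs latency and money — the interactive below makes that tradeoff concrete.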

Interactive — Pipeline composition

[Interactive widget: pipeline simulator with controls for tool count, base reliability, decay, and retries, reporting simulated and theoretical success rates and E[V]]

Each tool in the pipeline has reliability p_i = p_0 (1 - \text{decay})^i. The theoretical pipeline reliability is the product: P(\text{success}) = \prod_i p_i. With k retries, each tool succeeds with probability 1 - (1 - p_i)^{k+1}.

Play with the parameters above. Notice how:

  • More tools → lower success rate (multiplicative decay)
  • Retries help, but cost money and time
  • Reliability decay along the pipeline (later tools less reliable) shifts failure mass toward the end
Composition as funnel
[Chart: Pipeline Reliability Funnel — cumulative P(reach) per tool T₁…T₅, theoretical vs. observed, with individual reliability bars]

The dashed line shows theoretical cumulative reliability: P(\text{reach } T_k) = \prod_{i \le k} p_i. The solid line shows observed reach from Monte Carlo. Each bar shows individual tool reliability.

The funnel chart shows cumulative reach: what fraction of runs make it to each stage. The theoretical line assumes independence; the observed line comes from Monte Carlo simulation.


Uncertainty propagation: where do failures cluster?

The monadic view gives us structure; Monte Carlo gives us numbers. Let's look at the distribution of outcomes.

[Charts: total-latency histogram (success vs. failure), latency-vs-cost scatter, and a breakdown of where failures occur by tool, with latency percentiles, mean cost, and expected value]

E[V] = P(\text{success}) \times 100 - P(\text{fail}) \times 20 - E[\text{cost}]

Failures cluster early when reliability decays along the pipeline. The latency distribution shows how variance compounds through composition. Cost includes retries — more retries means higher cost on failure paths.

Key observations:

  • Latency is right-skewed — retries add mass to the tail
  • Failures cluster early when reliability decays along the pipeline
  • Cost correlates with latency — failed attempts still cost money
  • Expected value captures the tradeoff: P(success) × reward − P(fail) × penalty − E[cost]
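That expected-value formula is easy to make concrete. A sketch using the post's reward of 100 and penalty of 20 (the success probability and cost figures below are illustrative):

```rust
// E[V] = P(success) * reward - P(fail) * penalty - E[cost]
fn expected_value(p_success: f64, reward: f64, penalty: f64, expected_cost: f64) -> f64 {
    p_success * reward - (1.0 - p_success) * penalty - expected_cost
}

fn main() {
    // Five 95% tools with no retries: P(success) ≈ 0.774.
    let ev = expected_value(0.774, 100.0, 20.0, 5.0);
    println!("E[V] = {:.1}", ev); // 77.4 - 4.52 - 5.0 ≈ 67.9

    // Retries raise P(success) but also E[cost]; whether they pay off
    // depends on where you sit on this curve.
    let ev_retry = expected_value(0.999, 100.0, 20.0, 8.0);
    println!("E[V] with retries = {:.1}", ev_retry);
}
```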

This is the same variance decomposition logic from hierarchical Bayes: total variance = sum of component variances, but here the components are pipeline stages rather than manager/fund/deal levels.


Shrinkage for tool reliability: a worked example

Here's where the hierarchical Bayes connection gets concrete.

Say you're running an agent that calls tools from three vendors: vendor_A, vendor_B, vendor_C. Each vendor provides multiple tools. You've observed some success/failure data:

Tool      Vendor   Calls   Successes   Raw rate
A.query   A        50      47          94%
A.write   A        12      11          92%
A.delete  A        3       3           100%
B.query   B        200     186         93%
B.write   B        80      71          89%
C.query   C        8       6           75%

The problem: Should you trust that A.delete is 100% reliable? That C.query is only 75%?

No. The sample sizes are tiny. A.delete has 3 observations — it could easily fail 10% of the time and you just got lucky. C.query might be fine; 6/8 is within normal variance of a 90% tool.

The fix: shrinkage. Instead of using raw rates, pool information hierarchically:

\hat{p}_{\text{tool}} = w \cdot \bar{p}_{\text{raw}} + (1 - w) \cdot \hat{p}_{\text{vendor}}

where the weight w depends on sample size:

w = \frac{n / \sigma^2}{n / \sigma^2 + 1 / \tau^2}

  • n = number of observations for this tool
  • \sigma^2 = observation noise (binomial variance)
  • \tau^2 = variance across tools within a vendor

Small n → small w → shrink toward vendor mean. Large n → large w → trust the data.
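The estimator fits in a few lines. Two assumptions in this sketch: τ² = 0.002 is a made-up hyperparameter (in practice you'd estimate it from the spread of tools within each vendor), and the binomial variance is evaluated at the vendor mean so an extreme raw rate like 3/3 = 100% doesn't zero it out:

```rust
// Shrinkage estimate: blend the raw rate with the vendor mean,
// weighting by sample size. tau2 is an assumed hyperparameter.
fn shrink(raw_rate: f64, n: f64, vendor_mean: f64, tau2: f64) -> f64 {
    // Per-observation binomial variance, evaluated at the vendor mean
    // to keep it nonzero for 0% / 100% raw rates.
    let sigma2 = vendor_mean * (1.0 - vendor_mean);
    let w = (n / sigma2) / (n / sigma2 + 1.0 / tau2);
    w * raw_rate + (1.0 - w) * vendor_mean
}

fn main() {
    let tau2 = 0.002; // assumed across-tool variance within a vendor

    // A.delete: 3/3 raw, vendor A pooled mean ~94% -> pulled back to ~95%
    println!("A.delete -> {:.3}", shrink(1.00, 3.0, 0.94, tau2));
    // C.query: 6/8 raw, vendor C pooled mean ~85% -> pulled up most of the way
    println!("C.query  -> {:.3}", shrink(0.75, 8.0, 0.85, tau2));
    // B.query: 186/200 raw -> barely moves from 93%
    println!("B.query  -> {:.3}", shrink(0.93, 200.0, 0.92, tau2));
}
```

The exact shrunken values depend on τ²; the qualitative behavior — tiny samples snap to the vendor mean, large samples stand on their own — does not.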

For our example, if vendor A's pooled rate is ~94% and vendor C's is ~85%:

Tool       Raw rate   Shrunken estimate   Why
A.delete   100%       ~95%                Shrink toward vendor A mean (n=3 is tiny)
C.query    75%        ~82%                Shrink toward vendor C mean (n=8 is small)
B.query    93%        ~93%                Barely shrinks (n=200 is plenty)

This is the same math from Borrowing Predictive Strength, but applied to tool reliability instead of fund returns. The hierarchy is:

p_{\text{tool}} \sim \text{Beta}(\alpha_{\text{vendor}}, \beta_{\text{vendor}}), \quad (\alpha_{\text{vendor}}, \beta_{\text{vendor}}) \sim \text{prior from all vendors}

The agent that uses shrunken estimates will make better decisions than one that trusts raw rates. It won't over-rely on tools with suspiciously high rates from tiny samples, and it won't abandon tools that had a few bad runs.


The OODA loop: why faster feedback wins

John Boyd was a fighter pilot and military strategist. His key insight: the side that cycles through Observe-Orient-Decide-Act faster wins, even with worse individual components.

[Diagram: Observe → Orient → Decide → Act, with a feedback arrow from Act back to Observe]

Phase     What happens                      Agent equivalent
Observe   Gather data from environment      API calls, sensor reads, user input
Orient    Update mental model of reality    LLM processes context, updates beliefs
Decide    Choose action from options        Policy selects tool + arguments
Act       Execute the decision              MCP tool call

The loop repeats. Each iteration updates your model and takes action. Faster loops = more iterations = better adaptation.
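As a skeleton, the loop is four functions and a mutable belief. Everything in this sketch — the Observation/Belief types, the exponential-smoothing update, the 0.3 commit threshold — is an illustrative placeholder, not a real agent API:

```rust
struct Observation(f64);
struct Belief(f64);
enum Action { Probe, Commit }

// Observe: read whatever the environment exposes.
fn observe(env_state: f64) -> Observation { Observation(env_state) }

// Orient: update the world model. Exponential smoothing stands in
// for "LLM processes context, updates beliefs".
fn orient(belief: Belief, obs: Observation) -> Belief {
    Belief(0.7 * belief.0 + 0.3 * obs.0)
}

// Decide: commit once belief crosses an (arbitrary) threshold.
fn decide(belief: &Belief) -> Action {
    if belief.0 > 0.3 { Action::Commit } else { Action::Probe }
}

// Act: probing leaves the environment alone; committing moves it.
fn act(action: Action, env_state: f64) -> f64 {
    match action {
        Action::Probe => env_state,
        Action::Commit => env_state + 0.1,
    }
}

fn main() {
    let mut env = 0.4;
    let mut belief = Belief(0.0);
    for cycle in 0..5 {
        let obs = observe(env);
        belief = orient(belief, obs);
        env = act(decide(&belief), env);
        println!("cycle {cycle}: belief = {:.3}, env = {:.3}", belief.0, env);
    }
}
```

Each pass through the loop both refines the belief and (sometimes) changes the world — which is why cycle time, not any single phase, dominates.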

Why this matters for engineering organizations

Boyd's insight applies beyond dogfights. Consider two engineering teams:

Team Slow (monthly deploys):

  • Observe: collect metrics monthly
  • Orient: analyze in quarterly reviews
  • Decide: plan features for next quarter
  • Act: deploy once a month
  • Cycle time: ~90 days

Team Fast (continuous deployment):

  • Observe: real-time monitoring, feature flags
  • Orient: daily standups, instant dashboards
  • Decide: small batch decisions, A/B tests
  • Act: deploy multiple times per day
  • Cycle time: ~1 day

Team Fast runs 90× more OODA cycles per quarter. They:

  • Detect problems faster (shorter observe latency)
  • Update understanding faster (shorter orient latency)
  • Course-correct faster (shorter decide-act latency)
  • Learn faster (more iterations through the loop)

This is why CI/CD wins. It's why feature flags beat big-bang releases. It's why startups can outmaneuver incumbents: they're operating inside the incumbent's OODA loop.

For AI agents, the same logic applies

An agent with:

  • Faster inference → more OODA cycles per task
  • Better observation tools → lower observe noise
  • Better world model → lower orient noise
  • Better policy → lower decide noise
  • More reliable tools → lower act noise

The compound effect is huge. An agent running 10 OODA cycles with 80% per-cycle accuracy outperforms one running 2 cycles with 95% accuracy:

P(\text{converge to good state}) = 1 - (1 - p)^{\text{cycles}}

10 cycles at 80%: 1 - 0.2^{10} > 99.99\%

2 cycles at 95%: 1 - 0.05^2 = 99.75\%

More iterations beat higher per-iteration accuracy. Speed compounds.
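The arithmetic in code, under the simplifying assumption that each cycle is an independent chance to reach a good state:

```rust
// P(converge) = 1 - (1 - p)^cycles: probability that at least one of
// n independent corrective cycles succeeds.
fn p_converge(p: f64, cycles: u32) -> f64 {
    1.0 - (1.0 - p).powi(cycles as i32)
}

fn main() {
    let fast = p_converge(0.80, 10); // many mediocre cycles
    let slow = p_converge(0.95, 2);  // few accurate cycles
    println!("10 cycles @ 80%: {:.6}", fast);
    println!(" 2 cycles @ 95%: {:.6}", slow);
    assert!(fast > slow); // speed wins
}
```

Real cycles aren't independent — errors carry over — but the direction of the comparison survives that caveat.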

OODA Loop: Latency + Uncertainty

[Interactive widget: per-phase latency and uncertainty (σ) controls, reporting cycle time and total σ]

Faster loops enable quicker adaptation. Lower uncertainty means more reliable state estimates.

[Charts: latency by phase and cumulative uncertainty through Observe → Orient → Decide → Act]

Observe: Gather sensor/API data. Orient: Update world model (LLM inference). Decide: Select action (policy evaluation). Act: Execute tool call. Uncertainty compounds through the loop: \sigma_{\text{total}}^2 \approx \sigma_O^2 + (1 + \sigma_O^2)(\sigma_R^2 + \ldots).

The uncertainty compounding formula is:

\sigma_{\text{total}}^2 \approx \sigma_O^2 + (1 + \sigma_O^2)\left[\sigma_R^2 + (1 + \sigma_R^2)\left[\sigma_D^2 + (1 + \sigma_D^2)\sigma_A^2\right]\right]

This is not simple addition — later stages amplify earlier uncertainty because they operate on corrupted inputs. A 10% error in observation can become a 30% error in action after passing through a noisy world model and policy.

The cure: faster loops let you correct errors before they compound too far. Each new observation partially resets the error accumulation.
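The nested formula folds neatly from the innermost stage outward. A sketch with illustrative stage variances:

```rust
// Uncertainty compounding through Observe/Orient/Decide/Act:
//   total = s1 + (1 + s1) * (s2 + (1 + s2) * (...))
// Folding from the last stage inward reproduces the nesting.
fn compound(vars: &[f64]) -> f64 {
    vars.iter().rev().fold(0.0, |inner, &s| s + (1.0 + s) * inner)
}

fn main() {
    // sigma^2 for Observe, Orient, Decide, Act (illustrative values)
    let stage_vars = [0.01, 0.04, 0.01, 0.02];
    let naive: f64 = stage_vars.iter().sum();
    let total = compound(&stage_vars);
    println!("naive sum of variances: {:.4}", naive);
    println!("compounded total:       {:.4}", total);
    assert!(total > naive); // later stages amplify earlier noise
}
```

With small variances the cross terms are minor; as any stage gets noisy, the amplification factor (1 + σ²) starts to bite.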


Connecting the threads

Let's tie this back to the POMDP framing. There we had:

a_t \sim \pi(\cdot \mid h_t), \quad h_t = (o_1, a_1, \ldots, o_t)

Now we can be more precise about what "select action" means when a_t is a tool call:

  1. Action selection is choosing which tool k to call with which arguments x_k
  2. Execution is the stochastic map T_k: x_k \mapsto \text{Result}(\text{Error}, y_k)
  3. Composition is chaining multiple tool calls via monadic bind (Rust's ?, Haskell's >>=)
  4. Uncertainty propagates through the chain multiplicatively

The agent's job is to choose which composition to attempt, given:

  • Estimated reliability of each tool (use shrinkage!)
  • Latency and cost constraints
  • Value of success vs. cost of failure
  • How many OODA cycles it can afford

The Lisp connection

In Cyborg Lisps, I wrote about embedded Lisps in host languages — Clojure on the JVM, Hy on Python, Fennel on Lua. The appeal is metaprogramming: code that writes code, macros that transform syntax.

Tool calling has the same flavor. A tool schema (MCP's typed interface) is like a function signature. A tool call is like a function application. An agent orchestrator is like a macro system that generates and executes tool-calling code at runtime.

The difference is uncertainty. Macros expand deterministically (or fail to compile). Tool calls succeed probabilistically. The monadic wrapper handles what macros can't: runtime failure, retry logic, fallback strategies.

You could imagine a language where tool calls are first-class and composition is syntactically supported:

// Hypothetical Rust-like agent DSL
let result = tool1(x)?
    .retry(3)
    .timeout(Duration::from_secs(5))
    .fallback(|| default_value)
    .and_then(tool2)?
    .and_then(tool3)?;

Some agent frameworks are converging on this. The functional programming community got here decades ago with IO, Either, Maybe. Rust brought it to systems programming with Result and Option. Agents are rediscovering the same abstractions.


Takeaways

  1. Tool calling is function composition with failure modes — Unix pipes, Haskell's >>=, Rust's ? are all the same pattern
  2. Monads (Result, Option, Either, Maybe) are the right abstraction for handling effects and failures
  3. Reliability compounds multiplicatively — five 95% tools give you 77% end-to-end
  4. Use shrinkage to estimate tool reliability — don't trust raw rates from small samples
  5. OODA loops explain why faster feedback wins — more cycles beat higher per-cycle accuracy
  6. Engineering orgs with faster OODA loops (CI/CD, feature flags) learn faster than slow-cycle competitors
  7. Uncertainty compounds through the loop, but faster iterations let you correct before errors snowball

The agent revolution isn't inventing new math. It's applying old math — control theory, functional programming, Bayesian inference, Unix philosophy — to a new substrate: LLMs connected to tools via typed protocols.

The math doesn't care whether you're composing Haskell functions, Rust futures, Unix pipes, or MCP tool calls. It's the same diagram, the same laws, the same failure modes. That's what makes it beautiful.


Further reading