Sat, Jan 10, 2026

Tool Calling Is Just Function Composition

With uncertainty: (g \circ f)(x) = g(f(x)) becomes \mathbb{E}[g(Y) \mid Y \sim f(x)]
Tags: agents, functional programming, monads, mcp, ooda, uncertainty, composition, unix, rust

In my previous post on agents, I framed intelligence as control under uncertainty: an agent maximizing expected return over a POMDP. That framing is correct, but it treats tool calling as a black box — "the action space gets bigger."

This post zooms in on the structure of tool calling itself. The punchline:

Tool calling is function composition. The "with uncertainty" part is what makes it a monad.

If you've written Haskell or Rust, this will feel familiar. If you haven't, don't worry — we'll build up from pure functions and see why the monadic structure emerges naturally when things can fail.


Pure composition: the happy path

In functional programming, composition is the bread and butter:

(g \circ f)(x) = g(f(x))

You pipe outputs to inputs. Types line up. The world is clean.

A three-tool pipeline looks like:

h = T_3 \circ T_2 \circ T_1

If each T_i: X_i \to X_{i+1}, then h: X_1 \to X_4. Simple.

[Diagram: x → T₁ → y₁ → T₂ → y₂ → T₃ → y₃]
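In Rust (used throughout this post), the happy path really is just nested calls. A minimal sketch — t1, t2, t3 are made-up stand-ins for tools with compatible types:

```rust
// Pure composition: outputs feed inputs, no failure modes yet.
fn t1(x: i64) -> i64 { x + 1 }            // T1: X1 -> X2
fn t2(y: i64) -> i64 { y * 2 }            // T2: X2 -> X3
fn t3(z: i64) -> String { z.to_string() } // T3: X3 -> X4

// h = T3 ∘ T2 ∘ T1
fn h(x: i64) -> String {
    t3(t2(t1(x)))
}

fn main() {
    // (4 + 1) * 2 = 10, then stringified
    println!("h(4) = {}", h(4)); // prints "h(4) = 10"
}
```

The types line up by construction; the compiler rejects any pipeline where they don't.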

This is the mental model most people have of agent pipelines. "Call tool A, get result, call tool B, get result, ..."

But real tools don't work this way. They fail. They time out. They return garbage. They cost money. The structure of composition survives, but wrapped in uncertainty.


The Unix philosophy: small tools, big pipes

Before we get to monads, let's look at an older precedent. The Unix philosophy says: write programs that do one thing well, and connect them via pipes.

cat server.log | grep "ERROR" | cut -d' ' -f3 | sort | uniq -c | sort -rn | head -10

This pipeline:

  1. cat — reads the file (one thing)
  2. grep — filters lines (one thing)
  3. cut — extracts fields (one thing)
  4. sort — sorts lines (one thing)
  5. uniq -c — counts duplicates (one thing)
  6. sort -rn — sorts numerically descending (one thing)
  7. head — takes first N (one thing)

Each tool is tiny and composable. The pipe | is the composition operator. The shell handles the plumbing — buffering, process management, signal handling.

[Diagram: server.log → cat → grep ERROR → cut -d' ' -f3 → sort → uniq -c → sort -rn → head -10 → top 10 errors, each stage piping stdout into the next]

Sound familiar? MCP tools are the same pattern:

  • Each tool does one thing well (query database, send email, execute code)
  • The agent orchestrator is the shell
  • Tool calls are connected via the agent's reasoning

The difference: Unix pipes are (mostly) deterministic. Tool calls can fail, time out, or return unexpected results. We need machinery to handle that.


When tools fail: Enter the monad

Real tool calling is:

T_i: X_i \to \text{Either}(\text{Error}, Y_i)

The tool might succeed and return Y_i, or fail and return an error. Now composition breaks — you can't just feed an Either into a function expecting a raw value.

The fix is monadic bind (>>= in Haskell, and_then in Rust, flatMap in Scala, .then() in JS promises):

(\gg\!\!=) : \text{Either}(E, A) \to (A \to \text{Either}(E, B)) \to \text{Either}(E, B)

In words: "If the first computation succeeded, unwrap the value and feed it to the next computation. If it failed, propagate the error."

Haskell

pipeline :: Input -> Either Error Output
pipeline x = tool1 x >>= tool2 >>= tool3

-- The >>= ("bind") operator handles the plumbing:
-- if tool1 fails, short-circuit
-- if tool1 succeeds, feed result to tool2

Rust

Rust makes this beautiful with the ? operator. Result<T, E> is Rust's Either:

fn pipeline(input: Input) -> Result<Output, Error> {
    let a = tool1(input)?;   // Early return on Err
    let b = tool2(a)?;       // Early return on Err
    let c = tool3(b)?;       // Early return on Err
    Ok(c)
}

The ? is syntactic sugar for "if this is Err, return early; if it's Ok, unwrap and continue." It's monadic bind with better ergonomics.

Rust also has Option<T> for "might not exist" (like Haskell's Maybe):

fn find_tool_config(name: &str) -> Option<Config> {
    let registry = load_registry()?;      // None propagates
    let entry = registry.get(name)?;      // None propagates
    let config = entry.parse_config()?;   // None propagates
    Some(config)
}

The ? works on both Result and Option. Same pattern, different error types.

The pattern

[Diagram: x → T₁ → T₂ → T₃ → Success along the Ok/Right track; each Tᵢ can branch to Error e on the Err/Left track, short-circuiting the rest of the pipeline]

This is exactly what MCP tool calling does under the hood. Each tool call returns a result or an error. The agent (or framework) decides whether to continue, retry, or bail.

The monad laws (left identity, right identity, associativity) guarantee that composition is well-behaved — you can refactor pipelines without changing semantics. This is why functional programmers care about monads: they're a disciplined way to handle effects.
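The associativity law can be checked directly on Result's and_then. A small sketch — f, g, and the error type here are illustrative, not from any real tool:

```rust
// Associativity for Result's and_then (monadic bind):
//   (m.and_then(f)).and_then(g)  ==  m.and_then(|x| f(x).and_then(g))
// This is what lets you re-group a pipeline without changing behavior.

fn f(x: i32) -> Result<i32, String> {
    if x >= 0 { Ok(x + 1) } else { Err("negative".into()) }
}

fn g(x: i32) -> Result<i32, String> {
    if x < 100 { Ok(x * 2) } else { Err("too big".into()) }
}

fn assoc_holds(m: Result<i32, String>) -> bool {
    let left = m.clone().and_then(f).and_then(g);
    let right = m.and_then(|x| f(x).and_then(g));
    left == right
}

fn main() {
    // Holds for success, failure-producing, and already-failed inputs.
    assert!(assoc_holds(Ok(5)));
    assert!(assoc_holds(Ok(-3)));
    assert!(assoc_holds(Err("upstream".into())));
    println!("associativity holds on sampled inputs");
}
```

This is only a spot check on a few inputs, of course — the law itself holds for all of them.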


Reliability compounds multiplicatively

Here's where it gets quantitative. If each tool T_i succeeds with probability p_i, and failures are independent, the pipeline success probability is:

P(\text{pipeline success}) = \prod_{i=1}^{n} p_i

This is brutal. Five tools at 95% reliability each:

0.95^5 \approx 0.77

You've lost 23% of your runs to failures somewhere in the chain.
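Both the multiplicative product and the retry correction 1 - (1 - p)^{k+1} are one-liners. A quick sketch:

```rust
// Pipeline reliability is the product of per-tool success probabilities.
fn pipeline_reliability(ps: &[f64]) -> f64 {
    ps.iter().product()
}

// With k retries per tool, a stage succeeds unless all k+1 attempts fail.
fn with_retries(p: f64, k: u32) -> f64 {
    1.0 - (1.0 - p).powi(k as i32 + 1)
}

fn main() {
    let ps = [0.95; 5];
    println!("5 tools at 95%: {:.3}", pipeline_reliability(&ps)); // ≈ 0.774

    // Two retries lift each stage to 1 - 0.05^3 = 0.999875,
    // pulling the pipeline back above 99%.
    let boosted: Vec<f64> = ps.iter().map(|&p| with_retries(p, 2)).collect();
    println!("with 2 retries each: {:.4}", pipeline_reliability(&boosted));
}
```

Retries buy back reliability, but each failed attempt still costs latency and money — the interactive below makes that tradeoff concrete.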

Interactive — Pipeline composition

[Interactive widget: pipeline simulator with controls for tool count, base reliability, decay, and retries, reporting simulated and theoretical success rates and E[V]]

Each tool in the pipeline has reliability p_i = p_0 (1 - \text{decay})^i. The theoretical pipeline reliability is the product: P(\text{success}) = \prod_i p_i. With k retries, each tool succeeds with probability 1 - (1 - p_i)^{k+1}.

Play with the parameters above. Notice how:

  • More tools → lower success rate (multiplicative decay)
  • Retries help, but cost money and time
  • Reliability decay along the pipeline (later tools less reliable) shifts failure mass toward the end
Composition as funnel
[Chart: Pipeline Reliability Funnel — cumulative P(reach) per tool T₁…T₅, theoretical vs. observed, with individual reliability bars]

The dashed line shows theoretical cumulative reliability: P(\text{reach } T_k) = \prod_{i \le k} p_i. The solid line shows observed reach from Monte Carlo. Each bar shows individual tool reliability.

The funnel chart shows cumulative reach: what fraction of runs make it to each stage. The theoretical line assumes independence; the observed line comes from Monte Carlo simulation.


Uncertainty propagation: where do failures cluster?

The monadic view gives us structure; Monte Carlo gives us numbers. Let's look at the distribution of outcomes.

[Charts: total-latency histogram (success vs. failure), latency-vs-cost scatter, and a breakdown of where failures occur by tool, with latency percentiles, mean cost, and expected value]

E[V] = P(\text{success}) \times 100 - P(\text{fail}) \times 20 - E[\text{cost}]

Failures cluster early when reliability decays along the pipeline. The latency distribution shows how variance compounds through composition. Cost includes retries — more retries means higher cost on failure paths.

Key observations:

  • Latency is right-skewed — retries add mass to the tail
  • Failures cluster early when reliability decays along the pipeline
  • Cost correlates with latency — failed attempts still cost money
  • Expected value captures the tradeoff: P(success) × reward − P(fail) × penalty − E[cost]
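That expected-value formula is easy to make concrete. A sketch using the post's reward of 100 and penalty of 20 (the success probability and cost figures below are illustrative):

```rust
// E[V] = P(success) * reward - P(fail) * penalty - E[cost]
fn expected_value(p_success: f64, reward: f64, penalty: f64, expected_cost: f64) -> f64 {
    p_success * reward - (1.0 - p_success) * penalty - expected_cost
}

fn main() {
    // Five 95% tools with no retries: P(success) ≈ 0.774.
    let ev = expected_value(0.774, 100.0, 20.0, 5.0);
    println!("E[V] = {:.1}", ev); // 77.4 - 4.52 - 5.0 ≈ 67.9

    // Retries raise P(success) but also E[cost]; whether they pay off
    // depends on where you sit on this curve.
    let ev_retry = expected_value(0.999, 100.0, 20.0, 8.0);
    println!("E[V] with retries = {:.1}", ev_retry);
}
```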

This is the same variance decomposition logic from hierarchical Bayes: total variance = sum of component variances, but here the components are pipeline stages rather than manager/fund/deal levels.


Shrinkage for tool reliability: a worked example

Here's where the hierarchical Bayes connection gets concrete.

Say you're running an agent that calls tools from three vendors: vendor_A, vendor_B, vendor_C. Each vendor provides multiple tools. You've observed some success/failure data:

Tool      Vendor   Calls   Successes   Raw rate
A.query   A        50      47          94%
A.write   A        12      11          92%
A.delete  A        3       3           100%
B.query   B        200     186         93%
B.write   B        80      71          89%
C.query   C        8       6           75%

The problem: Should you trust that A.delete is 100% reliable? That C.query is only 75%?

No. The sample sizes are tiny. A.delete has 3 observations — it could easily fail 10% of the time and you just got lucky. C.query might be fine; 6/8 is within normal variance of a 90% tool.

The fix: shrinkage. Instead of using raw rates, pool information hierarchically:

\hat{p}_{\text{tool}} = w \cdot \bar{p}_{\text{raw}} + (1 - w) \cdot \hat{p}_{\text{vendor}}

where the weight w depends on sample size:

w = \frac{n / \sigma^2}{n / \sigma^2 + 1 / \tau^2}

  • n = number of observations for this tool
  • \sigma^2 = observation noise (binomial variance)
  • \tau^2 = variance across tools within a vendor

Small n → small w → shrink toward vendor mean. Large n → large w → trust the data.
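The estimator fits in a few lines. Two assumptions in this sketch: τ² = 0.002 is a made-up hyperparameter (in practice you'd estimate it from the spread of tools within each vendor), and the binomial variance is evaluated at the vendor mean so an extreme raw rate like 3/3 = 100% doesn't zero it out:

```rust
// Shrinkage estimate: blend the raw rate with the vendor mean,
// weighting by sample size. tau2 is an assumed hyperparameter.
fn shrink(raw_rate: f64, n: f64, vendor_mean: f64, tau2: f64) -> f64 {
    // Per-observation binomial variance, evaluated at the vendor mean
    // to keep it nonzero for 0% / 100% raw rates.
    let sigma2 = vendor_mean * (1.0 - vendor_mean);
    let w = (n / sigma2) / (n / sigma2 + 1.0 / tau2);
    w * raw_rate + (1.0 - w) * vendor_mean
}

fn main() {
    let tau2 = 0.002; // assumed across-tool variance within a vendor

    // A.delete: 3/3 raw, vendor A pooled mean ~94% -> pulled back to ~95%
    println!("A.delete -> {:.3}", shrink(1.00, 3.0, 0.94, tau2));
    // C.query: 6/8 raw, vendor C pooled mean ~85% -> pulled up most of the way
    println!("C.query  -> {:.3}", shrink(0.75, 8.0, 0.85, tau2));
    // B.query: 186/200 raw -> barely moves from 93%
    println!("B.query  -> {:.3}", shrink(0.93, 200.0, 0.92, tau2));
}
```

The exact shrunken values depend on τ²; the qualitative behavior — tiny samples snap to the vendor mean, large samples stand on their own — does not.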

For our example, if vendor A's pooled rate is ~94% and vendor C's is ~85%:

Tool       Raw rate   Shrunken estimate   Why
A.delete   100%       ~95%                Shrink toward vendor A mean (n=3 is tiny)
C.query    75%        ~82%                Shrink toward vendor C mean (n=8 is small)
B.query    93%        ~93%                Barely shrinks (n=200 is plenty)

This is the same math from Borrowing Predictive Strength, but applied to tool reliability instead of fund returns. The hierarchy is:

p_{\text{tool}} \sim \text{Beta}(\alpha_{\text{vendor}}, \beta_{\text{vendor}}), \quad (\alpha_{\text{vendor}}, \beta_{\text{vendor}}) \sim \text{prior from all vendors}

The agent that uses shrunken estimates will make better decisions than one that trusts raw rates. It won't over-rely on tools with suspiciously high rates from tiny samples, and it won't abandon tools that had a few bad runs.


The OODA loop: why faster feedback wins

John Boyd was a fighter pilot and military strategist. His key insight: the side that cycles through Observe-Orient-Decide-Act faster wins, even with worse individual components.

[Diagram: Observe → Orient → Decide → Act, with a feedback arrow from Act back to Observe]

Phase     What happens                      Agent equivalent
Observe   Gather data from environment      API calls, sensor reads, user input
Orient    Update mental model of reality    LLM processes context, updates beliefs
Decide    Choose action from options        Policy selects tool + arguments
Act       Execute the decision              MCP tool call

The loop repeats. Each iteration updates your model and takes action. Faster loops = more iterations = better adaptation.
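As a skeleton, the loop is four functions and a mutable belief. Everything in this sketch — the Observation/Belief types, the exponential-smoothing update, the 0.3 commit threshold — is an illustrative placeholder, not a real agent API:

```rust
struct Observation(f64);
struct Belief(f64);
enum Action { Probe, Commit }

// Observe: read whatever the environment exposes.
fn observe(env_state: f64) -> Observation { Observation(env_state) }

// Orient: update the world model. Exponential smoothing stands in
// for "LLM processes context, updates beliefs".
fn orient(belief: Belief, obs: Observation) -> Belief {
    Belief(0.7 * belief.0 + 0.3 * obs.0)
}

// Decide: commit once belief crosses an (arbitrary) threshold.
fn decide(belief: &Belief) -> Action {
    if belief.0 > 0.3 { Action::Commit } else { Action::Probe }
}

// Act: probing leaves the environment alone; committing moves it.
fn act(action: Action, env_state: f64) -> f64 {
    match action {
        Action::Probe => env_state,
        Action::Commit => env_state + 0.1,
    }
}

fn main() {
    let mut env = 0.4;
    let mut belief = Belief(0.0);
    for cycle in 0..5 {
        let obs = observe(env);
        belief = orient(belief, obs);
        env = act(decide(&belief), env);
        println!("cycle {cycle}: belief = {:.3}, env = {:.3}", belief.0, env);
    }
}
```

Each pass through the loop both refines the belief and (sometimes) changes the world — which is why cycle time, not any single phase, dominates.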

Why this matters for engineering organizations

Boyd's insight applies beyond dogfights. Consider two engineering teams:

Team Slow (monthly deploys):

  • Observe: collect metrics monthly
  • Orient: analyze in quarterly reviews
  • Decide: plan features for next quarter
  • Act: deploy once a month
  • Cycle time: ~90 days

Team Fast (continuous deployment):

  • Observe: real-time monitoring, feature flags
  • Orient: daily standups, instant dashboards
  • Decide: small batch decisions, A/B tests
  • Act: deploy multiple times per day
  • Cycle time: ~1 day

Team Fast runs 90× more OODA cycles per quarter. They:

  • Detect problems faster (shorter observe latency)
  • Update understanding faster (shorter orient latency)
  • Course-correct faster (shorter decide-act latency)
  • Learn faster (more iterations through the loop)

This is why CI/CD wins. It's why feature flags beat big-bang releases. It's why startups can outmaneuver incumbents: they're operating inside the incumbent's OODA loop.

For AI agents, the same logic applies

An agent with:

  • Faster inference → more OODA cycles per task
  • Better observation tools → lower observe noise
  • Better world model → lower orient noise
  • Better policy → lower decide noise
  • More reliable tools → lower act noise

The compound effect is huge. An agent running 10 OODA cycles with 80% per-cycle accuracy outperforms one running 2 cycles with 95% accuracy:

P(\text{converge to good state}) = 1 - (1 - p)^{\text{cycles}}

10 cycles at 80%: 1 - 0.2^{10} > 99.99\%

2 cycles at 95%: 1 - 0.05^2 = 99.75\%

More iterations beat higher per-iteration accuracy. Speed compounds.
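The arithmetic in code, under the simplifying assumption that each cycle is an independent chance to reach a good state:

```rust
// P(converge) = 1 - (1 - p)^cycles: probability that at least one of
// n independent corrective cycles succeeds.
fn p_converge(p: f64, cycles: u32) -> f64 {
    1.0 - (1.0 - p).powi(cycles as i32)
}

fn main() {
    let fast = p_converge(0.80, 10); // many mediocre cycles
    let slow = p_converge(0.95, 2);  // few accurate cycles
    println!("10 cycles @ 80%: {:.6}", fast);
    println!(" 2 cycles @ 95%: {:.6}", slow);
    assert!(fast > slow); // speed wins
}
```

Real cycles aren't independent — errors carry over — but the direction of the comparison survives that caveat.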

OODA Loop: Latency + Uncertainty

[Interactive widget: per-phase latency and uncertainty (σ) controls, reporting cycle time and total σ]

Faster loops enable quicker adaptation. Lower uncertainty means more reliable state estimates.

[Charts: latency by phase and cumulative uncertainty through Observe → Orient → Decide → Act]

Observe: Gather sensor/API data. Orient: Update world model (LLM inference). Decide: Select action (policy evaluation). Act: Execute tool call. Uncertainty compounds through the loop: \sigma_{\text{total}}^2 \approx \sigma_O^2 + (1 + \sigma_O^2)(\sigma_R^2 + \ldots).

The uncertainty compounding formula is:

\sigma_{\text{total}}^2 \approx \sigma_O^2 + (1 + \sigma_O^2)\left[\sigma_R^2 + (1 + \sigma_R^2)\left[\sigma_D^2 + (1 + \sigma_D^2)\sigma_A^2\right]\right]

This is not simple addition — later stages amplify earlier uncertainty because they operate on corrupted inputs. A 10% error in observation can become a 30% error in action after passing through a noisy world model and policy.

The cure: faster loops let you correct errors before they compound too far. Each new observation partially resets the error accumulation.
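The nested formula folds neatly from the innermost stage outward. A sketch with illustrative stage variances:

```rust
// Uncertainty compounding through Observe/Orient/Decide/Act:
//   total = s1 + (1 + s1) * (s2 + (1 + s2) * (...))
// Folding from the last stage inward reproduces the nesting.
fn compound(vars: &[f64]) -> f64 {
    vars.iter().rev().fold(0.0, |inner, &s| s + (1.0 + s) * inner)
}

fn main() {
    // sigma^2 for Observe, Orient, Decide, Act (illustrative values)
    let stage_vars = [0.01, 0.04, 0.01, 0.02];
    let naive: f64 = stage_vars.iter().sum();
    let total = compound(&stage_vars);
    println!("naive sum of variances: {:.4}", naive);
    println!("compounded total:       {:.4}", total);
    assert!(total > naive); // later stages amplify earlier noise
}
```

With small variances the cross terms are minor; as any stage gets noisy, the amplification factor (1 + σ²) starts to bite.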


Connecting the threads

Let's tie this back to the POMDP framing. There we had:

a_t \sim \pi(\cdot \mid h_t), \quad h_t = (o_1, a_1, \ldots, o_t)

Now we can be more precise about what "select action" means when a_t is a tool call:

  1. Action selection is choosing which tool k to call with which arguments x_k
  2. Execution is the stochastic map T_k: x_k \mapsto \text{Result}(\text{Error}, y_k)
  3. Composition is chaining multiple tool calls via monadic bind (Rust's ?, Haskell's >>=)
  4. Uncertainty propagates through the chain multiplicatively

The agent's job is to choose which composition to attempt, given:

  • Estimated reliability of each tool (use shrinkage!)
  • Latency and cost constraints
  • Value of success vs. cost of failure
  • How many OODA cycles it can afford

The Lisp connection

In Cyborg Lisps, I wrote about embedded Lisps in host languages — Clojure on the JVM, Hy on Python, Fennel on Lua. The appeal is metaprogramming: code that writes code, macros that transform syntax.

Tool calling has the same flavor. A tool schema (MCP's typed interface) is like a function signature. A tool call is like a function application. An agent orchestrator is like a macro system that generates and executes tool-calling code at runtime.

The difference is uncertainty. Macros expand deterministically (or fail to compile). Tool calls succeed probabilistically. The monadic wrapper handles what macros can't: runtime failure, retry logic, fallback strategies.

You could imagine a language where tool calls are first-class and composition is syntactically supported:

// Hypothetical Rust-like agent DSL
let result = tool1(x)?
    .retry(3)
    .timeout(Duration::from_secs(5))
    .fallback(|| default_value)
    .and_then(tool2)?
    .and_then(tool3)?;

Some agent frameworks are converging on this. The functional programming community got here decades ago with IO, Either, Maybe. Rust brought it to systems programming with Result and Option. Agents are rediscovering the same abstractions.


Takeaways

  1. Tool calling is function composition with failure modes — Unix pipes, Haskell's >>=, Rust's ? are all the same pattern
  2. Monads (Result, Option, Either, Maybe) are the right abstraction for handling effects and failures
  3. Reliability compounds multiplicatively — five 95% tools give you 77% end-to-end
  4. Use shrinkage to estimate tool reliability — don't trust raw rates from small samples
  5. OODA loops explain why faster feedback wins — more cycles beat higher per-cycle accuracy
  6. Engineering orgs with faster OODA loops (CI/CD, feature flags) learn faster than slow-cycle competitors
  7. Uncertainty compounds through the loop, but faster iterations let you correct before errors snowball

The agent revolution isn't inventing new math. It's applying old math — control theory, functional programming, Bayesian inference, Unix philosophy — to a new substrate: LLMs connected to tools via typed protocols.

The math doesn't care whether you're composing Haskell functions, Rust futures, Unix pipes, or MCP tool calls. It's the same diagram, the same laws, the same failure modes. That's what makes it beautiful.


Further reading