Friday, January 16, 2026

Calibration: Are Your Probabilities Honest?

Reliability diagrams, Brier scores, and the sharpness-calibration tradeoff
Tags: bayesian statistics, calibration, prediction, decision-making, ml

When a weather forecaster says "70% chance of rain," we expect rain about 70% of the time they make that prediction. When a PE deal model outputs "60% probability of close," we expect 60% of such deals to actually close. This correspondence between stated probability and observed frequency is calibration.

A calibrated forecaster is honest: their probabilities mean what they say. An overconfident forecaster claims 90% when the true rate is 70%. An underconfident one hedges toward 50% even when outcomes are predictable. Both are miscalibrated—their stated probabilities don't match reality.

This post covers:

  1. Reliability diagrams — visualizing calibration gaps
  2. Proper scoring rules — why Brier score and log loss incentivize honesty
  3. Brier decomposition — separating calibration from discrimination
  4. Recalibration methods — Platt scaling and isotonic regression
  5. The sharpness-calibration tradeoff — why both matter for decisions

If you've used shrinkage estimators (James-Stein, hierarchical models), calibration is the natural next question: once you have a point estimate, how confident should you be?


What Calibration Means

Formally, a probabilistic forecast p is calibrated if:

\mathbb{P}(Y = 1 \mid p = x) = x \quad \forall x \in [0, 1]

Among all the times you predict probability x, the fraction of positives should be x.

Examples of calibration:

  • A 70% prediction should be right 70% of the time
  • A 30% prediction should be right 30% of the time
  • A 95% prediction should be wrong 1 in 20 times

Examples of miscalibration:

  • Claiming 90% when you're right 70% of the time → overconfident
  • Claiming 50% when you're right 80% of the time → underconfident
  • High predictions being too high, low predictions being too low → systematic overconfidence

The last pattern is most common in ML models and human forecasters alike. We tend to be overconfident at both extremes.
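
To make this concrete, here is a minimal simulation of an overconfident forecaster (a numpy sketch; the shrinkage factor 0.6 is an arbitrary choice for illustration): the reported probabilities are more extreme than the rates at which events actually occur.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# An overconfident forecaster: reports p, but the event actually occurs with a
# probability pulled back toward 50% (shrinkage factor 0.6 is arbitrary).
reported = rng.choice([0.1, 0.3, 0.5, 0.7, 0.9], size=n)
true_prob = 0.5 + 0.6 * (reported - 0.5)
outcomes = rng.binomial(1, true_prob)

for p in np.unique(reported):
    hit_rate = outcomes[reported == p].mean()
    print(f"reported {p:.0%} -> observed {hit_rate:.1%}")
# reported 90% -> observed ~74%: the signature of overconfidence
```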


The Reliability Diagram

The reliability diagram (or calibration curve) is the primary tool for assessing calibration. It plots:

  • X-axis: Mean predicted probability in each bin
  • Y-axis: Actual fraction of positives in that bin

A perfectly calibrated model lies on the diagonal. Deviations show miscalibration.

[Interactive figure: Reliability Diagram. A perfectly calibrated model lies on the diagonal: when it predicts 70% probability, 70% of those cases should be positive. The gap between the curve and the diagonal shows miscalibration; marker size indicates bin count. The demo lets you vary the number of bins (5, 10, 20) and shows the reliability curve against the diagonal, a histogram of the prediction distribution, and readouts for expected calibration error, maximum calibration error, and sharpness (prediction variance).]


Reading the diagram:

  • Points above the diagonal: Underconfident. The model predicts lower probabilities than realized frequencies.
  • Points below the diagonal: Overconfident. The model predicts higher probabilities than realized frequencies.
  • Marker size: Number of predictions in that bin. Larger markers are more reliable.
  • Error bars: 95% confidence intervals from binomial sampling. Overlap with the diagonal suggests adequate calibration.

The histogram on the right shows the prediction distribution—how often the model makes different confidence levels of predictions. A model that always predicts 50% is calibrated but useless. We want sharpness: predictions spread across the probability range.
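
A sketch of the same plot in Python, assuming scikit-learn and matplotlib; the synthetic overconfident scores here are made up for illustration and are not the post's demo data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Synthetic overconfident model: true rates are less extreme than the scores
rng = np.random.default_rng(1)
y_prob = rng.beta(2, 2, size=5000)
y_true = rng.binomial(1, 0.5 + 0.6 * (y_prob - 0.5))

prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.plot([0, 1], [0, 1], "k--", label="perfect calibration")
ax1.plot(prob_pred, prob_true, "o-", label="model")
ax1.set(xlabel="Mean predicted probability", ylabel="Fraction of positives")
ax1.legend()
ax2.hist(y_prob, bins=20)  # prediction distribution: how sharp is the model?
ax2.set(xlabel="Predicted probability", ylabel="Count")
plt.tight_layout()
plt.show()
```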


Calibration Metrics

Expected Calibration Error (ECE)

The most common summary metric. Bin predictions, compute the gap between mean prediction and actual frequency, weight by bin size:

\text{ECE} = \sum_{b=1}^B \frac{n_b}{n} \cdot |\bar{p}_b - \bar{y}_b|

where \bar{p}_b is the mean predicted probability in bin b and \bar{y}_b is the actual positive rate in that bin.

Interpretation: ECE is the expected absolute difference between confidence and accuracy.

Maximum Calibration Error (MCE)

The worst-case gap across bins:

\text{MCE} = \max_b |\bar{p}_b - \bar{y}_b|

Useful when you care about worst-case miscalibration (e.g., safety-critical applications).
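
Both metrics fall out of the same binning. A minimal implementation with numpy only, using equal-width bins to match the formulas above:

```python
import numpy as np

def ece_mce(y_true, y_prob, n_bins=10):
    """Expected and maximum calibration error over equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(y_prob, edges[1:-1])   # bin index 0..n_bins-1
    gaps, weights = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gaps.append(abs(y_prob[mask].mean() - y_true[mask].mean()))
            weights.append(mask.mean())      # n_b / n
    gaps = np.asarray(gaps)
    return float(np.dot(weights, gaps)), float(gaps.max())

# ece, mce = ece_mce(y_true, y_prob)
```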

Brier Score

A proper scoring rule that directly measures probabilistic accuracy:

\text{Brier} = \frac{1}{n} \sum_{i=1}^n (p_i - y_i)^2

Lower is better. Range: 0 (perfect) to 1 (maximally wrong).

Why proper? A scoring rule is "proper" if the forecaster's best strategy, in expectation, is to report their true beliefs; for a loss like the Brier score, honest reporting minimizes the expected penalty. The Brier score punishes both miscalibration and lack of discrimination. It's the MSE of probability forecasts.
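
A quick check of properness: if the event truly occurs with probability q, the expected Brier penalty for reporting r is q(r - 1)^2 + (1 - q)r^2, which is minimized at r = q. The value q = 0.7 below is just an arbitrary choice for the demo.

```python
import numpy as np

q = 0.7                               # true event probability (assumed for the demo)
r = np.linspace(0, 1, 101)            # candidate reported probabilities
expected_brier = q * (r - 1) ** 2 + (1 - q) * r ** 2
print(r[np.argmin(expected_brier)])   # -> 0.7: honest reporting minimizes expected loss
```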


Brier Score Decomposition

The Brier score decomposes into three interpretable components:

\text{Brier} = \underbrace{\text{Calibration}}_{\text{bad}} - \underbrace{\text{Refinement}}_{\text{good}} + \underbrace{\text{Uncertainty}}_{\text{irreducible}}

  • Calibration (reliability): How close bin means are to actual frequencies. Lower is better.
  • Refinement (resolution): How much bin means vary around the base rate. Higher is better—it means the model separates outcomes.
  • Uncertainty: \bar{y}(1-\bar{y}), the outcome variance determined by the base rate. Fixed by the data.

This decomposition reveals that a model can have a good Brier score by being well-calibrated or by having high resolution. The ideal is both.
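
A sketch of the decomposition from binned predictions (numpy only). Note the identity is exact when each prediction is replaced by its bin mean, so expect a small residual against the raw Brier score.

```python
import numpy as np

def brier_decomposition(y_true, y_prob, n_bins=10):
    """Murphy decomposition: Brier ≈ calibration - refinement + uncertainty."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(y_prob, edges[1:-1])
    base_rate = y_true.mean()
    calibration = refinement = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            w = mask.mean()                                  # n_b / n
            p_bar, y_bar = y_prob[mask].mean(), y_true[mask].mean()
            calibration += w * (p_bar - y_bar) ** 2          # reliability (lower is better)
            refinement += w * (y_bar - base_rate) ** 2       # resolution (higher is better)
    uncertainty = base_rate * (1 - base_rate)
    return calibration, refinement, uncertainty
```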

[Interactive figure: Brier Score Decomposition. The Brier score decomposes into three components: Calibration (error from miscalibrated probabilities), Refinement (ability to separate outcomes), and Uncertainty (irreducible noise from the base rate). The demo lets you vary the miscalibration level and shows a bar chart of the components alongside a Pareto frontier for the sharpness vs. calibration tradeoff, with readouts for the calibration component, refinement component, base rate, and uncertainty.]


The sharpness-calibration tradeoff: A model that always predicts the base rate (e.g., 30% for all cases) is perfectly calibrated but has zero sharpness—no discriminative power. A model with extreme predictions (5% or 95%) is sharp but may be miscalibrated. The Pareto frontier shows achievable combinations.


Recalibration Methods

If a model is miscalibrated, we can fix it post-hoc without retraining. The key insight: miscalibration is usually a systematic distortion of the probabilities, so we can learn the inverse transformation from held-out data.

Platt Scaling

Fit a logistic regression on the model's logits:

p_{\text{calibrated}} = \sigma(a \cdot \text{logit}(p) + b)

where \sigma is the sigmoid and (a, b) are learned from a calibration set.

When it works: When the model's rankings are good but probabilities are systematically biased (too confident or too conservative). The logistic function corrects the bias while preserving ordering.

Limitations: Assumes a specific functional form. May not fix complex miscalibration patterns.
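
A minimal sketch of Platt scaling using scikit-learn's LogisticRegression on the logit of the model's scores. Here p_cal and y_cal stand for a hypothetical held-out calibration set; the classical formulation also smooths the targets, which is omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(p_cal, y_cal, eps=1e-6):
    """Learn (a, b) on a calibration set; return a function mapping p -> calibrated p."""
    def logit(p):
        p = np.clip(p, eps, 1 - eps)
        return np.log(p / (1 - p))
    lr = LogisticRegression(C=1e6)    # large C: effectively unregularized a, b
    lr.fit(logit(p_cal).reshape(-1, 1), y_cal)
    return lambda p: lr.predict_proba(logit(p).reshape(-1, 1))[:, 1]

# calibrate = fit_platt(p_cal, y_cal)
# p_new_calibrated = calibrate(p_new)
```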

Isotonic Regression

Fit a monotonically increasing step function:

p_{\text{calibrated}} = f(p), \quad f \text{ is monotonic}

Uses the pool-adjacent-violators algorithm (PAVA) to find the best monotonic fit.

When it works: When miscalibration is non-parametric (not a simple logistic bias). More flexible than Platt scaling.

Limitations: Can overfit with small calibration sets. The step function may be jagged.
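
The scikit-learn version is a few lines; p_cal, y_cal, and p_new are placeholders for a held-out calibration set and fresh predictions.

```python
from sklearn.isotonic import IsotonicRegression

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(p_cal, y_cal)                  # PAVA: best monotonic fit to the calibration set
p_new_calibrated = iso.predict(p_new)  # piecewise-constant, ranking-preserving mapping
```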

[Interactive figure: Recalibration Methods. Miscalibrated probabilities can be fixed post-hoc: Platt scaling fits a logistic regression on logits; isotonic regression fits a monotonic step function. Both preserve ranking while improving calibration. The demo lets you switch calibration methods and shows the calibration curves before vs. after, the probability transformation, and a metrics comparison (ECE, MCE, Brier score).]

Platt scaling: Fits parameters (a, b) to transform logit(p) → a·logit(p) + b → σ(·). Best when the original model is approximately well-ordered but has a systematic bias (over/underconfidence).

Isotonic regression: Non-parametric, fits a monotonic step function. More flexible but can overfit with small calibration sets. Preserves probability ranking.

Try both methods on the synthetic data. Notice how:

  • Platt scaling applies a smooth S-curve transformation
  • Isotonic regression creates a piecewise-constant mapping
  • Both improve ECE but may have different effects on sharpness

Applications

PE Deal Predictions

When a deal-scoring model outputs 60% probability of close, that should mean 60% of such deals actually close. In practice, PE models are often overconfident:

  • High-conviction deals (80%+ predicted) close at lower rates
  • Pipeline counts get inflated because "80%" doesn't mean what it says

Fix: Calibrate the model on historical close rates. Use Platt scaling if the model's ranking is good but probabilities are systematically off.

This connects to contextual bandits for origination—the reward signal (deal closed) feeds back into the model, and calibrated probabilities are essential for Thompson sampling to work.

Agent Tool Reliability

When an LLM agent says "I'm 90% confident this code is correct," is it actually right 90% of the time? Early studies suggest LLMs are often poorly calibrated—overconfident on hard questions, underconfident on easy ones.

For multi-agent coordination, calibrated confidence matters because:

  • Task allocation should weight by expected success
  • Consensus protocols need to weight votes by reliability
  • Market mechanisms work best when bids reflect true costs

Manager Evaluation

In hierarchical Bayes for PE, we estimate manager skill with posterior distributions. The calibration question: do our 80% credible intervals cover the true value 80% of the time?

If intervals are too narrow (overconfident), we're underestimating uncertainty. If too wide, we're leaving information on the table. Posterior predictive checks assess this.
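
A sketch of that coverage check, assuming posterior_samples is a hypothetical (n_managers, n_draws) array of posterior draws and truth holds the realized values:

```python
import numpy as np

lo = np.percentile(posterior_samples, 10, axis=1)   # 80% central credible interval
hi = np.percentile(posterior_samples, 90, axis=1)
coverage = np.mean((truth >= lo) & (truth <= hi))
print(f"Empirical coverage of 80% intervals: {coverage:.1%}")  # should be near 80%
```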


The Calibration-Sharpness Tradeoff

Both calibration and sharpness matter for decisions:

Calibration without sharpness: "All deals have 30% close rate." True but useless. No information for prioritization.

Sharpness without calibration: "This deal is 95%!" But if 95% predictions only close 60% of the time, you'll over-allocate resources.

The ideal: Sharp and calibrated. Confident when you should be, uncertain when appropriate.

The Brier score captures both. Proper scoring rules incentivize honest, discriminative forecasts. But they don't tell you which problem you have—the decomposition does.


Practical Recommendations

1. Always plot the reliability diagram

Before trusting any probabilistic model, visualize calibration. A model with great AUC can be terribly miscalibrated.

2. Use proper scoring rules

If you're training a model on probability outputs, use log loss or Brier score—not just classification accuracy. Proper scores reward honest probabilities.
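
Both are one-liners in scikit-learn, with y_true and y_prob as before:

```python
from sklearn.metrics import brier_score_loss, log_loss

print(brier_score_loss(y_true, y_prob))  # proper: mean squared error of the probabilities
print(log_loss(y_true, y_prob))          # proper: average negative log-likelihood
```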

3. Hold out a calibration set

Don't calibrate on training data. Use a held-out set to fit Platt/isotonic parameters. Better yet: use cross-validation.
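
One way to get cross-validated calibration without sacrificing a separate holdout is scikit-learn's CalibratedClassifierCV; the gradient-boosting base model and the train/test arrays here are placeholders.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier

# Each fold's calibrator is fit on data the base model did not train on.
model = CalibratedClassifierCV(GradientBoostingClassifier(), method="isotonic", cv=5)
model.fit(X_train, y_train)
p_calibrated = model.predict_proba(X_test)[:, 1]   # method="sigmoid" gives Platt scaling
```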

4. Check calibration by subgroup

A model can be well-calibrated overall but miscalibrated within subgroups (e.g., by deal size, sector, or vintage). Check conditional calibration.
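
A sketch of the subgroup check with pandas, reusing the ece_mce helper from above; the DataFrame and its columns are hypothetical.

```python
import pandas as pd

# df columns: 'sector', 'y_true', 'y_prob'
ece_by_sector = df.groupby("sector").apply(
    lambda g: ece_mce(g["y_true"].to_numpy(), g["y_prob"].to_numpy())[0]
)
print(ece_by_sector.sort_values(ascending=False))  # worst-calibrated subgroups first
```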

5. Recalibrate periodically

Calibration can drift over time if the underlying distribution shifts. Monitor ECE and recalibrate when it degrades.




Further Reading

  • Gneiting & Raftery (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. The theory of proper scoring.
  • Platt (1999). Probabilistic Outputs for Support Vector Machines. The original Platt scaling paper.
  • Zadrozny & Elkan (2002). Transforming Classifier Scores into Accurate Multiclass Probability Estimates. Isotonic regression for calibration.
  • Guo et al. (2017). On Calibration of Modern Neural Networks. Shows that deep networks are often miscalibrated.
  • Niculescu-Mizil & Caruana (2005). Predicting Good Probabilities with Supervised Learning. Empirical comparison of calibration methods.

The punchline: probabilities are only useful if they're honest. A 70% that means 70% enables good decisions. A 70% that means 50% is worse than useless—it's misleading. Calibration is the bridge between statistical models and real-world decision-making. Check it, measure it, fix it.