Friday, January 16, 2026

Calibration: Are Your Probabilities Honest?

Reliability diagrams, Brier scores, and the sharpness-calibration tradeoff
Tags: bayesian statistics, calibration, prediction, decision-making, ml

When a weather forecaster says "70% chance of rain," we expect rain about 70% of the time they make that prediction. When a PE deal model outputs "60% probability of close," we expect 60% of such deals to actually close. This correspondence between stated probability and observed frequency is calibration.

A calibrated forecaster is honest: their probabilities mean what they say. An overconfident forecaster claims 90% when the true rate is 70%. An underconfident one hedges toward 50% even when outcomes are predictable. Both are miscalibrated—their stated probabilities don't match reality.

This post covers:

  1. Reliability diagrams — visualizing calibration gaps
  2. Proper scoring rules — why Brier score and log loss incentivize honesty
  3. Brier decomposition — separating calibration from discrimination
  4. Recalibration methods — Platt scaling and isotonic regression
  5. The sharpness-calibration tradeoff — why both matter for decisions

If you've used shrinkage estimators (James-Stein, hierarchical models), calibration is the natural next question: once you have a point estimate, how confident should you be?


What Calibration Means

Formally, a probabilistic forecast p is calibrated if:

\mathbb{P}(Y = 1 \mid p = x) = x \quad \forall x \in [0, 1]

Among all the times you predict probability x, the fraction of positives should be x.

Examples of calibration:

  • A 70% prediction should be right 70% of the time
  • A 30% prediction should be right 30% of the time
  • A 95% prediction should be wrong 1 in 20 times

Examples of miscalibration:

  • Claiming 90% when you're right 70% of the time → overconfident
  • Claiming 50% when you're right 80% of the time → underconfident
  • High predictions being too high, low predictions being too low → systematic overconfidence

The last pattern is most common in ML models and human forecasters alike. We tend to be overconfident at both extremes.
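
To make this concrete, here is a minimal simulation of an overconfident forecaster (a numpy sketch; the shrinkage factor 0.6 is an arbitrary choice for illustration): the reported probabilities are more extreme than the rates at which events actually occur.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# An overconfident forecaster: reports p, but the event actually occurs with a
# probability pulled back toward 50% (shrinkage factor 0.6 is arbitrary).
reported = rng.choice([0.1, 0.3, 0.5, 0.7, 0.9], size=n)
true_prob = 0.5 + 0.6 * (reported - 0.5)
outcomes = rng.binomial(1, true_prob)

for p in np.unique(reported):
    hit_rate = outcomes[reported == p].mean()
    print(f"reported {p:.0%} -> observed {hit_rate:.1%}")
# reported 90% -> observed ~74%: the signature of overconfidence
```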


The Reliability Diagram

The reliability diagram (or calibration curve) is the primary tool for assessing calibration. It plots:

  • X-axis: Mean predicted probability in each bin
  • Y-axis: Actual fraction of positives in that bin

A perfectly calibrated model lies on the diagonal. Deviations show miscalibration.

[Interactive figure: Reliability Diagram. A perfectly calibrated model lies on the diagonal: when it predicts 70% probability, 70% of those cases should be positive. The gap between the curve and the diagonal shows miscalibration; marker size indicates bin count. The demo lets you vary the number of bins (5, 10, 20) and shows the reliability curve against the diagonal, a histogram of the prediction distribution, and readouts for expected calibration error, maximum calibration error, and sharpness (prediction variance).]


Reading the diagram:

  • Points above the diagonal: Underconfident. The model predicts lower probabilities than realized frequencies.
  • Points below the diagonal: Overconfident. The model predicts higher probabilities than realized frequencies.
  • Marker size: Number of predictions in that bin. Larger markers are more reliable.
  • Error bars: 95% confidence intervals from binomial sampling. Overlap with the diagonal suggests adequate calibration.

The histogram on the right shows the prediction distribution—how often the model makes different confidence levels of predictions. A model that always predicts 50% is calibrated but useless. We want sharpness: predictions spread across the probability range.
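
A sketch of the same plot in Python, assuming scikit-learn and matplotlib; the synthetic overconfident scores here are made up for illustration and are not the post's demo data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Synthetic overconfident model: true rates are less extreme than the scores
rng = np.random.default_rng(1)
y_prob = rng.beta(2, 2, size=5000)
y_true = rng.binomial(1, 0.5 + 0.6 * (y_prob - 0.5))

prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.plot([0, 1], [0, 1], "k--", label="perfect calibration")
ax1.plot(prob_pred, prob_true, "o-", label="model")
ax1.set(xlabel="Mean predicted probability", ylabel="Fraction of positives")
ax1.legend()
ax2.hist(y_prob, bins=20)  # prediction distribution: how sharp is the model?
ax2.set(xlabel="Predicted probability", ylabel="Count")
plt.tight_layout()
plt.show()
```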


Calibration Metrics

Expected Calibration Error (ECE)

The most common summary metric. Bin predictions, compute the gap between mean prediction and actual frequency, weight by bin size:

\text{ECE} = \sum_{b=1}^B \frac{n_b}{n} \cdot |\bar{p}_b - \bar{y}_b|

where \bar{p}_b is the mean predicted probability in bin b and \bar{y}_b is the actual positive rate in that bin.

Interpretation: ECE is the expected absolute difference between confidence and accuracy.

Maximum Calibration Error (MCE)

The worst-case gap across bins:

\text{MCE} = \max_b |\bar{p}_b - \bar{y}_b|

Useful when you care about worst-case miscalibration (e.g., safety-critical applications).
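
Both metrics fall out of the same binning. A minimal implementation with numpy only, using equal-width bins to match the formulas above:

```python
import numpy as np

def ece_mce(y_true, y_prob, n_bins=10):
    """Expected and maximum calibration error over equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(y_prob, edges[1:-1])   # bin index 0..n_bins-1
    gaps, weights = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gaps.append(abs(y_prob[mask].mean() - y_true[mask].mean()))
            weights.append(mask.mean())      # n_b / n
    gaps = np.asarray(gaps)
    return float(np.dot(weights, gaps)), float(gaps.max())

# ece, mce = ece_mce(y_true, y_prob)
```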

Brier Score

A proper scoring rule that directly measures probabilistic accuracy:

\text{Brier} = \frac{1}{n} \sum_{i=1}^n (p_i - y_i)^2

Lower is better. Range: 0 (perfect) to 1 (maximally wrong).

Why proper? A scoring rule is "proper" if the forecaster's best strategy, in expectation, is to report their true beliefs; for a loss like the Brier score, honest reporting minimizes the expected penalty. The Brier score punishes both miscalibration and lack of discrimination. It's the MSE of probability forecasts.
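
A quick check of properness: if the event truly occurs with probability q, the expected Brier penalty for reporting r is q(r - 1)^2 + (1 - q)r^2, which is minimized at r = q. The value q = 0.7 below is just an arbitrary choice for the demo.

```python
import numpy as np

q = 0.7                               # true event probability (assumed for the demo)
r = np.linspace(0, 1, 101)            # candidate reported probabilities
expected_brier = q * (r - 1) ** 2 + (1 - q) * r ** 2
print(r[np.argmin(expected_brier)])   # -> 0.7: honest reporting minimizes expected loss
```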


Brier Score Decomposition

The Brier score decomposes into three interpretable components:

\text{Brier} = \underbrace{\text{Calibration}}_{\text{bad}} - \underbrace{\text{Refinement}}_{\text{good}} + \underbrace{\text{Uncertainty}}_{\text{irreducible}}

  • Calibration (reliability): How close bin means are to actual frequencies. Lower is better.
  • Refinement (resolution): How much bin means vary around the base rate. Higher is better—it means the model separates outcomes.
  • Uncertainty: \bar{y}(1-\bar{y}), the outcome variance determined by the base rate. Fixed by the data.

This decomposition reveals that a model can have a good Brier score by being well-calibrated or by having high resolution. The ideal is both.
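
A sketch of the decomposition from binned predictions (numpy only). Note the identity is exact when each prediction is replaced by its bin mean, so expect a small residual against the raw Brier score.

```python
import numpy as np

def brier_decomposition(y_true, y_prob, n_bins=10):
    """Murphy decomposition: Brier ≈ calibration - refinement + uncertainty."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(y_prob, edges[1:-1])
    base_rate = y_true.mean()
    calibration = refinement = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            w = mask.mean()                                  # n_b / n
            p_bar, y_bar = y_prob[mask].mean(), y_true[mask].mean()
            calibration += w * (p_bar - y_bar) ** 2          # reliability (lower is better)
            refinement += w * (y_bar - base_rate) ** 2       # resolution (higher is better)
    uncertainty = base_rate * (1 - base_rate)
    return calibration, refinement, uncertainty
```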

[Interactive figure: Brier Score Decomposition. The Brier score decomposes into three components: Calibration (error from miscalibrated probabilities), Refinement (ability to separate outcomes), and Uncertainty (irreducible noise from the base rate). The demo lets you vary the miscalibration level and shows a bar chart of the components alongside a Pareto frontier for the sharpness vs. calibration tradeoff, with readouts for the calibration component, refinement component, base rate, and uncertainty.]


The sharpness-calibration tradeoff: A model that always predicts the base rate (e.g., 30% for all cases) is perfectly calibrated but has zero sharpness—no discriminative power. A model with extreme predictions (5% or 95%) is sharp but may be miscalibrated. The Pareto frontier shows achievable combinations.


Recalibration Methods

If a model is miscalibrated, we can fix it post-hoc without retraining. The key insight: miscalibration is usually a systematic distortion of the probabilities, so we can learn the inverse transformation from held-out data.

Platt Scaling

Fit a logistic regression on the model's logits:

p_{\text{calibrated}} = \sigma(a \cdot \text{logit}(p) + b)

where \sigma is the sigmoid and (a, b) are learned from a calibration set.

When it works: When the model's rankings are good but probabilities are systematically biased (too confident or too conservative). The logistic function corrects the bias while preserving ordering.

Limitations: Assumes a specific functional form. May not fix complex miscalibration patterns.
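
A minimal sketch of Platt scaling using scikit-learn's LogisticRegression on the logit of the model's scores. Here p_cal and y_cal stand for a hypothetical held-out calibration set; the classical formulation also smooths the targets, which is omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(p_cal, y_cal, eps=1e-6):
    """Learn (a, b) on a calibration set; return a function mapping p -> calibrated p."""
    def logit(p):
        p = np.clip(p, eps, 1 - eps)
        return np.log(p / (1 - p))
    lr = LogisticRegression(C=1e6)    # large C: effectively unregularized a, b
    lr.fit(logit(p_cal).reshape(-1, 1), y_cal)
    return lambda p: lr.predict_proba(logit(p).reshape(-1, 1))[:, 1]

# calibrate = fit_platt(p_cal, y_cal)
# p_new_calibrated = calibrate(p_new)
```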

Isotonic Regression

Fit a monotonically increasing step function:

p_{\text{calibrated}} = f(p), \quad f \text{ is monotonic}

Uses the pool-adjacent-violators algorithm (PAVA) to find the best monotonic fit.

When it works: When miscalibration is non-parametric (not a simple logistic bias). More flexible than Platt scaling.

Limitations: Can overfit with small calibration sets. The step function may be jagged.
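
The scikit-learn version is a few lines; p_cal, y_cal, and p_new are placeholders for a held-out calibration set and fresh predictions.

```python
from sklearn.isotonic import IsotonicRegression

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(p_cal, y_cal)                  # PAVA: best monotonic fit to the calibration set
p_new_calibrated = iso.predict(p_new)  # piecewise-constant, ranking-preserving mapping
```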

[Interactive figure: Recalibration Methods. Miscalibrated probabilities can be fixed post-hoc: Platt scaling fits a logistic regression on logits; isotonic regression fits a monotonic step function. Both preserve ranking while improving calibration. The demo lets you switch calibration methods and shows the calibration curves before vs. after, the probability transformation, and a metrics comparison (ECE, MCE, Brier score).]

Platt scaling: Fits parameters (a, b) to transform logit(p) → a·logit(p) + b → σ(·). Best when the original model is approximately well-ordered but has a systematic bias (over/underconfidence).

Isotonic regression: Non-parametric, fits a monotonic step function. More flexible but can overfit with small calibration sets. Preserves probability ranking.

Try both methods on the synthetic data. Notice how:

  • Platt scaling applies a smooth S-curve transformation
  • Isotonic regression creates a piecewise-constant mapping
  • Both improve ECE but may have different effects on sharpness

Applications

PE Deal Predictions

When a deal-scoring model outputs 60% probability of close, that should mean 60% of such deals actually close. In practice, PE models are often overconfident:

  • High-conviction deals (80%+ predicted) close at lower rates
  • Pipeline counts get inflated because "80%" doesn't mean what it says

Fix: Calibrate the model on historical close rates. Use Platt scaling if the model's ranking is good but probabilities are systematically off.

This connects to contextual bandits for origination—the reward signal (deal closed) feeds back into the model, and calibrated probabilities are essential for Thompson sampling to work.

Agent Tool Reliability

When an LLM agent says "I'm 90% confident this code is correct," is it actually right 90% of the time? Early studies suggest LLMs are often poorly calibrated—overconfident on hard questions, underconfident on easy ones.

For multi-agent coordination, calibrated confidence matters because:

  • Task allocation should weight by expected success
  • Consensus protocols need to weight votes by reliability
  • Market mechanisms work best when bids reflect true costs

Manager Evaluation

In hierarchical Bayes for PE, we estimate manager skill with posterior distributions. The calibration question: do our 80% credible intervals cover the true value 80% of the time?

If intervals are too narrow (overconfident), we're underestimating uncertainty. If too wide, we're leaving information on the table. Posterior predictive checks assess this.
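
A sketch of that coverage check, assuming posterior_samples is a hypothetical (n_managers, n_draws) array of posterior draws and truth holds the realized values:

```python
import numpy as np

lo = np.percentile(posterior_samples, 10, axis=1)   # 80% central credible interval
hi = np.percentile(posterior_samples, 90, axis=1)
coverage = np.mean((truth >= lo) & (truth <= hi))
print(f"Empirical coverage of 80% intervals: {coverage:.1%}")  # should be near 80%
```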


The Calibration-Sharpness Tradeoff

Both calibration and sharpness matter for decisions:

Calibration without sharpness: "All deals have 30% close rate." True but useless. No information for prioritization.

Sharpness without calibration: "This deal is 95%!" But if 95% predictions only close 60% of the time, you'll over-allocate resources.

The ideal: Sharp and calibrated. Confident when you should be, uncertain when appropriate.

The Brier score captures both. Proper scoring rules incentivize honest, discriminative forecasts. But they don't tell you which problem you have—the decomposition does.


Practical Recommendations

1. Always plot the reliability diagram

Before trusting any probabilistic model, visualize calibration. A model with great AUC can be terribly miscalibrated.

2. Use proper scoring rules

If you're training a model on probability outputs, use log loss or Brier score—not just classification accuracy. Proper scores reward honest probabilities.
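
Both are one-liners in scikit-learn, with y_true and y_prob as before:

```python
from sklearn.metrics import brier_score_loss, log_loss

print(brier_score_loss(y_true, y_prob))  # proper: mean squared error of the probabilities
print(log_loss(y_true, y_prob))          # proper: average negative log-likelihood
```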

3. Hold out a calibration set

Don't calibrate on training data. Use a held-out set to fit Platt/isotonic parameters. Better yet: use cross-validation.
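
One way to get cross-validated calibration without sacrificing a separate holdout is scikit-learn's CalibratedClassifierCV; the gradient-boosting base model and the train/test arrays here are placeholders.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier

# Each fold's calibrator is fit on data the base model did not train on.
model = CalibratedClassifierCV(GradientBoostingClassifier(), method="isotonic", cv=5)
model.fit(X_train, y_train)
p_calibrated = model.predict_proba(X_test)[:, 1]   # method="sigmoid" gives Platt scaling
```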

4. Check calibration by subgroup

A model can be well-calibrated overall but miscalibrated within subgroups (e.g., by deal size, sector, or vintage). Check conditional calibration.
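
A sketch of the subgroup check with pandas, reusing the ece_mce helper from above; the DataFrame and its columns are hypothetical.

```python
import pandas as pd

# df columns: 'sector', 'y_true', 'y_prob'
ece_by_sector = df.groupby("sector").apply(
    lambda g: ece_mce(g["y_true"].to_numpy(), g["y_prob"].to_numpy())[0]
)
print(ece_by_sector.sort_values(ascending=False))  # worst-calibrated subgroups first
```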

5. Recalibrate periodically

Calibration can drift over time if the underlying distribution shifts. Monitor ECE and recalibrate when it degrades.




Further Reading

  • Gneiting & Raftery (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. The theory of proper scoring.
  • Platt (1999). Probabilistic Outputs for Support Vector Machines. The original Platt scaling paper.
  • Zadrozny & Elkan (2002). Transforming Classifier Scores into Accurate Multiclass Probability Estimates. Isotonic regression for calibration.
  • Guo et al. (2017). On Calibration of Modern Neural Networks. Shows that deep networks are often miscalibrated.
  • Niculescu-Mizil & Caruana (2005). Predicting Good Probabilities with Supervised Learning. Empirical comparison of calibration methods.

The punchline: probabilities are only useful if they're honest. A 70% that means 70% enables good decisions. A 70% that means 50% is worse than useless—it's misleading. Calibration is the bridge between statistical models and real-world decision-making. Check it, measure it, fix it.