December 30, 2025

Exploratory Data Analysis

What the data tells us before we start modeling.
EDA · visualization · R

This is the second post in a series on Himalayan mountaineering data. The first post introduced the dataset and its hierarchical structure. Here I'll walk through exploratory data analysis—the part where you get to know the data before fitting models.

I'm a believer in thorough EDA. Not because it's intellectually glamorous, but because it catches problems early. Missing data patterns, outliers, unexpected correlations—these are easier to handle when you've seen them rather than discovered them through mysterious model failures.


Missing Data Patterns

Missing data is usually the first thing I check. Not just "how much is missing," but where and why.

Missing data pattern visualization

A few observations:

  • Route information has substantial missingness. This makes sense—some expeditions take standard routes that don't need annotation; others take novel routes that should be documented but aren't.

  • Date fields (summit date, base camp date) have varying completeness. Expeditions that failed early may not have summit dates recorded at all, which is informative missingness, not random.

  • Deaths are coded as 0 when no deaths occurred, so missing deaths likely means "unknown" rather than "zero." This distinction matters.

The pattern of missingness can be a feature itself. If summit_date is missing, that's highly predictive of failure—you can't have a summit date without summiting. Whether to encode this as an indicator variable depends on your modeling goals.
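
A concrete sketch of that indicator encoding, assuming the expedition table is loaded as a data frame named exped (a placeholder name) with the summit_date and success1 fields discussed in this series:

    # Missingness as a feature: record whether a key field is absent.
    library(dplyr)

    exped <- exped %>%
      mutate(missing_summit_date = as.integer(is.na(summit_date)))

    # Sanity check: how does the indicator line up with the outcome?
    # (Expect near-perfect separation, which is why it may be leakage
    # rather than a legitimate predictor, depending on the modeling goal.)
    exped %>%
      group_by(missing_summit_date) %>%
      summarise(n = n(), success_rate = mean(success1, na.rm = TRUE))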


Temporal Trends

The Himalayan climbing scene has changed dramatically over decades. This dataset covers 2020-2024, a specific window that includes:

  • COVID-19 disruptions (2020-2021)

  • The recovery boom (2022-2024)

Expeditions over time

What about success rates?

Success rate over time

The shaded region shows Wilson confidence intervals—a better choice for proportions than the naive normal approximation, especially near 0 or 1.
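
For reference, the Wilson interval is simple to compute directly, or via prop.test with correct = FALSE, which inverts the same score test; a minimal sketch:

    # Wilson score interval for a binomial proportion.
    wilson_ci <- function(x, n, conf = 0.95) {
      z <- qnorm(1 - (1 - conf) / 2)
      p <- x / n
      denom  <- 1 + z^2 / n
      center <- (p + z^2 / (2 * n)) / denom
      half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
      c(lower = center - half, upper = center + half)
    }

    # Example: a small group with 7 successes out of 21 attempts.
    wilson_ci(7, 21)
    # Cross-check: prop.test(7, 21, correct = FALSE)$conf.int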

Temporal trends matter for modeling. If success rates are improving over time (better gear, better forecasting, more experience), then a model trained on old data may underpredict future success. This is a form of distribution shift that's worth monitoring.


Season Distribution

Spring dominates Himalayan climbing. This isn't arbitrary—it's driven by the jet stream.

Expeditions by season

Most expeditions occur in Spring (pre-monsoon, roughly April-May) when the jet stream briefly lifts off the high peaks. Autumn (post-monsoon, roughly September-October) is the secondary season.

Does season affect success?

Success rate by season
Season    n     Success Rate
Spring    462   76.2%
Autumn    394   66.5%
Winter    21    33.3%
Summer    5     60.0%

The pattern here is interesting. Spring has the highest volume and the highest success rate (76%). Winter attempts are rare (n=21) and risky (33%); the error bars are wide because sample sizes are small.

This is a case where main season vs off-season might be a useful binary encoding. The detailed 4-season breakdown adds complexity without much predictive gain once you've separated Spring/Autumn from Summer/Winter.
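
If you go that route, the encoding is a one-liner (a sketch, assuming a season column holding these four labels in a placeholder data frame exped):

    # Collapse four seasons into a main-season indicator.
    library(dplyr)

    exped <- exped %>%
      mutate(main_season = as.integer(season %in% c("Spring", "Autumn")))

    # Success rate for main season vs off-season.
    exped %>%
      group_by(main_season) %>%
      summarise(n = n(), success_rate = mean(success1, na.rm = TRUE))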


Team Composition

Team size has a non-linear relationship with success.

Team size distribution

The distribution is right-skewed with a long tail. Most teams are small (2-6 members), but some commercial expeditions have 20+ members.

Success by team size
Team Size   n     Success Rate
Solo        55    65.5%
2-3         193   59.6%
4-5         151   58.9%
6-10        236   72.9%
11+         247   85.8%

Solo attempts stand out—but not in the way you might expect. Solo climbers (65.5%) actually have higher success rates than small teams (2-5: ~59%), though lower than large teams (6+: 73-86%). This non-monotonic pattern suggests selection effects: solo climbers are likely elite alpinists attempting peaks they know well, while small teams may include less experienced climbers without the support infrastructure of large commercial expeditions.

This suggests is_solo should be a distinct indicator rather than treating team size as purely continuous—the relationship isn't linear.
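
A sketch of that encoding, using the same bins as the table above (totmembers is the member count; exped is a placeholder data frame name):

    library(dplyr)

    exped <- exped %>%
      mutate(
        is_solo  = as.integer(totmembers == 1),
        team_bin = cut(totmembers,
                       breaks = c(0, 1, 3, 5, 10, Inf),
                       labels = c("Solo", "2-3", "4-5", "6-10", "11+"))
      )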

Staff Ratio

The ratio of hired staff (Sherpas, porters, guides) to team members captures something about expedition resources and style.

Success by staff ratio
Staff Ratio         n     Success Rate
No hired staff      165   42.4%
Low (under 0.5)     98    69.4%
Medium (0.5-1)      209   79.9%
High (1-2)          313   77.6%
Very high (2+)      97    78.4%
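
The ratio and bins are straightforward to construct (a sketch; tothired and totmembers are the hired-staff and member counts, and the cutpoints are my reading of the table's bins):

    library(dplyr)

    exped <- exped %>%
      mutate(
        staff_ratio = tothired / pmax(totmembers, 1),  # avoid division by zero
        staff_bin   = cut(staff_ratio,
                          breaks = c(-Inf, 0, 0.5, 1, 2, Inf),
                          labels = c("No hired staff", "Low (under 0.5)",
                                     "Medium (0.5-1)", "High (1-2)",
                                     "Very high (2+)"))
      )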

Higher staff ratios correlate with better outcomes (42% → 78%). But this is heavily confounded—expeditions with high staff ratios tend to be:

  • Commercial (more resources)
  • On popular routes (established infrastructure)
  • Better funded (can afford more support)

The staff ratio is a proxy for multiple underlying factors. In a predictive model it's useful; for causal inference it's problematic.


Oxygen Use

Supplemental oxygen is one of the strongest univariate predictors of success.

Success by oxygen use
Oxygen Use   n     Success Rate
No O2        473   55.0%
With O2      409   89.0%

The gap is dramatic (+34 pp). But—and I keep emphasizing this—oxygen use is endogenous. It's not randomly assigned. The expeditions that use oxygen are systematically different:

Oxygen use over time

Oxygen use has been trending upward in this window. The commercialization of high-altitude mountaineering brings more clients who prefer (or require) supplemental oxygen.

A proper causal analysis would need to:

  1. Match expeditions on observable confounders (peak, season, year)
  2. Use instrumental variables (if any exist)
  3. Model selection into oxygen use explicitly

I don't do that here, but it's worth flagging. Interestingly, the naive correlation actually understates the causal effect—this is negative confounding. Oxygen users target harder peaks, which suppresses the apparent benefit. The adjusted effect (~55-62pp) is larger than the naive association (~34pp).
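
One hedged way to see the direction of that adjustment is a covariate-adjusted logistic regression, simpler than formal matching but in the spirit of step 1 (column names follow the correlation section below; season and year are assumed fields):

    # Naive vs covariate-adjusted association between oxygen use and success.
    naive_fit <- glm(success1 ~ o2used, data = exped, family = binomial)

    adj_fit <- glm(success1 ~ o2used + heightm + season + factor(year),
                   data = exped, family = binomial)

    # Compare the oxygen coefficients on the log-odds scale.
    # (If o2used is logical rather than 0/1, the name becomes "o2usedTRUE".)
    coef(naive_fit)["o2used"]
    coef(adj_fit)["o2used"]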


Peak-Level Analysis

This is where the hierarchical structure becomes visible.

Top 15 most attempted peaks

Everest dominates. Cho Oyu, Manaslu, and Ama Dablam are also popular. The color gradient shows success rates—notice how they vary substantially across peaks.

Success rate by peak height

The GAM smooth shows a non-linear decline. Success rates hold steady until about 7,500m, then drop sharply. The 8,000m threshold (dashed line) marks the Death Zone.
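
That kind of smooth is what mgcv fits directly; a sketch, assuming heightm is peak height in metres:

    library(mgcv)

    # Logistic GAM: success probability as a smooth function of peak height.
    gam_fit <- gam(success1 ~ s(heightm), data = exped, family = binomial)

    # Predicted probabilities over an illustrative height range, for plotting.
    grid <- data.frame(heightm = seq(6000, 8850, by = 50))
    grid$p_success <- predict(gam_fit, newdata = grid, type = "response")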

Peak Variation

This plot is important for understanding why hierarchical models matter:

Success rate variation across peaks

Each dot is a peak (with 10+ expeditions). The horizontal spread shows between-peak variation. Some peaks have 80%+ success rates; others are below 30%. This variation is partly explained by height, but not entirely—route difficulty, weather patterns, and infrastructure differ.

The dashed line is the overall mean. Peaks with few expeditions and extreme success rates are good candidates for shrinkage—we shouldn't fully trust a 100% rate from 3 expeditions.


Risk and Termination

Why do expeditions end? The termination reasons tell a story.

Termination reasons

"Success" is the most common reason—which is good. But weather, route conditions, and team issues are significant. Deaths are relatively rare (thankfully), but they do occur.

Deaths over time

Deaths are low-count events—typically single digits per year in this dataset. Modeling mortality would require different techniques: Poisson regression, zero-inflation, or a two-stage model (any death? → how many?).
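
As a sketch of the first option, an expedition-level Poisson model might look like this (deaths is an assumed name for the per-expedition death count):

    # Poisson GLM for expedition-level death counts.
    pois_fit <- glm(deaths ~ heightm + o2used + totmembers,
                    data = exped, family = poisson)

    # Overdispersion check: a ratio well above 1 points toward a
    # negative binomial or zero-inflated model instead.
    pois_fit$deviance / pois_fit$df.residual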


Correlation Structure

Before modeling, it's useful to check which predictors are correlated.

Correlation matrix

Some expected patterns:

  • totmembers and tothired are positively correlated (bigger expeditions bring more staff)
  • o2used and heightm are correlated (oxygen more common on higher peaks)
  • success1 correlates with o2used (the confounded association we discussed)

High correlation between predictors can cause multicollinearity in regression. For prediction it's less of a concern, but for interpretation it matters.
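
Computing the matrix is a one-liner once the predictors are numeric (a sketch using the column names above):

    library(dplyr)

    exped %>%
      select(totmembers, tothired, o2used, heightm, success1) %>%
      mutate(across(everything(), as.numeric)) %>%
      cor(use = "pairwise.complete.obs") %>%
      round(2)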


Bayesian Preview

I'll cover Bayesian shrinkage properly in the next post, but here's a preview.

The overall success rate can be estimated with a Beta-Binomial model:

Bayesian posterior for overall success rate

This posterior distribution captures our uncertainty about the true success rate. The 95% credible interval is tight because we have hundreds of observations.
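
The posterior has a closed form, so the computation is a few lines (a sketch with a flat Beta(1, 1) prior; success1 is the 0/1 outcome):

    # Beta-Binomial: a Beta(a0, b0) prior plus a binomial likelihood
    # gives a Beta(a0 + successes, b0 + failures) posterior.
    a0 <- 1; b0 <- 1
    successes <- sum(exped$success1, na.rm = TRUE)
    failures  <- sum(exped$success1 == 0, na.rm = TRUE)

    post_a <- a0 + successes
    post_b <- b0 + failures

    post_a / (post_a + post_b)                 # posterior mean
    qbeta(c(0.025, 0.975), post_a, post_b)     # 95% credible interval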

The more interesting application is per-peak shrinkage:

Empirical Bayes shrinkage

The dotted diagonal (45°) represents "no shrinkage", where raw and shrunk estimates would be equal. The orange fitted line shows the actual relationship: it has a shallower slope, meaning extreme raw rates get pulled toward the mean. Small-sample peaks (small dots) shrink more than large-sample peaks (large dots).
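
A sketch of the computation behind that plot: fit a Beta prior to the peak-level rates (a simple method-of-moments version; a fuller treatment accounts for binomial sampling noise, which the next post covers), then shrink each peak toward it. peakid is an assumed peak identifier column.

    library(dplyr)

    # Per-peak counts, restricted to peaks with 10+ expeditions.
    peak_stats <- exped %>%
      group_by(peakid) %>%
      summarise(n = n(), successes = sum(success1), rate = successes / n) %>%
      filter(n >= 10)

    # Method-of-moments Beta prior from the raw per-peak rates.
    m <- mean(peak_stats$rate)
    v <- var(peak_stats$rate)
    k <- m * (1 - m) / v - 1
    a0 <- m * k
    b0 <- (1 - m) * k

    # Posterior-mean ("shrunk") estimate: small-n peaks move furthest
    # toward the prior mean a0 / (a0 + b0).
    peak_stats <- peak_stats %>%
      mutate(shrunk_rate = (a0 + successes) / (a0 + b0 + n))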

This is Empirical Bayes in action. It's the same idea behind baseball batting average shrinkage, hospital quality metrics, and the James-Stein estimator. The next post develops this further.


What I Learned

A few takeaways from EDA:

  1. Missing data is informative. Routes, summit dates, and other fields have patterns that relate to outcomes.

  2. Team composition matters non-linearly. Solo attempts are distinct; the effect of additional team members diminishes.

  3. Oxygen is confounded. The naive association understates the causal effect.

  4. Peaks vary substantially. Between-peak variation justifies hierarchical modeling.

  5. Height has a threshold effect. The 8,000m mark isn't just symbolic; it's where physiology and success rates shift.

The next post translates these findings into features for modeling.