
Week 11: When Your Model Forgets to Look Outside

How a 28.6% accuracy week revealed my NFL prediction model was collecting weather and rest data but completely ignoring it.

The Disaster

Week 11 was brutal. My NFL prediction model achieved 28.6% accuracy—getting just 4 out of 14 games correct. That's significantly worse than a coin flip, and more than two standard deviations below the model's historical 55% average.

Week 11 Performance Metrics

  Metric        Week 11   Season Average   Delta
  Accuracy      28.6%     55.2%            -26.6 pp
  Brier Score   0.325     0.249            +0.076
  Log Loss      0.878     0.696            +0.182
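
As a quick sanity check on the two-standard-deviation claim, you can treat a 14-game week as 14 Bernoulli trials at the season-average hit rate. This is a back-of-the-envelope calculation, not part of the pipeline:

import math

p_season = 0.552          # season-average accuracy
n_games = 14              # games in Week 11
p_week = 4 / 14           # Week 11 accuracy (28.6%)

# Standard deviation of a 14-game accuracy under the season-average rate
sigma = math.sqrt(p_season * (1 - p_season) / n_games)   # ~0.133

z = (p_season - p_week) / sigma
print(f"Week 11 was {z:.2f} standard deviations below the season average")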

The failure wasn't subtle: I missed several major upsets outright.

The Root Cause Investigation

When you're building ML models, the first instinct after a failure is to assume the model needs more complexity. Add more features! Try ensemble methods! Tune hyperparameters!

But I've learned that the best debugging starts with understanding what your model is actually doing versus what you think it's doing.

What I Discovered

My data pipeline had three stages:

  1. Feature Collection - Download weather, rest days, injuries from nflverse
  2. Adjustment Calculation - Convert features to ELO adjustments in dbt
  3. Webpage Generation - Export predictions to JSON for the site

Stages 1 and 2 were working perfectly. The problem was in stage 3.

# generate_full_webpage_data.py (BEFORE)
sim_df = pd.read_parquet("nfl_reg_season_simulator.parquet")

# This table had ONLY pure ELO predictions
# No weather, no rest adjustments, nothing

Meanwhile, a different table existed with all the context:

# The table I SHOULD have been using
pred_df = pd.read_parquet("nfl_predictions_with_features.parquet")

# This has:
# - home_win_prob_base (pure ELO)
# - home_win_prob_adjusted (with weather/rest)
# - rest_adj, temp_adj, wind_adj, injury_adj

I had built an entire feature engineering pipeline—collecting weather data, calculating rest day differentials, weighting injuries by position—and then completely ignored it when making predictions.

It's the ML equivalent of checking the weather forecast, packing an umbrella, and then leaving it at home.

The Fix

Once I found the root cause, the fix was straightforward but required coordinating changes across the pipeline:

1. Update Feature Collection for 2025

The feature collection script was failing for 2025 data because injury reports aren't published yet on nflverse. I modified it to gracefully handle missing data:

# collect_enhanced_features.py
try:
    injuries = nfl.load_injuries(seasons)
    injury_scores = calculate_team_injury_scores(injuries)
except Exception as e:
    # 2025 injury reports aren't on nflverse yet, so any failure here
    # should degrade gracefully instead of crashing the whole collection run.
    print(f"Warning: Could not load injury data: {e}")
    print("Continuing without injury data (injury scores will be 0)")
    injury_scores = None

Result: Successfully collected 272 games' worth of weather and rest data for the 2025 season.

2. Rebuild dbt Models

With 2025 features collected, I rebuilt the dbt transformation pipeline to calculate ELO adjustments:

$ dbt build

This populated the nfl_elo_adjustments table with 2025 data, converting the rest, temperature, and wind features into ELO point adjustments.
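
The idea behind an ELO adjustment is that features shift the effective rating gap before it is converted into a win probability. Here is a minimal Python sketch of that conversion, assuming the standard logistic ELO curve and a total_adj expressed in ELO points; the home-field constant and the example numbers are illustrative, not the model's tuned parameters:

def win_probability(home_elo: float, away_elo: float,
                    total_adj: float = 0.0, home_field: float = 48.0) -> float:
    """Logistic ELO win probability with feature adjustments folded into the rating gap."""
    elo_diff = (home_elo + home_field + total_adj) - away_elo
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400.0))

# Example: a 40-point ELO favorite at home with a -25 point weather/rest penalty
print(win_probability(1540, 1500, total_adj=-25.0))   # ~0.59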

3. Update Webpage Generation

The critical change—switch from baseline ELO to feature-adjusted predictions:

# generate_full_webpage_data.py (AFTER)
import pandas as pd

pred_df = pd.read_parquet("nfl_predictions_with_features.parquet")

# current_week is determined earlier in the script
current_week_df = pred_df[pred_df['week_number'] == current_week]

games = []
for _, game_data in current_week_df.iterrows():
    games.append({
        # Use adjusted probabilities instead of the pure-ELO baseline
        'home_win_probability': float(game_data['home_win_prob_adjusted']),
        'predicted_winner': game_data['predicted_winner_adjusted'],

        # Export the feature adjustments so the site can display them
        'rest_adj': float(game_data.get('rest_adj', 0.0)),
        'temp_adj': int(game_data.get('temp_adj', 0)),
        'wind_adj': int(game_data.get('wind_adj', 0)),
        'total_adj': float(game_data.get('total_adj', 0.0)),
    })

The Impact

Looking back at Week 11 through the lens of feature adjustments, I can see what the model should have known:

Example: Tampa Bay @ Buffalo

With the adjustments applied, the model still picks Buffalo, just with lower confidence because of the weather. The Bills won 44-32, so the pick was right either way, but the weather-aware probability would have been better calibrated.

Week 11 Summary with Adjustments

While these adjustments wouldn't have fixed all 10 missed predictions, they make the model context-aware instead of predicting in a vacuum.

What I Learned

1. Infrastructure Debt is Sneaky

This wasn't a modeling problem. The ELO algorithm worked fine. The feature engineering pipeline worked fine. The bug was in the plumbing—which table gets exported to the website.

It's easy to build sophisticated data pipelines and then wire them together incorrectly. The fix took 3 lines of code. Finding it took hours of investigation.

2. Test Your Assumptions

I assumed my model was using weather and rest data because I had code that collected it. I never verified that assumption by actually checking the predictions.

A simple test would have caught this:

# Week 11 predictions
assert any(game['total_adj'] != 0 for game in predictions), \
    "No games have adjustments - features not being used!"

Now I have that test. It would have failed before the fix, and passes after.

3. Debugging Beats Guessing

When Week 11 bombed, I could have added more features, tried ensemble methods, or tuned hyperparameters.

All of those might have helped, but they would have been premature optimization. The actual problem was simpler: the model wasn't using the features it already had.

Systematic debugging found the root cause in one session:

  1. Check what data is being collected → ✓ Features exist
  2. Check if adjustments are calculated → ✓ Adjustments exist
  3. Check if predictions use adjustments → ✗ Using wrong table

Root cause found. Fix implemented. Problem solved.

What's Next

This fix addresses Improvement #1 from my post-Week 11 analysis: Feature Integration & Calibration.

The model now incorporates weather conditions, rest day differentials, and position-weighted injury scores.

But there are still 6 more improvements on the roadmap:

  1. Temporal Decay on ELO - Weight recent games more heavily
  2. Calibration Layer - Isotonic regression to improve probability estimates
  3. Ensemble Methods - Combine ELO with vegas lines and recent form
  4. Feature Engineering - Offense/defense splits, matchup-specific factors
  5. Model Validation - Proper time-series cross-validation
  6. Uncertainty Quantification - Confidence intervals on predictions

Each improvement will be its own experiment. Week 12 predictions are now live with feature adjustments enabled. Let's see if context-aware predictions beat pure ELO.

Update: Phase 1 Complete (November 19, 2025)

After publishing this post, I implemented the Phase 1 improvements from the roadmap. Here's what happened:

What I Built

1. Recent Form & Momentum Tracking

Added 3-game rolling averages to detect hot and cold teams (sketched below).
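
This is a minimal reconstruction of the idea rather than the repository's actual code; the column names (team, game_date, points_for, points_against) and the game-level schema are my assumptions:

import pandas as pd

def add_momentum(games: pd.DataFrame, window: int = 3) -> pd.DataFrame:
    """Rolling point differential over each team's last `window` games.
    Shifted by one game so momentum only uses games already played."""
    games = games.sort_values(["team", "game_date"]).copy()
    games["point_diff"] = games["points_for"] - games["points_against"]
    games["momentum"] = (
        games.groupby("team")["point_diff"]
        .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    )
    return games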

2. Vegas Line Ensemble (50/50 Model)

Integrated The Odds API to combine our ELO model with market wisdom (the blend is sketched below).
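
The core of a 50/50 ensemble is converting a moneyline into an implied probability and averaging it with the model's probability. A minimal sketch of that math, leaving out the API call; the de-vig normalization is my addition and may differ from what the production code does:

def implied_prob(moneyline: int) -> float:
    """Convert an American moneyline to a raw implied probability (includes vig)."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)

def ensemble_prob(elo_home_prob: float, home_ml: int, away_ml: int,
                  elo_weight: float = 0.5) -> float:
    """50/50 blend of the ELO probability and the de-vigged market probability."""
    raw_home, raw_away = implied_prob(home_ml), implied_prob(away_ml)
    market_home = raw_home / (raw_home + raw_away)   # strip the vig by normalizing
    return elo_weight * elo_home_prob + (1 - elo_weight) * market_home

# Example: ELO says 62% home win, market has the home team at -150 / +130
print(ensemble_prob(0.62, home_ml=-150, away_ml=+130))   # ~0.60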

3. Feature Recalibration

Doubled down on the adjustments that had proven too conservative.

4. Live Weather Forecasts

Built an integration with the National Weather Service API (the lookup is sketched below).
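
The NWS API is free and keyless: you resolve a stadium's coordinates to a forecast URL via the points endpoint, then pull the forecast periods from it. A minimal sketch, not the project's actual integration; the coordinates are approximate and the error handling is stripped out:

import requests

HEADERS = {"User-Agent": "nfl-predictions (contact: you@example.com)"}  # NWS asks for a User-Agent

def stadium_forecast(lat: float, lon: float) -> dict:
    """Return the first forecast period (temperature, wind, conditions) for a location."""
    points = requests.get(f"https://api.weather.gov/points/{lat},{lon}",
                          headers=HEADERS, timeout=10).json()
    forecast_url = points["properties"]["forecast"]
    forecast = requests.get(forecast_url, headers=HEADERS, timeout=10).json()
    first = forecast["properties"]["periods"][0]
    return {
        "temperature": first["temperature"],   # degrees F
        "wind": first["windSpeed"],            # e.g. "10 to 15 mph"
        "conditions": first["shortForecast"],
    }

# Example: Highmark Stadium (Buffalo), coordinates approximate
print(stadium_forecast(42.7738, -78.7870))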

The Results

I backtested all improvements on Week 11 data:

  Model Version            Accuracy        Improvement
  Baseline (pure ELO)      28.6% (4/14)    (baseline)
  + Features (original)    28.6% (4/14)    +0 games
  + Momentum tracking      35.7% (5/14)    +1 game ✓
  + Vegas ensemble         35.7% (5/14)    +0 games (moderated probabilities)
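
The backtest itself is just re-scoring Week 11 with each variant's predicted winners. A sketch of that comparison, using hypothetical column names (predicted_winner_<variant>, actual_winner) in place of the real schema:

import pandas as pd

def backtest_accuracy(week_df: pd.DataFrame, variants: list[str]) -> pd.Series:
    """Fraction of games each model variant called correctly for one week."""
    results = {
        v: (week_df[f"predicted_winner_{v}"] == week_df["actual_winner"]).mean()
        for v in variants
    }
    return pd.Series(results).sort_values(ascending=False)

# Example usage (variant names are illustrative):
# week11 = pd.read_parquet("nfl_predictions_with_features.parquet").query("week_number == 11")
# print(backtest_accuracy(week11, ["base", "adjusted", "momentum", "ensemble"]))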

Key findings: momentum tracking added one correct pick, the original feature adjustments alone added none, and the Vegas ensemble didn't change any picks but moderated overconfident probabilities.

Week 12 Predictions Now Live

The ensemble model is now deployed with all Phase 1 improvements, and the Week 12 slate includes some interesting predictions.

What's Next

Phase 1 is complete. Phase 2 improvements:

  1. Calibration Layer - Isotonic regression on probabilities (see the sketch after this list)
  2. Model Validation - Time-series cross-validation framework
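
For the calibration layer, the usual approach is to fit isotonic regression on historical predicted probabilities against actual outcomes, then pass new predictions through it. A minimal scikit-learn sketch with toy placeholder arrays, not the project's code; in practice you would fit on a full season or more of predictions:

import numpy as np
from sklearn.isotonic import IsotonicRegression

# Historical data: predicted home-win probabilities and actual outcomes (1 = home won)
past_probs = np.array([0.55, 0.62, 0.48, 0.71, 0.60, 0.35, 0.80, 0.52])
past_outcomes = np.array([1, 0, 0, 1, 1, 0, 1, 1])

calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(past_probs, past_outcomes)

# Calibrate this week's raw model probabilities
raw = np.array([0.58, 0.66, 0.41])
print(calibrator.predict(raw))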

The model has evolved from pure ELO into an ensemble that combines ELO ratings, weather and rest adjustments, momentum tracking, and Vegas market probabilities.

Week 12 will be the real test. Let's see how the ensemble performs.

Conclusion

Week 11 taught me an important lesson: sophisticated doesn't mean correct.

I built a feature engineering pipeline that downloads weather data, calculates rest day advantages, and weights injuries by position importance. It was sophisticated, well-tested, and completely unused by the final model.

The fix was unglamorous—change which database table gets queried. But that's often how bugs work in production systems. The hard part isn't writing clever algorithms; it's ensuring all the pieces wire together correctly.

But fixing the wiring was just the start. Once the model could actually use the features, I realized they were too conservative. So I recalibrated them, added momentum tracking, integrated Vegas lines for ensemble predictions, and built a live weather forecast system.

Phase 1 is complete. The model went from 28.6% accuracy with ignored features to 35.7% with momentum-aware ensemble predictions. Not perfect, but moving in the right direction.

Now let's see what Week 12 brings.