
Week 11: When Your Model Forgets to Look Outside

How a 28.6% accuracy week revealed my NFL prediction model was collecting weather and rest data but completely ignoring it.

The Disaster

Week 11 was brutal. My NFL prediction model achieved 28.6% accuracy—getting just 4 out of 14 games correct. That's significantly worse than a coin flip, and more than two standard deviations below the model's historical 55% average.

Week 11 Performance Metrics

  Metric        Week 11   Season Average   Delta
  Accuracy      28.6%     55.2%            -26.6 pp
  Brier Score   0.325     0.249            +0.076
  Log Loss      0.878     0.696            +0.182
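
As a quick sanity check on the two-standard-deviation claim, you can treat a 14-game week as 14 Bernoulli trials at the season-average hit rate. This is a back-of-the-envelope calculation, not part of the pipeline:

import math

p_season = 0.552          # season-average accuracy
n_games = 14              # games in Week 11
p_week = 4 / 14           # Week 11 accuracy (28.6%)

# Standard deviation of a 14-game accuracy under the season-average rate
sigma = math.sqrt(p_season * (1 - p_season) / n_games)   # ~0.133

z = (p_season - p_week) / sigma
print(f"Week 11 was {z:.2f} standard deviations below the season average")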

The failure wasn't subtle: I missed several major upsets outright.

The Root Cause Investigation

When you're building ML models, the first instinct after a failure is to assume the model needs more complexity. Add more features! Try ensemble methods! Tune hyperparameters!

But I've learned that the best debugging starts with understanding what your model is actually doing versus what you think it's doing.

What I Discovered

My data pipeline had three stages:

  1. Feature Collection - Download weather, rest days, injuries from nflverse
  2. Adjustment Calculation - Convert features to ELO adjustments in dbt
  3. Webpage Generation - Export predictions to JSON for the site

Stages 1 and 2 were working perfectly. The problem was in stage 3.

# generate_full_webpage_data.py (BEFORE)
sim_df = pd.read_parquet("nfl_reg_season_simulator.parquet")

# This table had ONLY pure ELO predictions
# No weather, no rest adjustments, nothing

Meanwhile, a different table existed with all the context:

# The table I SHOULD have been using
pred_df = pd.read_parquet("nfl_predictions_with_features.parquet")

# This has:
# - home_win_prob_base (pure ELO)
# - home_win_prob_adjusted (with weather/rest)
# - rest_adj, temp_adj, wind_adj, injury_adj

I had built an entire feature engineering pipeline—collecting weather data, calculating rest day differentials, weighting injuries by position—and then completely ignored it when making predictions.

It's the ML equivalent of checking the weather forecast, packing an umbrella, and then leaving it at home.

The Fix

Once I found the root cause, the fix was straightforward but required coordinating changes across the pipeline:

1. Update Feature Collection for 2025

The feature collection script was failing for 2025 data because injury reports aren't published yet on nflverse. I modified it to gracefully handle missing data:

# collect_enhanced_features.py
try:
    injuries = nfl.load_injuries(seasons)
    injury_scores = calculate_team_injury_scores(injuries)
except Exception as e:
    # 2025 injury reports aren't on nflverse yet, so any failure here
    # should degrade gracefully instead of crashing the whole collection run.
    print(f"Warning: Could not load injury data: {e}")
    print("Continuing without injury data (injury scores will be 0)")
    injury_scores = None

Result: Successfully collected 272 games' worth of weather and rest data for the 2025 season.

2. Rebuild dbt Models

With 2025 features collected, I rebuilt the dbt transformation pipeline to calculate ELO adjustments:

$ dbt build

This populated the nfl_elo_adjustments table with 2025 data, converting the rest, temperature, and wind features into ELO point adjustments.
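
The idea behind an ELO adjustment is that features shift the effective rating gap before it is converted into a win probability. Here is a minimal Python sketch of that conversion, assuming the standard logistic ELO curve and a total_adj expressed in ELO points; the home-field constant and the example numbers are illustrative, not the model's tuned parameters:

def win_probability(home_elo: float, away_elo: float,
                    total_adj: float = 0.0, home_field: float = 48.0) -> float:
    """Logistic ELO win probability with feature adjustments folded into the rating gap."""
    elo_diff = (home_elo + home_field + total_adj) - away_elo
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400.0))

# Example: a 40-point ELO favorite at home with a -25 point weather/rest penalty
print(win_probability(1540, 1500, total_adj=-25.0))   # ~0.59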

3. Update Webpage Generation

The critical change—switch from baseline ELO to feature-adjusted predictions:

# generate_full_webpage_data.py (AFTER)
import pandas as pd

pred_df = pd.read_parquet("nfl_predictions_with_features.parquet")

# current_week is determined earlier in the script
current_week_df = pred_df[pred_df['week_number'] == current_week]

games = []
for _, game_data in current_week_df.iterrows():
    games.append({
        # Use adjusted probabilities instead of the pure-ELO baseline
        'home_win_probability': float(game_data['home_win_prob_adjusted']),
        'predicted_winner': game_data['predicted_winner_adjusted'],

        # Export the feature adjustments so the site can display them
        'rest_adj': float(game_data.get('rest_adj', 0.0)),
        'temp_adj': int(game_data.get('temp_adj', 0)),
        'wind_adj': int(game_data.get('wind_adj', 0)),
        'total_adj': float(game_data.get('total_adj', 0.0)),
    })

The Impact

Looking back at Week 11 through the lens of feature adjustments, I can see what the model should have known:

Example: Tampa Bay @ Buffalo

With the adjustments applied, the model still picks Buffalo, just with lower confidence because of the weather. The Bills won 44-32, so the pick was right either way, but the weather-aware probability would have been better calibrated.

Week 11 Summary with Adjustments

While these adjustments wouldn't have fixed all 10 missed predictions, they make the model context-aware instead of predicting in a vacuum.

What I Learned

1. Infrastructure Debt is Sneaky

This wasn't a modeling problem. The ELO algorithm worked fine. The feature engineering pipeline worked fine. The bug was in the plumbing—which table gets exported to the website.

It's easy to build sophisticated data pipelines and then wire them together incorrectly. The fix took 3 lines of code. Finding it took hours of investigation.

2. Test Your Assumptions

I assumed my model was using weather and rest data because I had code that collected it. I never verified that assumption by actually checking the predictions.

A simple test would have caught this:

# Week 11 predictions
assert any(game['total_adj'] != 0 for game in predictions), \
    "No games have adjustments - features not being used!"

Now I have that test. It would have failed before the fix, and passes after.

3. Debugging Beats Guessing

When Week 11 bombed, I could have added more features, tried ensemble methods, or tuned hyperparameters.

All of those might have helped, but they would have been premature optimization. The actual problem was simpler: the model wasn't using the features it already had.

Systematic debugging found the root cause in one session:

  1. Check what data is being collected → ✓ Features exist
  2. Check if adjustments are calculated → ✓ Adjustments exist
  3. Check if predictions use adjustments → ✗ Using wrong table

Root cause found. Fix implemented. Problem solved.

What's Next

This fix addresses Improvement #1 from my post-Week 11 analysis: Feature Integration & Calibration.

The model now incorporates weather conditions, rest day differentials, and position-weighted injury scores.

But there are still 6 more improvements on the roadmap:

  1. Temporal Decay on ELO - Weight recent games more heavily
  2. Calibration Layer - Isotonic regression to improve probability estimates
  3. Ensemble Methods - Combine ELO with vegas lines and recent form
  4. Feature Engineering - Offense/defense splits, matchup-specific factors
  5. Model Validation - Proper time-series cross-validation
  6. Uncertainty Quantification - Confidence intervals on predictions

Each improvement will be its own experiment. Week 12 predictions are now live with feature adjustments enabled. Let's see if context-aware predictions beat pure ELO.

Update: Phase 1 Complete (November 19, 2025)

After publishing this post, I implemented the Phase 1 improvements from the roadmap. Here's what happened:

What I Built

1. Recent Form & Momentum Tracking

Added 3-game rolling averages to detect hot and cold teams (sketched below).
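
This is a minimal reconstruction of the idea rather than the repository's actual code; the column names (team, game_date, points_for, points_against) and the game-level schema are my assumptions:

import pandas as pd

def add_momentum(games: pd.DataFrame, window: int = 3) -> pd.DataFrame:
    """Rolling point differential over each team's last `window` games.
    Shifted by one game so momentum only uses games already played."""
    games = games.sort_values(["team", "game_date"]).copy()
    games["point_diff"] = games["points_for"] - games["points_against"]
    games["momentum"] = (
        games.groupby("team")["point_diff"]
        .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    )
    return games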

2. Vegas Line Ensemble (50/50 Model)

Integrated The Odds API to combine our ELO model with market wisdom (the blend is sketched below).
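
The core of a 50/50 ensemble is converting a moneyline into an implied probability and averaging it with the model's probability. A minimal sketch of that math, leaving out the API call; the de-vig normalization is my addition and may differ from what the production code does:

def implied_prob(moneyline: int) -> float:
    """Convert an American moneyline to a raw implied probability (includes vig)."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)

def ensemble_prob(elo_home_prob: float, home_ml: int, away_ml: int,
                  elo_weight: float = 0.5) -> float:
    """50/50 blend of the ELO probability and the de-vigged market probability."""
    raw_home, raw_away = implied_prob(home_ml), implied_prob(away_ml)
    market_home = raw_home / (raw_home + raw_away)   # strip the vig by normalizing
    return elo_weight * elo_home_prob + (1 - elo_weight) * market_home

# Example: ELO says 62% home win, market has the home team at -150 / +130
print(ensemble_prob(0.62, home_ml=-150, away_ml=+130))   # ~0.60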

3. Feature Recalibration

Doubled down on the adjustments that had proven too conservative.

4. Live Weather Forecasts

Built an integration with the National Weather Service API (the lookup is sketched below).
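
The NWS API is free and keyless: you resolve a stadium's coordinates to a forecast URL via the points endpoint, then pull the forecast periods from it. A minimal sketch, not the project's actual integration; the coordinates are approximate and the error handling is stripped out:

import requests

HEADERS = {"User-Agent": "nfl-predictions (contact: you@example.com)"}  # NWS asks for a User-Agent

def stadium_forecast(lat: float, lon: float) -> dict:
    """Return the first forecast period (temperature, wind, conditions) for a location."""
    points = requests.get(f"https://api.weather.gov/points/{lat},{lon}",
                          headers=HEADERS, timeout=10).json()
    forecast_url = points["properties"]["forecast"]
    forecast = requests.get(forecast_url, headers=HEADERS, timeout=10).json()
    first = forecast["properties"]["periods"][0]
    return {
        "temperature": first["temperature"],   # degrees F
        "wind": first["windSpeed"],            # e.g. "10 to 15 mph"
        "conditions": first["shortForecast"],
    }

# Example: Highmark Stadium (Buffalo), coordinates approximate
print(stadium_forecast(42.7738, -78.7870))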

The Results

I backtested all improvements on Week 11 data:

  Model Version            Accuracy        Improvement
  Baseline (pure ELO)      28.6% (4/14)    (baseline)
  + Features (original)    28.6% (4/14)    +0 games
  + Momentum tracking      35.7% (5/14)    +1 game ✓
  + Vegas ensemble         35.7% (5/14)    +0 games (moderated probabilities)
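
The backtest itself is just re-scoring Week 11 with each variant's predicted winners. A sketch of that comparison, using hypothetical column names (predicted_winner_<variant>, actual_winner) in place of the real schema:

import pandas as pd

def backtest_accuracy(week_df: pd.DataFrame, variants: list[str]) -> pd.Series:
    """Fraction of games each model variant called correctly for one week."""
    results = {
        v: (week_df[f"predicted_winner_{v}"] == week_df["actual_winner"]).mean()
        for v in variants
    }
    return pd.Series(results).sort_values(ascending=False)

# Example usage (variant names are illustrative):
# week11 = pd.read_parquet("nfl_predictions_with_features.parquet").query("week_number == 11")
# print(backtest_accuracy(week11, ["base", "adjusted", "momentum", "ensemble"]))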

Key findings: momentum tracking added one correct pick, the original feature adjustments alone added none, and the Vegas ensemble didn't change any picks but moderated overconfident probabilities.

Week 12 Predictions Now Live

The ensemble model is now deployed with all Phase 1 improvements, and the Week 12 slate includes some interesting predictions.

What's Next

Phase 1 is complete. Phase 2 improvements:

  1. Calibration Layer - Isotonic regression on probabilities (see the sketch after this list)
  2. Model Validation - Time-series cross-validation framework
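
For the calibration layer, the usual approach is to fit isotonic regression on historical predicted probabilities against actual outcomes, then pass new predictions through it. A minimal scikit-learn sketch with toy placeholder arrays, not the project's code; in practice you would fit on a full season or more of predictions:

import numpy as np
from sklearn.isotonic import IsotonicRegression

# Historical data: predicted home-win probabilities and actual outcomes (1 = home won)
past_probs = np.array([0.55, 0.62, 0.48, 0.71, 0.60, 0.35, 0.80, 0.52])
past_outcomes = np.array([1, 0, 0, 1, 1, 0, 1, 1])

calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(past_probs, past_outcomes)

# Calibrate this week's raw model probabilities
raw = np.array([0.58, 0.66, 0.41])
print(calibrator.predict(raw))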

The model has evolved from pure ELO into an ensemble that combines ELO ratings, weather and rest adjustments, momentum tracking, and Vegas market probabilities.

Week 12 will be the real test. Let's see how the ensemble performs.

Conclusion

Week 11 taught me an important lesson: sophisticated doesn't mean correct.

I built a feature engineering pipeline that downloads weather data, calculates rest day advantages, and weights injuries by position importance. It was sophisticated, well-tested, and completely unused by the final model.

The fix was unglamorous—change which database table gets queried. But that's often how bugs work in production systems. The hard part isn't writing clever algorithms; it's ensuring all the pieces wire together correctly.

But fixing the wiring was just the start. Once the model could actually use the features, I realized they were too conservative. So I recalibrated them, added momentum tracking, integrated Vegas lines for ensemble predictions, and built a live weather forecast system.

Phase 1 is complete. The model went from 28.6% accuracy with ignored features to 35.7% with momentum-aware ensemble predictions. Not perfect, but moving in the right direction.

Now let's see what Week 12 brings.