● MLB BETTING Q&A · BY MARCDUCK

What is the Brier Score in Sports Betting?

The Brier score measures how well calibrated probability predictions are to actual outcomes. Lower is better. The coin-flip baseline is 0.25; strong sports models hit 0.18-0.22; weak models sit at 0.22-0.25, and anything above 0.25 is worse than always predicting 50%. Brier score is the gold-standard metric for evaluating a sports betting model's probability calibration.

What the Brier Score Measures

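The Brier score is the mean squared difference between predicted probability and actual outcome (0 or 1). For each prediction, compute (prob - outcome)^2, then average across all predictions. Scores range from 0 (every prediction certain and correct) to 1 (every prediction certain and wrong). Lower score = better calibration.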

Example: model predicts 60% probability for an outcome that happens (outcome = 1). Squared error: (0.60 - 1)^2 = 0.16.

Same model predicts 60% probability for an outcome that doesn't happen (outcome = 0). Squared error: (0.60 - 0)^2 = 0.36.

If half the 60% predictions hit, average squared error is (0.16 + 0.36) / 2 = 0.26. The model is poorly calibrated (60% predictions should hit 60% of the time, not 50%).
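The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function name brier_score is chosen here for illustration and is not taken from any particular library.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Two 60% predictions: one hits (outcome 1), one misses (outcome 0).
score = brier_score([0.60, 0.60], [1, 0])
print(round(score, 2))  # 0.26
```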

Why Brier Score Matters

Hit rate alone is misleading. A model that always predicts 50% on coin flips will have a 50% hit rate forever, but the predictions provide no information. Brier score penalizes overconfidence AND underconfidence. A model that predicts 80% on outcomes that hit 65% of the time gets penalized for overconfidence even though it picks the right side more often than not.

Brier score is the cleanest single number for "is this model calibrated to reality?"
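To see the penalty in numbers, here is a small sketch of the long-run Brier score when every forecast is the same value and the event hits at a fixed rate. The helper expected_brier is a name made up for this example, and the 65% true rate comes from the scenario above.

```python
def expected_brier(predicted, true_rate):
    """Long-run Brier score when every forecast equals `predicted` and the
    event actually happens with frequency `true_rate`."""
    hit_penalty = (predicted - 1) ** 2   # squared error when the event happens
    miss_penalty = predicted ** 2        # squared error when it does not
    return true_rate * hit_penalty + (1 - true_rate) * miss_penalty


# Events that really hit 65% of the time, forecast three different ways.
for label, p in [("overconfident", 0.80), ("calibrated", 0.65), ("underconfident", 0.50)]:
    print(label, round(expected_brier(p, 0.65), 4))
```

The calibrated 65% forecast scores 0.2275. The overconfident 80% forecast scores 0.25, no better than always predicting 50%, which also scores 0.25 here.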

Brier Score Benchmarks

0.10-0.15: elite calibration (rare; usually requires very narrow prediction problem or insider information)
0.15-0.20: strong calibration (sharp MLB models)
0.20-0.22: good calibration (solid models)
0.22-0.25: weak calibration (barely better than coin flip)
0.25: coin flip baseline (always predict 50%)
Above 0.25: worse than random; the model is actively misleading
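For anyone who wants the bands as a lookup, a rough sketch follows. The function name benchmark_band is hypothetical, and the handling of shared endpoints (each band owns its lower edge) is an assumption, since the ranges above do not say which band a boundary value belongs to.

```python
def benchmark_band(brier):
    """Map a Brier score to the benchmark labels listed above.

    Assumption: each band includes its lower edge and excludes its upper edge,
    because the listed ranges share endpoints.
    """
    if not 0.0 <= brier <= 1.0:
        raise ValueError("a binary-outcome Brier score lies between 0 and 1")
    if brier < 0.10:
        return "below the listed bands"  # the list above starts at 0.10
    if brier < 0.15:
        return "elite"
    if brier < 0.20:
        return "strong"
    if brier < 0.22:
        return "good"
    if brier < 0.25:
        return "weak"
    if brier == 0.25:
        return "coin flip baseline"
    return "worse than random"


print(benchmark_band(0.19))  # strong
print(benchmark_band(0.23))  # weak
```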

Brier Score vs Hit Rate

Hit rate and Brier score measure different things:

Hit rate: what percentage of predictions came true. Useful for marketing.
Brier score: how well calibrated the probability values are. Useful for betting (because betting requires probability comparisons to market prices).

A model can have decent hit rate but bad Brier score if its high-confidence predictions are systematically overconfident. Sharp bettors care about Brier score because they bet by probability, not by binary picks.
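A quick illustration with made-up picks: two models back the same side in the same five games, so their hit rates match, but one states 60% and the other 90%. The helper names and the data are for illustration only.

```python
def hit_rate(probs, outcomes):
    """Share of predictions where the side given more than 50% actually won."""
    correct = sum((p > 0.5) == bool(o) for p, o in zip(probs, outcomes))
    return correct / len(probs)


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


outcomes = [1, 1, 1, 0, 0]      # the picked side won 3 of 5
modest = [0.60] * 5             # says 60% every time
overconfident = [0.90] * 5      # says 90% every time

for name, probs in [("modest", modest), ("overconfident", overconfident)]:
    print(name, hit_rate(probs, outcomes), round(brier_score(probs, outcomes), 2))
```

Both models show a 60% hit rate, but the modest model scores 0.24 and the overconfident one scores 0.33. Betting the overconfident model's 90% against market prices would be a costly mistake that hit rate never reveals.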

Brier Score by Bucket

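Aggregate Brier score hides important information. Better: split predictions into probability buckets and compare each bucket's actual hit rate with its average predicted probability.

50-55% predictions → should hit about 52.5%
55-60% → should hit about 57.5%
60-65% → should hit about 62.5%
65-70% → should hit about 67.5%
70%+ → should hit the bucket's average predicted probability (75% only if the predictions in it average 75%)

The midpoint targets assume predictions are spread evenly within each bucket; the exact target is always the bucket's mean predicted probability.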

Per-bucket calibration tells you WHERE the model is miscalibrated. A model with strong overall Brier might still be overconfident in 55-60% predictions (a common pattern in MLB models). Sharp bettors identify those buckets and skip them.
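A per-bucket breakdown might look like the sketch below. It assumes each prediction is the probability of the picked side (so 50% or higher), uses the bucket edges from the list above, and runs on made-up picks; the function name and the row layout are choices made for this example.

```python
from bisect import bisect_right

# Bucket edges from the list above; the last bucket is "70%+".
EDGES = [0.50, 0.55, 0.60, 0.65, 0.70]


def calibration_by_bucket(probs, outcomes, edges=EDGES):
    """Return one row per non-empty bucket:
    (bucket_start, count, mean_predicted, hit_rate, brier)."""
    grouped = {}
    for p, o in zip(probs, outcomes):
        if p < edges[0] or p > 1:
            raise ValueError(f"probability {p} is outside the bucketed range")
        # bisect_right finds the last edge that is <= p
        start = edges[bisect_right(edges, p) - 1]
        grouped.setdefault(start, []).append((p, o))

    rows = []
    for start in sorted(grouped):
        pairs = grouped[start]
        n = len(pairs)
        mean_pred = sum(p for p, _ in pairs) / n
        hit = sum(o for _, o in pairs) / n
        brier = sum((p - o) ** 2 for p, o in pairs) / n
        rows.append((start, n, mean_pred, hit, brier))
    return rows


# Made-up picks for illustration only.
probs = [0.52, 0.54, 0.57, 0.58, 0.58, 0.59, 0.62, 0.71]
outcomes = [1, 0, 0, 1, 0, 0, 1, 1]
for start, n, mean_pred, hit, brier in calibration_by_bucket(probs, outcomes):
    gap = mean_pred - hit  # positive gap = overconfident in this bucket
    print(f"{start:.2f}+  n={n}  predicted={mean_pred:.3f}  "
          f"hit={hit:.3f}  gap={gap:+.3f}  brier={brier:.3f}")
```

With real data, each bucket needs enough picks for the gap to mean anything; a handful of games per bucket is mostly noise.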

How to Improve Brier Score

Platt scaling. Fit a logistic regression on (predicted_prob, outcome) pairs from historical data. Apply the regression to future predictions to recalibrate them. Reduces overconfidence in tails.
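Isotonic regression. Non-parametric alternative to Platt scaling that fits a monotonic step function to the (predicted_prob, outcome) pairs. Better when the miscalibration is not sigmoid-shaped, but it needs more data and can overfit small samples.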
Per-bucket recalibration. Apply different calibration adjustments per probability bucket. Most accurate but requires larger sample sizes per bucket.
Ensemble. Combine multiple models with different miscalibration patterns. The averaged probability never scores worse than the average of the individual models' Brier scores, and it is often better calibrated than any single model.
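As a rough sketch of the first option, the code below fits Platt-style recalibration in plain Python with gradient descent. Two things here are assumptions rather than anything stated above: the regression is fit on the log-odds of the predicted probability (so that "no change" is a = 1, b = 0), and the learning rate and step count are arbitrary values that happen to converge on this toy data. In practice a library logistic regression would normally do the fitting, on historical picks kept separate from the ones being recalibrated.

```python
import math


def _logit(p, eps=1e-6):
    p = min(max(p, eps), 1 - eps)  # keep log() finite at 0 and 1
    return math.log(p / (1 - p))


def _sigmoid(z):
    return 1 / (1 + math.exp(-z))


def fit_platt(probs, outcomes, lr=0.5, steps=2000):
    """Fit a, b so that sigmoid(a * logit(p) + b) matches the outcomes,
    using plain gradient descent on the mean log loss."""
    xs = [_logit(p) for p in probs]
    a, b = 1.0, 0.0  # start from "no recalibration"
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, outcomes):
            err = _sigmoid(a * x + b) - y
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def recalibrate(p, a, b):
    """Apply the fitted correction to a new predicted probability."""
    return _sigmoid(a * _logit(p) + b)


# Toy history: 80% calls that hit 13 of 20, and 20% calls that hit 7 of 20.
history_probs = [0.80] * 20 + [0.20] * 20
history_outcomes = [1] * 13 + [0] * 7 + [1] * 7 + [0] * 13

a, b = fit_platt(history_probs, history_outcomes)
print(round(recalibrate(0.80, a, b), 3))  # about 0.65
```

A fitted slope below 1 means the model was overconfident and its probabilities get pulled toward 50%; a slope above 1 means the opposite.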

Brier Score in Sportsbook Pricing

Sportsbooks track their own Brier scores on de-vigged closing lines (closing prices with the bookmaker's margin removed). Sharp books (Pinnacle, Circa) hit Brier 0.18-0.20 on closing lines, which is why beating the closing line consistently is so hard. Bettors who consistently beat the closing line have implicitly demonstrated Brier scores lower than the book's, and sportsbooks limit those accounts for it.
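For reference, de-vigging a two-way moneyline can be sketched as below. This uses proportional normalization, which is only one of several de-vig methods; nothing above says which method any book or model actually uses, and the -150 / +130 prices are made up.

```python
def implied_probability(american_odds):
    """Implied probability (vig included) from American moneyline odds."""
    if abs(american_odds) < 100:
        raise ValueError("American odds are at least 100 in absolute value")
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)


def devig(odds_a, odds_b):
    """Proportional de-vig for a two-way market: scale both implied
    probabilities so they sum to 1."""
    raw_a = implied_probability(odds_a)
    raw_b = implied_probability(odds_b)
    total = raw_a + raw_b  # exceeds 1 by the book's margin
    return raw_a / total, raw_b / total


favorite, underdog = devig(-150, +130)
print(round(favorite, 4), round(underdog, 4))  # 0.5798 0.4202
```

The de-vigged probabilities are what a model's numbers should be compared against, and they are what a closing-line Brier score would be computed from.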

How Bookie Bullies Tracks Brier Score

The analytics/calibration.py module computes overall and per-bucket Brier scores nightly from picks_log.jsonl. Current overall Brier is ~0.23 (in the "weak" band on the scale above, just short of "good"; the recent NRFI cluster has improved this). The analytics/calibration_report.json shows per-bucket Brier and identifies which probability bands need recalibration. See the methodology page for details on the Platt-scaling recalibration applied to displayed probabilities.

Frequently Asked Questions

What is the Brier score in sports betting?

The Brier score measures how well calibrated probability predictions are to actual outcomes. It's the mean squared difference between predicted probability and outcome (0 or 1). Lower is better. The coin-flip baseline is 0.25; strong sports betting models hit 0.18-0.22; weak models sit at 0.22-0.25, and anything above 0.25 is worse than always predicting 50%.

What's a good Brier score for an MLB pick model?

Good Brier score for MLB pick models: 0.18-0.20 is strong, 0.20-0.22 is solid, 0.22-0.25 is weak (barely better than the coin-flip baseline of 0.25). Anything above 0.25 means the model is worse than random. Sharp MLB models built on sabermetric data with calibration typically hit 0.19-0.21 over full seasons.

Why is Brier score better than hit rate for evaluating a model?

Brier score is better than hit rate because hit rate alone is misleading. A model that always predicts 50% on coin flips has a 50% hit rate forever but provides no information. Brier penalizes overconfidence (predicting 80% on things that hit 65%) AND underconfidence. It's the cleanest single number for 'is this model calibrated to reality?'

How do you improve a model's Brier score?

Four main ways to improve Brier score: (1) Platt scaling: logistic regression on historical (predicted_prob, outcome) pairs to recalibrate future predictions; (2) isotonic regression, a monotonic non-parametric fit for miscalibration that is not sigmoid-shaped; (3) per-bucket recalibration for different adjustments per probability band; (4) ensemble methods combining models with different miscalibration patterns.

What Brier score does Bookie Bullies have?

Bookie Bullies' overall Brier score is approximately 0.23 (in the 'weak' band on the standard scale, just short of 'good'). The track record breakdown shows per-bucket Brier, identifying which probability bands have the strongest calibration. Platt-scaling recalibration is applied nightly to displayed probabilities to keep the model honest. See methodology for full detail.