The Headline Number
When the final envelope was opened at the 98th Academy Awards, our prediction model had correctly identified the winner in 22 out of 24 categories. That's a 91.7% accuracy rate — a number we're proud of, and one that validated months of research, data collection, and iterative model building.
But a number alone doesn't tell the story. What makes a prediction model work isn't magic or insider knowledge — it's understanding the patterns that have held across nearly a century of voting behavior, and knowing which signals matter most in which categories.
The Data That Feeds the Model
Our model ingests data from multiple sources, each contributing a different piece of the puzzle. No single data point is sufficient on its own, but together they paint a remarkably clear picture of how Academy voters are likely to behave.
Guild awards form the backbone of the model. The Screen Actors Guild (SAG), Directors Guild of America (DGA), Producers Guild of America (PGA), and Writers Guild of America (WGA) awards are voted on by industry professionals whose ranks overlap significantly with Academy membership. When the SAG ensemble winner aligns with the PGA and DGA winners, it's an extraordinarily strong signal for Best Picture. In acting categories, the SAG individual awards are the single most predictive precursor, historically matching the eventual Oscar winner over 80% of the time.
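To make that kind of precursor strength concrete, here's a minimal sketch of how it can be measured: the share of past ceremonies in which a precursor winner went on to take the Oscar. The films and history below are purely illustrative, not our actual dataset.

```python
def agreement_rate(precursor_winners, oscar_winners):
    """Fraction of past ceremonies where the precursor matched the Oscar winner."""
    matches = sum(p == o for p, o in zip(precursor_winners, oscar_winners))
    return matches / len(oscar_winners)

# Hypothetical five-year history for one acting category.
sag_winners = ["Film A", "Film B", "Film C", "Film D", "Film E"]
oscar_winners = ["Film A", "Film B", "Film X", "Film D", "Film E"]

print(f"SAG agreement: {agreement_rate(sag_winners, oscar_winners):.0%}")  # 80%
```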
BAFTA results serve as a crucial international barometer. While British voters have their own preferences, the overlap with the Academy has grown substantially in recent years as the Academy's international membership has expanded. A BAFTA win in a technical category is especially predictive.
Critic aggregations from Metacritic and Rotten Tomatoes provide a measure of critical consensus. These scores matter less in isolation but become significant when combined with audience reception data. A film with strong critical scores and robust guild support is nearly always the frontrunner.
Box office performance carries more weight than many pundits acknowledge, particularly for Best Picture. Voters watch screeners, but a film's cultural footprint — its penetration into the broader conversation — is partially reflected in its commercial performance. The model weights this signal carefully: too much box office emphasis would bias the model toward blockbusters, while too little would ignore a genuine indicator of voter awareness.
Historical voting patterns provide the contextual foundation. The model draws on decades of data to identify category-specific trends: how often the DGA winner matches the Best Director winner, the likelihood of a Best Picture nominee sweeping its technical categories, the correlation between nomination count and win probability.
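Taken together, these sources can be thought of as a feature vector for each nominee. The sketch below shows one plausible shape for that record; the field names are our illustration, not the model's published schema.

```python
from dataclasses import dataclass

@dataclass
class NomineeFeatures:
    """Illustrative per-nominee record; all field names are assumptions."""
    guild_wins: int              # SAG / DGA / PGA / WGA wins, as applicable
    bafta_win: bool              # won the corresponding BAFTA category
    critic_score: float          # 0-100 aggregate (Metacritic / Rotten Tomatoes)
    box_office_rank: int         # rank among nominees by gross
    nomination_count: int        # total Oscar nominations for the film
    precursor_match_rate: float  # historical precursor-to-Oscar agreement

frontrunner = NomineeFeatures(
    guild_wins=3, bafta_win=True, critic_score=88.0,
    box_office_rank=2, nomination_count=11, precursor_match_rate=0.82,
)
```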
Architecture: A Weighted Ensemble
Rather than building a single monolithic model, we use a weighted ensemble approach — essentially, a collection of specialized sub-models, each tuned to its category.
This design reflects a fundamental insight: different categories are driven by different signals. The factors that predict Best Picture are not the same factors that predict Best Costume Design. Acting categories are heavily influenced by guild results and narrative momentum. Technical categories lean more on BAFTA results and nomination patterns. Best Picture sits at the intersection of nearly everything.
Each sub-model produces a probability distribution across the nominees in its category. These probabilities are then calibrated against historical accuracy to produce the final confidence scores you see in the app. A confidence score of 85% means that, historically, when our model has expressed that level of confidence, the pick has been correct roughly 85% of the time.
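As a toy illustration of that general shape, the sketch below scores each nominee with a weighted linear combination of its signals and pushes the scores through a softmax to get a probability distribution. The weights and feature values are invented for the example; the real model tunes them per category.

```python
import math

def category_probabilities(nominees, weights):
    """nominees: list of feature dicts; returns one probability per nominee."""
    scores = [sum(weights[k] * n[k] for k in weights) for n in nominees]
    exps = [math.exp(s) for s in scores]  # softmax over raw scores
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical three-film Best Picture race (all numbers illustrative).
weights = {"guild": 2.0, "bafta": 1.0, "critics": 0.5, "box_office": 0.3}
nominees = [
    {"guild": 1.0, "bafta": 1.0, "critics": 0.90, "box_office": 0.6},  # guild sweep
    {"guild": 0.0, "bafta": 1.0, "critics": 0.80, "box_office": 0.9},
    {"guild": 0.0, "bafta": 0.0, "critics": 0.95, "box_office": 0.2},
]
print([f"{p:.0%}" for p in category_probabilities(nominees, weights)])
# ['84%', '12%', '4%']
```

The heavy guild weight mirrors the intuition above: the guild sweep dominates the distribution even though another nominee leads on critic scores. In the real system, these raw probabilities are then recalibrated against historical accuracy before they appear in the app.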
Confidence Calibration
One of the most important — and least glamorous — aspects of the model is confidence calibration. It's not enough to pick the right winner; the model needs to accurately communicate how certain it is about each pick.
At the 98th ceremony, our calibration held up well:
- High-confidence picks (>80%): 100% correct. When the model was strongly confident, it delivered.
- Medium-confidence picks (50–80%): Correct in most cases; both of the model's misses fell in this range.
- Lower-confidence picks (<50%): These are the races the model flagged as genuine toss-ups, and accuracy here was closer to coin-flip territory — exactly as expected.
This calibration is arguably more valuable than raw accuracy. A model that picks 24/24 but assigns 95% confidence to every pick is less useful than one that goes 22/24 but accurately flags which races are tight. You want to know where the surprises might come from, not just be told that everything is a lock.
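Checking that property is straightforward in principle: bin past picks by the confidence the model assigned, then compare each bin's average confidence to its actual hit rate. A minimal sketch, with made-up backtest data:

```python
def reliability(picks):
    """picks: list of (confidence, was_correct) pairs from past ceremonies."""
    bins = [("<50%", 0.0, 0.5), ("50-80%", 0.5, 0.8), (">80%", 0.8, 1.01)]
    for label, lo, hi in bins:
        in_bin = [(c, ok) for c, ok in picks if lo <= c < hi]
        if not in_bin:
            continue
        avg_conf = sum(c for c, _ in in_bin) / len(in_bin)
        hit_rate = sum(ok for _, ok in in_bin) / len(in_bin)
        print(f"{label}: avg confidence {avg_conf:.0%}, "
              f"accuracy {hit_rate:.0%} ({len(in_bin)} picks)")

# Illustrative backtest; a calibrated model keeps the two columns close.
reliability([(0.92, True), (0.88, True), (0.65, True), (0.60, False),
             (0.55, True), (0.45, False), (0.40, True)])
```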
What We Got Wrong
Two misses out of 24 is a strong result, but the misses are where the learning happens.
Our first miss came in a tight race in a technical category where two nominees had nearly identical precursor profiles. The guild winner and the BAFTA winner were different films, and our model sided with the guild signal — which, in this case, turned out to be the wrong call. The margin between the top two candidates in our model was less than five percentage points; a race that close is effectively a coin flip.
The second miss was a genuine upset in a below-the-line category. The eventual winner had modest precursor support but benefited from a strong narrative within the industry — the kind of word-of-mouth momentum that doesn't show up in aggregated data. This is a known blind spot: our model excels at quantifying public signals but struggles with the private conversations that sometimes drive voting in smaller categories.
Both misses were in categories where the model expressed moderate confidence, not high confidence. In other words, the model knew these races were uncertain — it just picked the wrong side of the uncertainty.
Where We Go from Here
A 91.7% accuracy rate is a strong foundation, but there's room to improve. Our roadmap for the next iteration includes:
- Expanded precursor data: Incorporating additional regional critic awards and international film festival results to capture a broader signal set
- Deeper historical modeling: Extending our training data further back and weighting recent ceremonies more heavily to capture evolving voter demographics
- Real-time odds integration: Incorporating betting market data as a complementary signal, since prediction markets aggregate information from a wide range of sources
- Narrative momentum scoring: Developing a quantitative proxy for the qualitative "buzz" that drives some categories — our biggest blind spot this year
The goal isn't perfection. In a system with thousands of voters and genuine uncertainty, some categories will always be unpredictable. The goal is to give you the clearest possible picture of the landscape heading into the ceremony — so that when the envelope is opened, you understand why the result happened, whether it was expected or not.
For a complete technical breakdown of the model architecture, data pipeline, and backtesting methodology, visit our full methodology page.