Ensemble Models in Production: How We Achieved 71% Accuracy

Deep dive into building and deploying ensemble models with XGBoost, LightGBM, and neural networks for real-world predictions.


The Plateau Problem

Our first SabiScore model—a well-tuned XGBoost classifier—hit 64% accuracy and refused to budge.

We weren’t stuck because XGBoost was bad. We were stuck because one model was being asked to explain a chaotic, noisy world alone.

The fix: ensembles.


Why Ensembles Work in the Real World

Sports outcomes are messy: noisy, partly random, and driven by many interacting factors that no single feature set captures cleanly.

No single model architecture will capture all of that. But multiple imperfect models, combined well, can.

At a high level:

Different models make different mistakes. Ensembles learn to cancel out those mistakes.
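
A toy simulation makes that concrete. The snippet below is not SabiScore code; it just generates synthetic win probabilities, adds independent noise to simulate two imperfect models, and shows that even a naive average of their estimates beats either one alone.

import numpy as np

rng = np.random.default_rng(42)

# Synthetic "true" win probabilities and the outcomes they generate
true_prob = rng.uniform(0.2, 0.8, size=10_000)
outcome = rng.binomial(1, true_prob)

# Two imperfect models: each sees the truth plus its own independent noise
model_a = np.clip(true_prob + rng.normal(0, 0.15, size=true_prob.shape), 0, 1)
model_b = np.clip(true_prob + rng.normal(0, 0.15, size=true_prob.shape), 0, 1)
blend = (model_a + model_b) / 2  # the simplest possible ensemble: an average

def brier(p):
    # Mean squared error between predicted probabilities and outcomes (lower is better)
    return np.mean((p - outcome) ** 2)

print(f"model A: {brier(model_a):.4f}")
print(f"model B: {brier(model_b):.4f}")
print(f"average: {brier(blend):.4f}")  # lower than either model alone

Because the two noise terms are independent, averaging roughly halves the noise variance, which is exactly the effect a well-built ensemble exploits.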


Our Production Ensemble Architecture

We used a stacking approach:

  1. Train multiple diverse base models (level 0)
  2. Use their predictions as features to train a simple meta-model (level 1)

In scikit-learn, the core of it looks like this:
from sklearn.ensemble import StackingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression

# Base models (Level 0)
base_models = [
    (
        "xgb",
        XGBClassifier(
            n_estimators=500,
            max_depth=4,
            learning_rate=0.01,
            subsample=0.8,
        ),
    ),
    (
        "lgbm",
        LGBMClassifier(
            n_estimators=500,
            num_leaves=31,
            learning_rate=0.01,
        ),
    ),
]

# Meta-learner (Level 1)
meta_model = LogisticRegression(
    C=0.1,
    class_weight="balanced",
)

ensemble = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5,  # the meta-learner is fit on out-of-fold base-model predictions
)

ensemble.fit(X_train, y_train)
print("Test accuracy:", ensemble.score(X_test, y_test))
# ~71% accuracy 🎯

Key points:

- The base models are deliberately diverse, so they make different kinds of mistakes.
- The meta-learner is a simple, regularized logistic regression; its job is to blend, not to re-learn the problem.
- cv=5 means the meta-learner is trained on out-of-fold predictions rather than on predictions the base models made about their own training data.


Getting the Training Pipeline Right

Ensembles are easy to get wrong in subtle ways.

1. Respect Time

Do not randomly shuffle sports data. Use time-based splits:

split_date = df["date"].quantile(0.8)
train = df[df["date"] < split_date]
valid = df[df["date"] >= split_date]
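
If you also want cross-validated evaluation that respects time order, one option is scikit-learn's TimeSeriesSplit, which gives expanding-window folds where validation data is always later than training data. This is a sketch of the idea rather than our exact setup, and it assumes df is already sorted by date.

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)

# Each fold trains on an expanding window of past matches
# and validates on the block of matches that comes after it.
for fold, (train_idx, valid_idx) in enumerate(tscv.split(df)):
    train_fold = df.iloc[train_idx]
    valid_fold = df.iloc[valid_idx]
    print(f"fold {fold}: train up to {train_fold['date'].max()}, "
          f"validate from {valid_fold['date'].min()}")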

2. Avoid Leakage Between Levels

When training the meta-learner, the base-model predictions you feed it must be out-of-fold predictions only.

from sklearn.base import clone
from sklearn.model_selection import KFold
import numpy as np

kf = KFold(n_splits=5, shuffle=False)  # no shuffling: keep the rows in time order

# One column of out-of-fold predictions per base model
oof_preds = np.zeros((len(X_train), len(base_models)))

for fold, (train_idx, valid_idx) in enumerate(kf.split(X_train)):
    X_tr, X_val = X_train[train_idx], X_train[valid_idx]
    y_tr = y_train[train_idx]

    for m_idx, (name, model) in enumerate(base_models):
        fold_model = clone(model)  # fresh, unfitted copy for this fold
        fold_model.fit(X_tr, y_tr)
        oof_preds[valid_idx, m_idx] = fold_model.predict_proba(X_val)[:, 1]

# Train meta-learner on oof_preds
meta_model.fit(oof_preds, y_train)

This ensures the meta-model only sees predictions on data the base models haven’t trained on.


Deployment: Keeping Ensembles Practical

More models = more complexity. Here’s how we kept it sane in production.

1. Single Artifact, Not a Zoo

We export a single ensemble object that encapsulates:

- the preprocessing pipeline
- the fitted base models
- the meta-learner

import numpy as np


class EnsembleModel:
    def __init__(self, base_models, meta_model, preprocessor):
        self.base_models = base_models
        self.meta_model = meta_model
        self.preprocessor = preprocessor

    def predict_proba(self, raw_features: dict) -> list[float]:
        X = self.preprocessor.transform(raw_features)
        base_probs = [m.predict_proba(X)[:, 1] for m in self.base_models]
        stacked = np.vstack(base_probs).T  # shape: (n_samples, n_models)
        return self.meta_model.predict_proba(stacked)[0].tolist()

In FastAPI, we just load this EnsembleModel from disk.
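
Concretely, "single artifact" just means serializing that one object and loading it once at application startup. Here is a minimal sketch using joblib and a FastAPI startup hook; the path and names are illustrative, not our exact setup.

import joblib
from fastapi import FastAPI

# --- training side: persist the fitted EnsembleModel as one file ---
# joblib.dump(trained_ensemble, "models/ensemble_v3.joblib")

# --- serving side: load the single artifact once when the API starts ---
app = FastAPI()
ensemble_model: EnsembleModel | None = None

@app.on_event("startup")
def load_ensemble() -> None:
    global ensemble_model
    ensemble_model = joblib.load("models/ensemble_v3.joblib")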

2. Cache Hot Paths

Combine ensembles with caching and you get ensemble-level accuracy without paying ensemble-level latency on every request:

async def get_prediction(match_id: str) -> Prediction:
    key = f"pred:{match_id}"
    cached = await redis.get(key)
    if cached:
        return Prediction.parse_raw(cached)

    features = await feature_store.build_for_match(match_id)
    probs = ensemble.predict_proba(features)

    prediction = build_prediction_response(match_id, probs, features)
    await redis.setex(key, 3600, prediction.json())  # cache for one hour
    return prediction

In production SabiScore traffic, this led to a 73% cache hit rate.


Explainability: Turning Black Boxes into Stories

Users don’t care about log-loss. They care about "Why should I trust this prediction?"

We use feature importances + domain-specific text snippets to build readable explanations.

FEATURE_EXPLANATIONS = {
    "home_form_5": "Home team has strong recent form ({value:.1f} pts/game)",
    "away_form_5": "Away team has weak recent form ({value:.1f} pts/game)",
    "market_home_win_prob": "Betting markets rate home win at {value:.0%}",
}

def generate_reasoning(features: dict, top_k: int = 3) -> list[str]:
    importances = ensemble_feature_importances()  # precomputed
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    explanations: list[str] = []
    for fname, _ in ranked:
        if fname in FEATURE_EXPLANATIONS:
            template = FEATURE_EXPLANATIONS[fname]
            explanations.append(template.format(value=features[fname]))
    return explanations

A typical reasoning block (values illustrative):

- "Home team has strong recent form (2.2 pts/game)"
- "Betting markets rate home win at 58%"


Results

Moving from a tuned single XGBoost to this ensemble gave us:

| Metric | Single Model | Ensemble |
|--------|--------------|----------|
| Accuracy | 64% | 71% |
| Brier score | 0.19 | 0.15 |
| Calibration | Overconfident | Well-calibrated |
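
For context on the Brier score: it is the mean squared error between predicted probabilities and actual outcomes, so lower is better, and always guessing 50% scores 0.25. A quick way to compute it, assuming the binary home-win framing used in the snippets above:

from sklearn.metrics import brier_score_loss

# Predicted home-win probability from the ensemble vs. what actually happened
p_home = ensemble.predict_proba(X_test)[:, 1]
print("Brier score:", brier_score_loss(y_test, p_home))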

More importantly, the ensemble's probabilities were no longer overconfident: a well-calibrated 70% means roughly 7 out of 10 such predictions come true, which is what ultimately earns user trust.


When Not to Use Ensembles

Don’t reach for ensembles when:

- a single, well-tuned model already meets your accuracy target
- your latency or memory budget can't absorb multiple inference passes
- you haven't yet exhausted feature work and tuning on the model you have

Do reach for them when:

- a single model has plateaued despite careful tuning
- you can train genuinely diverse models that make different mistakes
- the accuracy and calibration gains are worth the extra operational complexity


What to Read Next

If you enjoyed this, you’ll probably like:

And if you’re planning an ensemble in production and want a second set of eyes on your design, get in touch.