An end-to-end look at what makes ML systems truly production-ready, from architecture patterns to monitoring and maintenance.
Here's a truth bomb: 90% of ML models never make it to production. Not because the models are bad, but because the infrastructure around them is fragile, unmaintainable, or simply non-existent.
I learned this the hard way. My first "production" ML model crashed within 2 hours of deployment. The model was fine—it was everything else that failed.
After shipping SabiScore to 350+ users with 71% prediction accuracy and sub-100ms latency, I've distilled what actually matters when building ML systems that survive contact with real users.
Production-ready isn't about model accuracy alone. It's about:
| Dimension | Question to Ask |
|-----------|-----------------|
| Reliability | Does it stay up when traffic spikes? |
| Latency | Can it respond in <100ms? |
| Observability | Do you know when it's failing? |
| Maintainability | Can someone else debug it at 3 AM? |
| Cost Efficiency | Is it burning money unnecessarily? |
If you can't answer "yes" to all five, you're not production-ready.
| | Batch Processing | Real-Time Inference |
|---|---|---|
| Use when | Predictions can wait (hours/days) | Predictions must be instant |
| Examples | Recommendation systems, fraud detection | Sports predictions, pricing, chat |
| ✅ Pros | Cheaper, simpler, more reliable | Fresh predictions, interactive |
| ❌ Cons | Stale predictions, no real-time feedback | Complex, expensive, latency-sensitive |
SabiScore uses real-time inference because users need predictions before matches start, not after. But we batch our feature engineering (runs every 6 hours) to reduce compute costs.
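To make that split concrete, here's a minimal sketch (not SabiScore's actual pipeline) using APScheduler, where `build_features()` is a hypothetical helper that recomputes features in bulk and persists them for the real-time path to look up:

```python
from apscheduler.schedulers.asyncio import AsyncIOScheduler

scheduler = AsyncIOScheduler()

@scheduler.scheduled_job("interval", hours=6)
async def refresh_features():
    # One batch pass recomputes rolling form, head-to-head stats, etc.,
    # so the real-time path only has to do a cheap feature lookup.
    await build_features()  # hypothetical helper that persists features

# Call scheduler.start() from your app's startup hook so an event loop is running.
```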
The other big lever was caching. This single pattern reduced our compute costs by 73%:
# Redis caching for hot predictions
async def get_prediction(match_id: str) -> Prediction:
    # Check cache first
    cached = await redis.get(f"pred:{match_id}")
    if cached:
        return Prediction.parse_raw(cached)

    # Cache miss - compute prediction
    prediction = await compute_prediction(match_id)

    # Cache for 1 hour (predictions don't change often)
    await redis.setex(f"pred:{match_id}", 3600, prediction.json())
    return prediction
Result: 73% cache hit rate, 87ms average latency (down from 450ms).
The most common mistake I see:
# ❌ BAD: Loading model on every request
@app.post("/predict")
def predict(data: InputData):
    model = load_model("model.pkl")  # 500ms+ every time!
    return model.predict(data)

# ✅ GOOD: Load once at startup
model = None

@app.on_event("startup")
async def load_model_once():
    global model
    model = load_model("model.pkl")

@app.post("/predict")
def predict(data: InputData):
    return model.predict(data)  # <10ms
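One hedged side note: `@app.on_event("startup")` still works, but recent FastAPI releases steer you toward the lifespan context manager for the same pattern. A rough sketch (not SabiScore's exact code):

```python
from contextlib import asynccontextmanager
from fastapi import FastAPI

models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model once, before the app starts serving requests
    models["predictor"] = load_model("model.pkl")
    yield
    models.clear()  # release resources on shutdown

app = FastAPI(lifespan=lifespan)

@app.post("/predict")
def predict(data: InputData):
    return models["predictor"].predict(data)
```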
Single models plateau. For SabiScore, XGBoost alone maxed out at 64% accuracy. Here's what worked:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Three diverse base models
base_models = [
    ('xgb', XGBClassifier(n_estimators=500, max_depth=4)),
    ('lgbm', LGBMClassifier(n_estimators=500, num_leaves=31)),
    ('nn', MLPClassifier(hidden_layer_sizes=(64, 32)))
]

# Meta-learner combines their predictions
ensemble = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5
)

# Result: 71% accuracy (a 7-point improvement!)
Why this works: Each model captures different patterns. XGBoost handles structured data well, LightGBM is great with imbalanced classes (draws are rare), and the neural network captures subtle interactions.
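If you want to verify the lift before trusting it, a quick cross-validated comparison is enough. This sketch assumes you already have a feature matrix `X` and labels `y` built:

```python
from sklearn.model_selection import cross_val_score

# Same folds, same metric: single model vs. the stack
xgb_only = cross_val_score(base_models[0][1], X, y, cv=5, scoring="accuracy")
stacked = cross_val_score(ensemble, X, y, cv=5, scoring="accuracy")

print(f"XGBoost alone:    {xgb_only.mean():.3f} ± {xgb_only.std():.3f}")
print(f"Stacked ensemble: {stacked.mean():.3f} ± {stacked.std():.3f}")
```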
Your model will degrade. It's not a question of if, but when. Here's what to monitor:
from prometheus_client import Histogram

prediction_distribution = Histogram(
    'prediction_confidence',
    'Distribution of prediction confidence scores',
    buckets=[0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)

@app.post("/predict")
async def predict(data: InputData):
    result = model.predict_proba(data)
    # Confidence = probability assigned to the most likely outcome
    prediction_distribution.observe(float(result.max()))
    return result.tolist()
Alert when: Average confidence drops below historical baseline.
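What "drops below baseline" can mean in code: keep a rolling window of recent confidences and compare its average against a stored baseline. A minimal sketch, where the window size and threshold numbers are illustrative, not SabiScore's real values:

```python
from collections import deque

recent_confidences = deque(maxlen=500)   # rolling window of the last 500 predictions
BASELINE_CONFIDENCE = 0.72               # illustrative baseline, recomputed periodically
DRIFT_THRESHOLD = 0.05                   # alert if we drop 5 points below baseline

def confidence_drifted() -> bool:
    if len(recent_confidences) < recent_confidences.maxlen:
        return False  # not enough data yet
    avg = sum(recent_confidences) / len(recent_confidences)
    return (BASELINE_CONFIDENCE - avg) > DRIFT_THRESHOLD
```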
prediction_latency = Histogram(
    'prediction_latency_seconds',
    'Time to generate prediction',
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0]
)
Alert when: p95 latency exceeds 200ms.
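The prometheus_client Histogram can record the timing for you via its `.time()` context manager. Here's how that might wrap the cached `get_prediction()` from earlier, assuming `InputData` exposes a `match_id`:

```python
@app.post("/predict")
async def predict(data: InputData):
    # Everything inside the block is recorded, in seconds, into the histogram
    with prediction_latency.time():
        return await get_prediction(data.match_id)
```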
from prometheus_client import Counter

prediction_errors = Counter(
    'prediction_errors_total',
    'Prediction errors by type',
    ['error_type']
)
Alert when: Error rate exceeds 1%.
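The counter only helps if something increments it. A sketch of wiring it into the endpoint, where the label values are my own illustrative choices:

```python
@app.post("/predict")
async def predict(data: InputData):
    try:
        return await get_prediction(data.match_id)
    except TimeoutError:
        prediction_errors.labels(error_type="timeout").inc()
        raise
    except Exception:
        prediction_errors.labels(error_type="inference_failure").inc()
        raise
```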
Before you ship, go back through the five questions at the top of this post and make sure each one is a genuine yes, and that every alert above actually fires when you force a failure.
Running ML in Nigeria means every dollar counts. Here's how SabiScore runs on $147/month:
| Resource | Choice | Monthly Cost |
|----------|--------|--------------|
| Compute | DigitalOcean 2vCPU/4GB | $48 |
| Database | Managed PostgreSQL | $15 |
| Redis | Self-hosted on same droplet | $0 |
| CDN | Cloudflare Free | $0 |
| Monitoring | Grafana Cloud Free | $0 |
| Domain | Namecheap | $12/year |
| Total | | $147/month |
Key insight: Start with DigitalOcean, not AWS. You don't need Kubernetes until you have 10,000+ users.
Those are the biggest lessons from 8 months of running SabiScore in production.
In the next post, I'll dive deep into building ensemble models for production, including the exact hyperparameters and training pipeline we use at SabiScore.
Questions? Reach out on Twitter or get in touch.
Found this helpful? Share it with someone building production ML systems.