Meta Description: Learn the exact MLOps strategies I use to maintain 99.9% uptime for my ML system serving 350+ users. Covers monitoring, deployment, incident response, and cost optimization.
Slug: /mlops-playbook-999-uptime-production-ml-systems
Reading Time: 10 min

Here's what happened at 2:47 AM on a Tuesday in July 2024:
My phone explodes with notifications. The ML prediction API is down. 350+ users can't get predictions. My Discord is on fire. My email inbox is filling up faster than I can read.
Downtime duration: 23 minutes
Users affected: 350+
Revenue lost: ~$40
Trust lost: Incalculable
The irony? My ML model had 72% accuracy that week—the highest ever. But nobody cares about accuracy when your system is down.
This incident taught me a brutal lesson: In production ML, reliability > accuracy.
Fast forward to today: the system has maintained 99.9% uptime for 12 consecutive months. A 99.9% target allows roughly 8.8 hours of downtime per year; I used only 43 minutes of it, all planned maintenance.
This post is the complete playbook I wish I had when I started. No fluff. Just battle-tested strategies from keeping a real ML system running 24/7.
After one year and 350+ production users, I've distilled everything into 5 pillars:
1. Architecture that assumes failure
2. Monitoring everything that matters
3. Safe, boring deployments
4. Disciplined incident response
5. Cost optimization that doesn't sacrifice reliability
Let's dive into each one.
Rule #1: Assume everything will fail. Design accordingly.
              [Cloudflare CDN]
                     |
            [Load Balancer (NGINX)]
               /              \
   [API Instance 1]      [API Instance 2]
          |                     |
   [Redis Primary]       [Redis Replica]
          |                     |
[PostgreSQL Primary]  ->  [PostgreSQL Standby]
Key Design Decisions:
Decision #1: Two API instances behind a load balancer

Why it matters: If one API instance crashes, the other handles traffic seamlessly.
My setup:
# NGINX load balancer config
upstream api_backend {
    server api1.internal:8000 max_fails=3 fail_timeout=30s;
    server api2.internal:8000 max_fails=3 fail_timeout=30s;

    # Reuse idle connections to the upstream servers
    keepalive 32;
}

server {
    listen 443 ssl http2;
    server_name api.sabiscore.com;

    location / {
        proxy_pass http://api_backend;
        # Required for upstream keepalive to take effect
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        # On errors or timeouts, retry the request on the other instance
        proxy_next_upstream error timeout http_500 http_502 http_503;
        proxy_connect_timeout 2s;
        proxy_send_timeout 10s;
        proxy_read_timeout 10s;
    }

    # Health check endpoint
    location /health {
        access_log off;
        return 200 "healthy\n";
    }
}
Cost: $48/month for 2 instances (vs. $24 for 1)
Downtime prevented: the redundancy has saved me 6 times in the past year.
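Before trusting the failover, it's worth drilling it: hammer a proxied endpoint while restarting one instance and count failures. A minimal sketch (the path, method, and timings are placeholders, not my production values):

```python
# failover_drill.py - poll the API while one instance is restarted,
# then report how many requests failed during the drill.
import time
import requests

# Placeholder URL: use any endpoint that is proxied to the app instances
API_URL = "https://api.sabiscore.com/api/predict/some-game-id"

def run_drill(duration_s: int = 60, interval_s: float = 1.0) -> None:
    ok, failed = 0, 0
    deadline = time.time() + duration_s
    while time.time() < deadline:
        try:
            r = requests.get(API_URL, timeout=2)
            if r.status_code == 200:
                ok += 1
            else:
                failed += 1
        except requests.RequestException:
            failed += 1
        time.sleep(interval_s)
    print(f"{ok}/{ok + failed} requests succeeded ({failed} failures)")

if __name__ == "__main__":
    run_drill()
```

Restart api1 mid-drill; with the config above you should see zero or near-zero failures, because NGINX retries the request against the healthy instance.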
Decision #2: PostgreSQL streaming replication with a hot standby

Why it matters: Primary database crashes? Standby takes over in under 30 seconds.
PostgreSQL Streaming Replication Setup:
# On primary database
# postgresql.conf
wal_level = replica
max_wal_senders = 3
wal_keep_size = 1GB

# pg_hba.conf (allow the replication user to connect from the standby)
host replication replicator standby_ip/32 md5

# On standby database
# recovery.conf (PostgreSQL 11 and older; on 12+ drop standby_mode,
# create an empty standby.signal file, and put primary_conninfo in postgresql.conf)
standby_mode = 'on'
primary_conninfo = 'host=primary_ip port=5432 user=replicator password=xxx'
trigger_file = '/tmp/postgresql.trigger.5432'
Failover process (automated): when the primary stops responding, the standby is promoted (with the manual setup above, by creating the trigger file).
In practice I use DigitalOcean Managed PostgreSQL, which handles detection, promotion, and reconnection automatically for $15/month.
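If you do run your own standby, keep an eye on replication lag; failing over to a stale replica loses recent writes. A minimal sketch using psycopg2 (the DSN is a placeholder):

```python
# replication_lag.py - report how far the standby is behind the primary.
# Connects to the *standby*; credentials below are placeholders.
import psycopg2

def replication_lag_seconds(dsn: str = "host=standby_ip dbname=postgres user=monitor") -> float:
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # On a standby, pg_last_xact_replay_timestamp() is the commit time
            # of the last transaction replayed from the primary's WAL.
            cur.execute("""
                SELECT CASE WHEN pg_is_in_recovery()
                            THEN EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
                            ELSE 0 END
            """)
            value = cur.fetchone()[0]
            return float(value or 0.0)

if __name__ == "__main__":
    print(f"standby is {replication_lag_seconds():.1f}s behind the primary")
```

Alert if the lag grows past a few seconds; a stale standby defeats the point of having one.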
Decision #3: Redis caching with a replica

Why it matters: ~90% of requests hit the cache. If Redis goes down, API latency explodes from 87ms to 800ms.
Redis persistence strategy:
# redis.conf
# RDB snapshots (backup every 5 minutes if >=1 key changed)
save 300 1
# AOF (append-only file for durability)
appendonly yes
appendfsync everysec
# Replication for high availability
replicaof redis-primary.internal 6379
replica-read-only yes
Cache invalidation strategy:
# Smart cache invalidation (prevents stale predictions)
import json
import redis
from datetime import datetime

redis_client = redis.Redis(host='localhost', decode_responses=True)

def cache_prediction(game_id, prediction, expiry_minutes=60):
    """Cache a prediction with a TTL based on how close the game is."""
    game_time = get_game_start_time(game_id)  # defined elsewhere in the app
    time_until_game = (game_time - datetime.now()).total_seconds() / 60

    # Cache expires 5 minutes before the game starts (to account for late news),
    # with a floor of 5 minutes and a ceiling of expiry_minutes
    ttl = min(expiry_minutes, max(5, time_until_game - 5)) * 60

    redis_client.setex(
        f"prediction:{game_id}",
        int(ttl),
        json.dumps(prediction),
    )
Result: Cache hit rate of 89% → saves database queries + improves latency
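The snippet above is the write path. For completeness, here's a sketch of the read path: cache-aside with graceful fallback when Redis is unavailable (which is exactly the 87ms → 800ms scenario). `run_model_prediction` is a stand-in for the real model call:

```python
# Cache-aside read path: try Redis first, fall back to the model if the
# key is missing or Redis itself is down.
import json
import redis

redis_client = redis.Redis(host='localhost', decode_responses=True)  # same client as above

def get_prediction(game_id):
    try:
        cached = redis_client.get(f"prediction:{game_id}")
        if cached is not None:
            return json.loads(cached)           # ~90% of requests end here
    except redis.RedisError:
        pass                                    # Redis down: degrade gracefully

    prediction = run_model_prediction(game_id)  # stand-in for the actual model call
    try:
        cache_prediction(game_id, prediction)   # write path from above
    except redis.RedisError:
        pass                                    # serving the prediction beats caching it
    return prediction
```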
Decision #4: Cloudflare in front of everything

Why it matters: One DDoS attack can take down your site. Cloudflare blocks it before it reaches your servers.
Free tier includes: global CDN, unmetered DDoS mitigation, free SSL certificates, and basic WAF/rate-limiting rules.
Rate limiting rules I use:
// Cloudflare rate limiting rule (prevents API abuse)
// Match expression:
(http.request.uri.path contains "/api/predict")
// Rate: 100 requests per 10 minutes, counted per IP
// Action when exceeded: Block for 30 minutes
Cost: $0 (free tier is sufficient for my scale)
Attacks blocked in past year: 12 (would've caused downtime without Cloudflare)
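Cloudflare is the first line of defence, but I'd still rate-limit inside the app in case traffic reaches the origin directly. A minimal fixed-window limiter on the Redis instance already in the stack (the limits here are illustrative):

```python
# Fixed-window rate limiter backed by Redis: at most `limit` requests
# per `window_s` seconds for a given client IP.
import redis

redis_client = redis.Redis(host='localhost', decode_responses=True)

def allow_request(client_ip: str, limit: int = 100, window_s: int = 600) -> bool:
    key = f"ratelimit:{client_ip}"
    count = redis_client.incr(key)          # atomically count this request
    if count == 1:
        redis_client.expire(key, window_s)  # start the window on the first hit
    return count <= limit
```

In FastAPI this slots into a middleware or dependency that returns 429 when `allow_request()` is False.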
Rule #2: If you're not monitoring it, it will break at 3 AM.
| Tool | Purpose | Cost |
|------|---------|------|
| UptimeRobot | HTTP endpoint monitoring | Free (50 monitors) |
| Sentry | Error tracking & alerts | Free (under 10K errors/month) |
| Prometheus + Grafana | Custom metrics | Self-hosted (free) |
| PostgreSQL Stats | Database performance | Built-in (free) |
| CloudWatch | Infrastructure metrics | AWS free tier |
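Prometheus + Grafana cover the custom metrics. Instrumentation is a few lines with the `prometheus_client` library; the metric names below are my own choices, not anything prescribed:

```python
# metrics.py - expose prediction counters/latency for Prometheus to scrape.
from prometheus_client import Counter, Histogram, make_asgi_app

PREDICTIONS = Counter(
    "predictions_total", "Predictions served", ["outcome"]  # outcome: ok / error
)
PREDICTION_LATENCY = Histogram(
    "prediction_latency_seconds", "Time spent generating a prediction"
)

# Mount the /metrics endpoint on the existing FastAPI app:
#   app.mount("/metrics", make_asgi_app())

def predict_with_metrics(game_id):
    with PREDICTION_LATENCY.time():           # records duration into the histogram
        try:
            result = get_prediction(game_id)  # e.g. the cached read path sketched earlier
            PREDICTIONS.labels(outcome="ok").inc()
            return result
        except Exception:
            PREDICTIONS.labels(outcome="error").inc()
            raise
```

And the health check endpoint below is what UptimeRobot polls: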
# health_check.py
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from datetime import datetime
import psutil

app = FastAPI()

# db, redis_client and model_registry are initialized elsewhere in the app;
# they're referenced here to keep the example focused on the checks themselves.

@app.get("/health")
async def health_check():
    """Comprehensive health check endpoint"""
    checks = {
        "status": "healthy",
        "timestamp": datetime.utcnow().isoformat(),
        "checks": {}
    }

    # Check 1: Database connectivity
    try:
        db.execute("SELECT 1")
        checks["checks"]["database"] = "ok"
    except Exception as e:
        checks["status"] = "unhealthy"
        checks["checks"]["database"] = f"error: {str(e)}"

    # Check 2: Redis connectivity
    try:
        redis_client.ping()
        checks["checks"]["redis"] = "ok"
    except Exception as e:
        checks["status"] = "unhealthy"
        checks["checks"]["redis"] = f"error: {str(e)}"

    # Check 3: ML model loaded
    if model_registry.active_model is None:
        checks["status"] = "unhealthy"
        checks["checks"]["model"] = "not loaded"
    else:
        checks["checks"]["model"] = "ok"

    # Check 4: System resources (never downgrade an "unhealthy" status to "degraded")
    cpu_percent = psutil.cpu_percent(interval=1)
    memory_percent = psutil.virtual_memory().percent

    if cpu_percent > 90:
        if checks["status"] == "healthy":
            checks["status"] = "degraded"
        checks["checks"]["cpu"] = f"high: {cpu_percent}%"
    else:
        checks["checks"]["cpu"] = f"ok: {cpu_percent}%"

    if memory_percent > 90:
        if checks["status"] == "healthy":
            checks["status"] = "degraded"
        checks["checks"]["memory"] = f"high: {memory_percent}%"
    else:
        checks["checks"]["memory"] = f"ok: {memory_percent}%"

    # Check 5: Disk space
    disk_percent = psutil.disk_usage('/').percent
    if disk_percent > 85:
        if checks["status"] == "healthy":
            checks["status"] = "warning"
        checks["checks"]["disk"] = f"high: {disk_percent}%"
    else:
        checks["checks"]["disk"] = f"ok: {disk_percent}%"

    # Return 200 for healthy/degraded/warning, 503 for unhealthy
    status_code = 200 if checks["status"] != "unhealthy" else 503
    return JSONResponse(content=checks, status_code=status_code)
UptimeRobot setup:
- Poll the /health endpoint every 5 minutes and alert when it stops returning 200

This is where most ML teams fail. They monitor infrastructure but not the actual model.
# model_monitor.py
import pandas as pd
from datetime import datetime, timedelta

class ModelPerformanceMonitor:
    # (the check_* helper methods are defined elsewhere in the class)

    def run_diagnosis(self):
        """Run automated diagnostic checks"""
        # Check 1: Are servers responding?
        health_status = self.check_health_endpoints()
        # Check 2: Database connectivity?
        db_status = self.check_database()
        # Check 3: Redis availability?
        cache_status = self.check_redis()
        # Check 4: Recent deployments?
        recent_deploys = self.check_deployment_history(hours=2)
        # Check 5: Traffic spike?
        traffic_anomaly = self.detect_traffic_anomaly()
        # Check 6: Error logs?
        recent_errors = self.analyze_error_logs(minutes=30)

        # Generate diagnosis report
        return {
            "health": health_status,
            "database": db_status,
            "cache": cache_status,
            "recent_deploys": recent_deploys,
            "traffic_anomaly": traffic_anomaly,
            "errors": recent_errors,
        }
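When an uptime alert fires at 3 AM, the first thing I want is that report in front of me rather than a terminal full of ad-hoc commands. A minimal way to wire it up (how you deliver the report is up to you):

```python
# Run the automated diagnosis as soon as an uptime alert fires and dump
# the report somewhere a sleepy human can read it.
import json

monitor = ModelPerformanceMonitor()

def on_uptime_alert() -> None:
    report = monitor.run_diagnosis()
    print(json.dumps(report, indent=2, default=str))  # or push it to Discord/Slack
```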
Phase 4: Resolution
Phase 5: Post-Mortem (Within 48 hours)
Template I use:
# Incident Post-Mortem: [Brief Description]
**Date:** YYYY-MM-DD
**Duration:** XX minutes
**Severity:** PX
**Users Affected:** XXX
## What Happened?
[Timeline of events]
## Root Cause
[Technical explanation]
## Why It Wasn't Caught Earlier
[Monitoring gaps]
## Resolution
[What fixed it]
## Action Items
- [ ] Improve monitoring for [X]
- [ ] Add automated test for [Y]
- [ ] Update runbook with [Z]
## Prevention
[How we'll prevent this in future]
Example: the 23-minute outage from the intro
What happened: At 2:47 AM, the prediction API stopped responding; 350+ users were locked out for 23 minutes.
Root cause: The server ran out of disk space (the "ignoring disk space" lesson below).
Resolution: Freed up disk space and brought the API back online.
Prevention measures implemented: The 85% disk threshold in the health check above, plus alerts that trigger cleanup before the disk fills again.
Cost of prevention: $0 (just configuration)
Incidents prevented since: 4 (based on alerts that triggered cleanup)
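The cleanup itself doesn't need to be clever; a small watchdog that prunes old rotated logs when usage crosses the threshold covers most of it. A sketch (the log path and `send_alert` are stand-ins, not my actual setup):

```python
# Disk watchdog: when usage crosses the threshold, delete the oldest rotated
# logs and send an alert so a human knows it happened.
import os
import glob
import psutil

LOG_GLOB = "/var/log/myapp/*.log.*"   # illustrative path
THRESHOLD_PERCENT = 85

def cleanup_if_needed() -> None:
    usage = psutil.disk_usage("/").percent
    if usage < THRESHOLD_PERCENT:
        return
    # Delete the oldest rotated logs first until we're back under the threshold
    for path in sorted(glob.glob(LOG_GLOB), key=os.path.getmtime):
        os.remove(path)
        if psutil.disk_usage("/").percent < THRESHOLD_PERCENT:
            break
    send_alert(f"Disk hit {usage}%, cleaned up old logs")  # alerting is app-specific

if __name__ == "__main__":
    cleanup_if_needed()  # run from cron every few minutes
```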
Rule #5: 99.9% uptime doesn't require enterprise pricing.
| Component | Service | Cost | Uptime Contribution |
|-----------|---------|------|---------------------|
| API Servers (2x) | DigitalOcean Droplets | $48 | Redundancy prevents single-instance failures |
| Database | DigitalOcean Managed PostgreSQL | $15 | Auto-failover prevents DB outages |
| Caching | Redis Cloud | $5 | Reduces load on primary DB |
| CDN/DDoS | Cloudflare | $0 | Blocks attacks that would cause downtime |
| Monitoring | UptimeRobot + Sentry | $0 | Early detection prevents prolonged outages |
| Backups | DigitalOcean Snapshots | $2 | Disaster recovery |
| **Total** | — | **$70/month** | — |
For comparison, an equivalent managed or enterprise MLOps stack easily runs into several hundred dollars a month. Here's how I keep the bill at $70:
1. Use Managed Services for Critical Components (the $15 managed PostgreSQL handles failover better than I would)
2. Self-Host Non-Critical Services (Prometheus + Grafana cost nothing to run yourself)
3. Aggressive Caching (an 89% hit rate keeps database and compute needs small)
4. Cloudflare Free Tier is Amazing (CDN + DDoS protection for $0)
5. Spot Instances for Batch Jobs (interruptible batch work doesn't need on-demand pricing)
Result: 99.9% uptime for $70/month
What worked:
✅ Redundancy everywhere: Eliminating every single point of failure saved me multiple outages
✅ Proactive monitoring: 80% of issues caught before users noticed
✅ Automated health checks: Detected problems at 3 AM when I was asleep
✅ Post-mortems: Each incident made system more resilient
✅ Managed services: $15/month PostgreSQL prevented countless self-inflicted wounds
What didn't:
❌ Manual deployments: Caused 2 outages due to human error
❌ Weekend deploys: Murphy's Law is real. Only deploy Mon-Thu.
❌ Ignoring disk space: Caused my worst outage (23 minutes)
❌ Skipping smoke tests: Deployed bad model once, caught in production
❌ Complex orchestration: Kubernetes was overkill and increased failure modes
What surprised me:
🔍 Most outages are self-inflicted
🔍 Latency matters more than uptime
🔍 Caching solves 80% of scale problems
Use the architecture and monitoring setup above as your starting point.
Can't afford enterprise tools? Neither could I. My free/cheap stack: UptimeRobot and Sentry for alerting, self-hosted Prometheus + Grafana for metrics, and Cloudflare's free tier in front of everything.
GitHub repo with my full monitoring setup:
github.com/scardubu/ml-monitoring-stack ⭐ Star if useful!
Building production ML systems? I offer:
MLOps Architecture Review ($200/hour)
Full MLOps Consulting (Custom pricing)
Mentorship for ML Engineers ($100/month)
What's your ML system's uptime? What's your biggest production challenge?
Drop a comment below or reach out:
Share this post if it helped:
Tweet | LinkedIn
Last updated: December 2025 | Reading time: 10 minutes