How I took a real ML API from constant outages and angry users to 99.9% uptime in a month, and the exact playbook you can steal.
In 2024 I worked with a team that had what most ML engineers would call a good problem:
But in production, their ML API was a mess:
The founder's summary was brutal:
"We don't have an ML problem. We have a reliability problem."
This post is the exact 4-week playbook we used to go from constant outages to 99.9% uptime, without rewriting the entire system or spending enterprise money.
If you're running an ML API that keeps breaking at the worst possible time, steal this.
Before changing infrastructure, we needed to know what was actually happening.
We added three things:
Uptime monitoring (UptimeRobot), hitting /health every 10 minutes
Error tracking (Sentry)
Latency + error metrics (Prometheus + Grafana), wired up roughly as sketched below
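For reference, here is roughly what the API-side wiring can look like. This is a minimal sketch, assuming a FastAPI app plus the prometheus_client and sentry_sdk libraries; the endpoint paths, metric names, and DSN placeholder are illustrative, not the team's exact code.

# Minimal observability wiring: health endpoint, Sentry, Prometheus metrics.
# Assumes FastAPI + prometheus_client + sentry_sdk; all names are illustrative.
import time

import sentry_sdk
from fastapi import FastAPI, Request, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

sentry_sdk.init(dsn="https://<public-key>@<org>.ingest.sentry.io/<project-id>")  # error tracking

app = FastAPI()

REQUEST_LATENCY = Histogram("api_request_latency_seconds", "Request latency", ["path"])
REQUEST_ERRORS = Counter("api_request_errors_total", "5xx responses", ["path"])

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    # Time every request and count server-side failures.
    start = time.perf_counter()
    response = await call_next(request)
    REQUEST_LATENCY.labels(path=request.url.path).observe(time.perf_counter() - start)
    if response.status_code >= 500:
        REQUEST_ERRORS.labels(path=request.url.path).inc()
    return response

@app.get("/health")
def health():
    # What UptimeRobot pings; keep it cheap and dependency-free.
    return {"status": "ok"}

@app.get("/metrics")
def metrics():
    # What Prometheus scrapes.
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

The point isn't this exact snippet: it's that /health stays cheap and dependency-free, while /metrics exposes the latency and error numbers you'll want in Grafana.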
Why this matters:
After 7 days, the picture was clear:
Patterns we saw:
Key insight: most of the "random" incidents had boring, repeatable causes.
Before:
[Clients] → [API Server] → [Postgres]
                └→ [ML Model in Process]
If that single server died, everything died.
After:
[Clients]
  → [Cloudflare]
    → [NGINX Load Balancer]
      → [API 1]
      → [API 2]
        → [Redis Cache]
        → [Postgres Primary] → [Postgres Standby]
Changes we made:
This alone doesn't give you 99.9% uptime, but it removes the most obvious single points of failure.
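The Redis box in the diagram deserves one concrete example. A common way to use it in front of an ML endpoint is caching predictions for repeated feature payloads. Here's a minimal sketch assuming the redis-py client; the key scheme, TTL, and feature ordering are illustrative assumptions, not the team's actual design.

# Hypothetical sketch: cache predictions in Redis so repeated requests skip the model,
# and so one API instance can reuse work done by another. Names/values are illustrative.
import hashlib
import json

import redis

r = redis.Redis(host="redis.internal", port=6379, decode_responses=True)

CACHE_TTL_SECONDS = 300  # tolerate slightly stale predictions for 5 minutes

def cached_predict(model, features: dict):
    # Stable cache key: hash of the sorted feature payload.
    key = "pred:" + hashlib.sha256(json.dumps(features, sort_keys=True).encode()).hexdigest()
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    # Assumes the dict's value order matches the model's expected feature order.
    prediction = float(model.predict([list(features.values())])[0])
    r.set(key, json.dumps(prediction), ex=CACHE_TTL_SECONDS)
    return prediction

A short TTL keeps answers reasonably fresh while still absorbing repeated traffic spikes.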
Downtime isn't just "the server is dead." It's also:
We tightened up the edge using NGINX:
upstream api_backend {
    server api1.internal:8000 max_fails=3 fail_timeout=30s;
    server api2.internal:8000 max_fails=3 fail_timeout=30s;
    keepalive 32;
}

server {
    listen 443 ssl http2;
    server_name api.example.com;

    location / {
        proxy_pass http://api_backend;
        proxy_http_version 1.1;          # required for the upstream keepalive above to be used
        proxy_set_header Connection "";  # clear the Connection header so connections stay open
        proxy_next_upstream error timeout http_500 http_502 http_503;
        proxy_connect_timeout 2s;
        proxy_send_timeout 10s;
        proxy_read_timeout 10s;
    }

    location /health {
        access_log off;
        return 200 "healthy\n";
    }
}
Key principles:
Result after Week 2:
We formalized health states:
Green:
Yellow (degraded):
Red:
Each state mapped to an action plan (a rough classification sketch follows):
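To make the states concrete, here's a small sketch of how Green/Yellow/Red can be derived from the metrics above. The thresholds and inputs are illustrative assumptions, not the team's actual numbers.

# Illustrative health-state classification; the thresholds are assumptions.
from enum import Enum

class HealthState(Enum):
    GREEN = "green"    # normal: serve traffic, no action needed
    YELLOW = "yellow"  # degraded: alert on-call, shed optional load
    RED = "red"        # down: fail over or roll back immediately

def classify_health(error_rate: float, p95_latency_ms: float, db_ok: bool) -> HealthState:
    if not db_ok or error_rate > 0.05:
        return HealthState.RED
    if error_rate > 0.01 or p95_latency_ms > 800:
        return HealthState.YELLOW
    return HealthState.GREEN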
UptimeRobot, hitting /health every 60 seconds
Sentry
Prometheus + Grafana
Dashboards we built:
Suddenly, incidents had timelines instead of guesses:
At 14:02, latency spiked. At 14:03, DB connections maxed out. At 14:04, health checks failed.
Most of this team's incidents were self-inflicted:
We changed the rules:
A simplified version of the model registry pattern:
from pathlib import Path

import joblib


class ModelRegistry:
    """Blue/green model slots: load a candidate, validate it, then switch traffic."""

    def __init__(self, model_dir: Path):
        self.model_dir = model_dir
        self.blue_model = self._load_latest()
        self.green_model = None
        self.active = "blue"

    def _load_latest(self):
        # NOTE: lexicographic sort; zero-pad versions (model_v001.pkl) so v10 sorts after v9.
        model_files = sorted(self.model_dir.glob("model_v*.pkl"))
        if not model_files:
            raise FileNotFoundError(f"No model files found in {self.model_dir}")
        return joblib.load(model_files[-1])

    def _passes_smoke_tests(self, model) -> bool:
        # Simplified placeholder: a real check would run a handful of known
        # inputs through the candidate and verify the outputs look sane.
        return hasattr(model, "predict")

    def update_model(self, new_model_path: Path):
        candidate = joblib.load(new_model_path)
        if not self._passes_smoke_tests(candidate):
            raise ValueError("New model failed validation")
        # Load into the inactive slot, then flip the active pointer.
        if self.active == "blue":
            self.green_model = candidate
            self.active = "green"
        else:
            self.blue_model = candidate
            self.active = "blue"

    def predict(self, features):
        model = self.blue_model if self.active == "blue" else self.green_model
        return model.predict(features)
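And a quick usage sketch (the directory, version number, and feature values are made up for illustration):

# Illustrative usage of the registry above; paths and inputs are assumptions.
registry = ModelRegistry(Path("/opt/models"))
print(registry.predict([[0.2, 1.7, 3.4]]))

# Ship a new model: it only becomes active if it passes the smoke tests.
registry.update_model(Path("/opt/models/model_v13.pkl"))
print(registry.active)  # "green" after the first successful update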
For the API:
We wrote a one-page incident runbook:
And we used a lightweight post-mortem template for serious incidents:
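A lightweight skeleton for that template might look like this (illustrative, not the team's exact wording):

Incident: one-line summary
Date / duration:
Impact: who was affected, for how long, and how badly
Timeline: key timestamps from detection to resolution
Root cause: the boring, specific reason it happened
What went well / what went poorly:
Action items: an owner and a deadline for each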
One real example:
Before:
After 4 weeks:
You can follow the exact same arc:
If your system already has traffic and revenue, this work usually pays for itself in one avoided outage.
I work with teams who have working ML systems but unreliable production environments.
If you're:
I offer:
Want to chat?
MLOps Uptime Review
If you know a team whose ML system keeps going down, send them this post. It might save their next launch.