0 → 99.9% Uptime: Turning a Flaky ML API Into a Reliable Product in 4 Weeks

How I took a real ML API from constant outages and angry users to 99.9% uptime in a month – and the exact playbook you can steal.


The Problem: When "It Works on My Machine" Starts Costing Real Money

In 2024 I worked with a team that had what most ML engineers would call a good problem:

But in production, their ML API was a mess:

The founder's summary was brutal:

"We don't have an ML problem. We have a reliability problem."

This post is the exact 4-week playbook we used to go from flaky uptime to 99.9% uptime, without rewriting the entire system or spending enterprise money.

If you're running an ML API that keeps breaking at the worst possible time, steal this.


Week 1 – Baseline: Measure Before You Touch Anything

Step 1: Instrument the Current Reality

Before changing infrastructure, we needed to know what was actually happening.

We added three things:

  1. Uptime monitoring (UptimeRobot)

  2. Error tracking (Sentry)

  3. Latency + error metrics (Prometheus + Grafana)
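
For the latency and error metrics, the instrumentation can be as small as one histogram and one counter around each request. Here is a minimal sketch using the Python prometheus_client library; the FastAPI framework and the metric names are assumptions for illustration, not necessarily what this team ran:

import time

from fastapi import FastAPI, Request, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = FastAPI()

REQUEST_LATENCY = Histogram(
    "api_request_latency_seconds", "Request latency in seconds", ["endpoint"]
)
REQUEST_ERRORS = Counter(
    "api_request_errors_total", "Requests that returned a 5xx", ["endpoint"]
)

@app.middleware("http")
async def track_metrics(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    REQUEST_LATENCY.labels(endpoint=request.url.path).observe(time.perf_counter() - start)
    if response.status_code >= 500:
        REQUEST_ERRORS.labels(endpoint=request.url.path).inc()
    return response

@app.get("/metrics")
def metrics():
    # Prometheus scrapes this endpoint; Grafana reads from Prometheus.
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)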

Why this matters:

Step 2: First-Week Reality Check

After 7 days, the picture was clear:

Patterns we saw:

Key insight: most "random" incidents had boring, repeatable causes.


Week 2 – Architecture: Remove Single Points of Failure

Step 3: Redesign the API Topology (Without a Rewrite)

Before:

[Clients] → [API Server] → [Postgres]
                         → [ML Model in Process]

If that single server died, everything died.

After:

[Clients]
   → [Cloudflare]
        → [NGINX Load Balancer]
              → [API 1]
              → [API 2]
                    → [Redis Cache]
                    → [Postgres Primary] → [Postgres Standby]

Changes we made:

This alone doesn't give you 99.9% uptime, but it removes the most obvious single points of failure.
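
One piece of the new topology, the Redis cache sitting in front of the model, is worth sketching. The snippet below is an assumption about how such a cache might look (redis-py client, a hashed-payload key scheme, and a 5-minute TTL are all illustrative choices, not the team's exact code):

import hashlib
import json

import redis

cache = redis.Redis(host="redis.internal", port=6379, decode_responses=True)

def cached_predict(model, features: dict, ttl_seconds: int = 300):
    # Identical feature payloads hash to the same key, so repeat requests
    # skip the model entirely.
    payload = json.dumps(features, sort_keys=True)
    key = "pred:" + hashlib.sha256(payload.encode()).hexdigest()

    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except redis.RedisError:
        pass  # a broken cache should degrade to "just call the model"

    prediction = model.predict([[features[k] for k in sorted(features)]]).tolist()

    try:
        cache.setex(key, ttl_seconds, json.dumps(prediction))
    except redis.RedisError:
        pass
    return prediction

Note the try/except around every Redis call: the cache is there to shed load, and it should never become a new single point of failure.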

Step 4: Timeouts, Retries, and Health Checks

Downtime isn't just "the server is dead." It's also:

We tightened up the edge using NGINX:

upstream api_backend {
    # Mark a backend as down after 3 failed requests; retry it after 30s.
    server api1.internal:8000 max_fails=3 fail_timeout=30s;
    server api2.internal:8000 max_fails=3 fail_timeout=30s;
    keepalive 32;
}

server {
    listen 443 ssl http2;
    server_name api.example.com;
    # ssl_certificate / ssl_certificate_key omitted for brevity

    location / {
        proxy_pass http://api_backend;
        # Needed for the upstream keepalive pool to actually be used
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        # On errors or timeouts, retry the request on the other backend
        proxy_next_upstream error timeout http_500 http_502 http_503;
        proxy_connect_timeout 2s;
        proxy_send_timeout 10s;
        proxy_read_timeout 10s;
    }

    location /health {
        access_log off;
        return 200 "healthy\n";
    }
}
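
The /health location above only proves that NGINX itself is alive. A common companion is a deeper readiness endpoint inside the API that exercises the database and the model; whether this team did exactly that isn't shown here, but a sketch might look like this (FastAPI, plus a hypothetical get_db_connection() helper and model object standing in for your own):

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/ready")
def ready():
    checks = {}

    # Database: a trivial query through whatever pool/driver you already use.
    # get_db_connection() is a hypothetical helper standing in for yours.
    try:
        with get_db_connection() as conn:
            conn.execute("SELECT 1")
        checks["postgres"] = "ok"
    except Exception as exc:
        checks["postgres"] = f"error: {exc}"

    # Model: is something loaded and able to serve? `model` is whatever
    # object the API calls predict() on (hypothetical global here).
    checks["model"] = "ok" if model is not None else "error: not loaded"

    healthy = all(v == "ok" for v in checks.values())
    return JSONResponse(status_code=200 if healthy else 503, content=checks)

A 503 from an endpoint like this is what lets a load balancer or uptime monitor pull a sick instance out of rotation before users notice.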

Key principles:

Result after Week 2:


Week 3 – Monitoring That Prevents 3 AM Surprises

Step 5: Define Red / Yellow / Green States

We formalized health states:

Each state mapped to an action plan:
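
As a rough illustration of what "formalized" means in practice, the mapping can literally be a small function over the metrics you already collect. The thresholds below are invented for the example, not the team's real numbers:

def health_state(error_rate: float, p95_latency_ms: float) -> str:
    # Illustrative thresholds only; tune them to your own traffic.
    if error_rate > 0.05 or p95_latency_ms > 2000:
        return "red"      # page someone now
    if error_rate > 0.01 or p95_latency_ms > 800:
        return "yellow"   # investigate during working hours
    return "green"        # no action needed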

Step 6: Concrete Monitoring Setup

UptimeRobot

Sentry
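
For a Python API, wiring in Sentry is usually just an init call early in startup. A minimal sketch (the DSN and sample rate are placeholders):

import sentry_sdk

sentry_sdk.init(
    dsn="https://<key>@sentry.example.com/<project>",  # placeholder DSN
    environment="production",
    traces_sample_rate=0.1,  # sample 10% of transactions for performance data
)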

Prometheus + Grafana

Dashboards we built:

Suddenly, incidents had timelines instead of guesses:

At 14:02, latency spiked. At 14:03, DB connections maxed out. At 14:04, health checks failed.
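
The nice thing is that this kind of timeline can be reconstructed straight from Prometheus after the fact. A sketch of pulling a metric over an incident window via the Prometheus HTTP API (the server address, metric name, and timestamps are illustrative):

import requests

PROMETHEUS = "http://prometheus.internal:9090"  # assumed address

def metric_timeline(query: str, start: str, end: str, step: str = "30s"):
    # /api/v1/query_range returns the metric's values across a time window.
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query_range",
        params={"query": query, "start": start, "end": end, "step": step},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# e.g. p95 latency around a 14:00 incident
p95 = metric_timeline(
    "histogram_quantile(0.95, sum by (le) (rate(api_request_latency_seconds_bucket[5m])))",
    start="2024-06-01T13:50:00Z",
    end="2024-06-01T14:10:00Z",
)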


Week 4 – Deployments & Incidents: Stop Breaking Prod Yourself

Step 7: Zero-Downtime Deploys for Models and APIs

Most of this team's incidents were self-inflicted:

We changed the rules:

A simplified version of the model registry pattern:

from pathlib import Path

import joblib


class ModelRegistry:
    """Two model slots ("blue" and "green"): a new model is loaded and
    validated in the idle slot before traffic is switched to it."""

    def __init__(self, model_dir: Path):
        self.model_dir = model_dir
        self.blue_model = self._load_latest()
        self.green_model = None
        self.active = "blue"

    def _load_latest(self):
        model_files = sorted(self.model_dir.glob("model_v*.pkl"))
        if not model_files:
            raise FileNotFoundError(f"No model files found in {self.model_dir}")
        return joblib.load(model_files[-1])

    def _passes_smoke_tests(self, model) -> bool:
        # Placeholder: in practice, run a handful of known inputs through the
        # candidate and check the outputs look sane before it serves traffic.
        return hasattr(model, "predict")

    def update_model(self, new_model_path: Path):
        candidate = joblib.load(new_model_path)

        if not self._passes_smoke_tests(candidate):
            raise ValueError("New model failed validation")

        # Load the candidate into the inactive slot, then flip the pointer.
        if self.active == "blue":
            self.green_model = candidate
            self.active = "green"
        else:
            self.blue_model = candidate
            self.active = "blue"

    def predict(self, features):
        model = self.blue_model if self.active == "blue" else self.green_model
        return model.predict(features)
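
Usage during a deploy might look like this (the paths and version number are placeholders):

registry = ModelRegistry(model_dir=Path("/srv/models"))

# Load the new artifact into the idle slot and flip traffic to it, without
# restarting the API process. A failed smoke test raises and leaves the
# currently serving model untouched.
registry.update_model(Path("/srv/models/model_v42.pkl"))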

For the API:

Step 8: Incident Runbook + Post-Mortems

We wrote a one-page incident runbook:

  1. Confirm it's not a false alarm.
  2. Check:
  3. Decide:
  4. Communicate:

And we used a lightweight post-mortem template for serious incidents:

One real example:


Results: Before vs After

Hard Numbers

Before:

After 4 weeks:

Soft Wins


Your 4-Week 99.9% Uptime Plan (Copy-Paste)

You can follow the exact same arc:

If your system already has traffic and revenue, this work usually pays for itself in one avoided outage.


Need Help Getting There Faster?

I work with teams who have working ML systems but unreliable production environments.

If you're:

I offer:

Want to chat?


If you know a team whose ML system keeps going down, send them this post. It might save their next launch.