Your Runbooks Are Missing the Hard Parts
Every ops team has runbooks. Step 1, Step 2, Step 3. Check this metric. Run this command. Escalate to this person.
Runbooks work great for the scenarios they cover. The problem is that real incidents rarely follow the script.
What Runbooks Actually Capture
A runbook is a decision tree with one path. If X happens, do Y. It covers the common case, the textbook scenario, the thing that happened last time and will probably happen again the same way.
What runbooks don't capture:
The diagnostic process. The runbook says "check database connectivity." It doesn't say that when connectivity looks fine but queries are slow, the connection pool may be exhausted (you can verify that in the pg_stat_activity view), and that exhaustion usually means the batch job is running long because someone added a new data source last sprint.
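The pool-exhaustion check described above can be sketched roughly like this. The query, threshold, and row shape are illustrative assumptions, not any team's actual runbook; the decision logic is separated from the database call so it can be reasoned about on its own.

```python
# Rough sketch of the pool-exhaustion diagnostic. Run something like
# this query against Postgres to get one row per connection:
POOL_CHECK_QUERY = """
SELECT state, wait_event_type, query_start
FROM pg_stat_activity
WHERE datname = current_database();
"""

def pool_exhausted(rows, max_connections, threshold=0.9):
    """Flag exhaustion when nearly all slots are taken and most
    connections are actively working rather than sitting idle.
    `rows` is a list of dicts with a "state" key, mirroring the
    columns returned by the query above (an assumed shape)."""
    in_use = len(rows)
    active = sum(1 for r in rows if r.get("state") == "active")
    return in_use >= threshold * max_connections and active > in_use // 2
```

For example, 95 active connections against a `max_connections` of 100 would flag as exhausted, while 10 idle connections would not. The point is that the *reasoning* ("pool nearly full and mostly busy, not merely connected") is what the runbook leaves out.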
When to deviate. The runbook says to restart the service. But your senior engineer knows that restarting during peak hours can cause a cascade because the load balancer doesn't drain connections fast enough. So you scale up first, wait for traffic to shift, then restart the old instances one at a time.
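That safer ordering can be written down as an explicit plan. This is a hypothetical sketch: the step names and instance identifiers are made up for illustration, but the invariant (extra capacity comes online before any old instance goes down, and restarts happen one at a time) is exactly the judgment call the runbook omits.

```python
def safe_restart_plan(old_instances, extra_capacity=2):
    """Build the restart sequence: scale up, let traffic shift,
    then restart old instances one at a time."""
    plan = [f"scale_up:{extra_capacity}", "wait_for_traffic_shift"]
    # One at a time keeps enough healthy capacity behind the load
    # balancer that it never dumps connections onto cold instances.
    for inst in old_instances:
        plan.append(f"restart:{inst}")
    return plan
```

Calling `safe_restart_plan(["api-1", "api-2"])` yields `["scale_up:2", "wait_for_traffic_shift", "restart:api-1", "restart:api-2"]`.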
The weird stuff. Every system has quirks that don't fit in a runbook. The API gateway occasionally returns 504s that aren't actually timeouts — it's a known bug in the proxy layer that resolves itself. The monitoring spike every Sunday at 3am is the weekly ETL, not an incident. The "critical" alert from the legacy service hasn't been real since the migration in 2024.
The 20/80 Problem
Runbooks cover maybe 20% of incident response — the repeatable, predictable part. The other 80% is pattern recognition, judgment, and institutional memory.
A new team member following the runbook can handle the standard cases. When something unusual happens, they either escalate (adding latency) or guess (adding risk). The runbook doesn't tell them which alerts to worry about, which services are actually fragile, or which "fixes" cause more problems than they solve.
What Senior Engineers Know That Runbooks Don't Say
Ask any senior ops engineer about their system and they'll tell you things like:
"The memory leak in the auth service has been there for two years. It creeps up over about 72 hours. We restart it every Sunday night. If you see memory at 85%, don't panic — just restart early."
"When the CDN throws 403s, check if someone deployed a WAF rule change. Nine times out of ten, it's an over-aggressive regex in the new rules."
"The third-party API we depend on goes down every quarter for maintenance. They send an email to an alias nobody reads. I just check their status page on the first of each month."
This is operational knowledge. It's not runbook material because it's too specific, too contextual, too much like "stuff you just know." But it's the difference between a 10-minute resolution and a 2-hour investigation.
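Some of this knowledge can be made explicit once it's surfaced. Here is a hypothetical sketch encoding two of the quirks mentioned above as checkable rules; the threshold and schedule are taken from the examples in this post, and the function names are invented for illustration.

```python
from datetime import datetime

def auth_memory_advice(memory_pct):
    # The leak creeps up over ~72 hours; past 85%, restart early
    # rather than waiting for the Sunday-night cycle.
    return "restart_early" if memory_pct >= 85 else "ok"

def is_expected_spike(ts):
    # The weekly ETL runs Sunday at 3am; that monitoring spike
    # is scheduled work, not an incident.
    return ts.weekday() == 6 and ts.hour == 3
```

Rules like these won't replace judgment, but once written down they stop being something only one engineer knows.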
Building Real Operational Knowledge
The solution isn't better runbooks — it's a layer underneath them. A knowledge base that captures the diagnostic reasoning, the system quirks, the judgment calls that come from experience.
The best way to build this is through structured conversations with your senior people. Not "write everything down" — that never works. More like "tell me about the last three incidents you handled and what wasn't in the runbook."
Understudy does this automatically. It interviews your team about how they actually work — the shortcuts, the judgment calls, the tribal knowledge — and turns it into searchable documentation that supplements your existing runbooks.
See how Understudy works for ops teams →