Your Team's Bus Factor Is Probably 1
The "bus factor" — how many people need to get hit by a bus before a project fails — sounds morbid. But every ops manager thinks about it, even if they call it something nicer.
"What happens if Sarah goes on vacation for two weeks?"
Usually the answer is: things break, escalations pile up, and someone spends the first three days after she returns cleaning up the mess.
Where Bus Factor = 1 Shows Up
It's rarely the obvious stuff. The runbooks cover the common scenarios. The wiki has the architecture diagrams. The on-call rotation exists.
Bus factor = 1 hides in the gaps:
Vendor relationships. Sarah knows that the AWS rep responds faster if you file the ticket as "production impact" even when it isn't. She knows the Datadog contact who can waive overage charges. She knows which Jira fields the compliance team actually reads and which ones they ignore.
Escalation judgment. When a PagerDuty alert fires at 2am, Sarah knows which ones are real emergencies and which ones resolve themselves in 15 minutes. She knows that the database replication lag alert is noise but the queue depth alert means someone needs to wake up right now.
Process workarounds. The deployment pipeline has a known issue where it hangs if you push during the nightly backup window. The monitoring dashboard has a widget that lies when it crosses midnight UTC. The staging environment needs a manual cache clear after config changes or tests fail for reasons that look like code bugs.
These aren't documented because nobody thinks of them as "knowledge." They're just stuff Sarah knows.
The Vacation Test
Here's a quick way to find your bus factor risks: imagine each person on your team takes a surprise two-week vacation starting tomorrow. For each person, list what would break or stall.
If the list for any single person has more than two items, you have a bus factor problem. If the list for one person is significantly longer than everyone else's, that person is a single point of failure.
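As a rough illustration, the vacation test can be tallied in a few lines of Python. This is a minimal sketch, not a prescribed method: the names, the list items, and the `vacation_test` helper are hypothetical, and treating "significantly longer" as "at least twice the team median" is an assumption you should tune for your own team.

```python
from statistics import median

# Hypothetical answers to "what breaks or stalls if this person is out for two weeks?"
vacation_lists = {
    "Sarah": [
        "AWS support ticket escalation",
        "Datadog overage contact",
        "2am alert triage",
        "deploy hang during backup window",
        "staging cache clear after config changes",
    ],
    "Dev": ["release notes"],
    "Priya": ["quarterly access review", "license renewals"],
}

MAX_ITEMS = 2        # threshold from the text: more than two items is a problem
OUTLIER_RATIO = 2.0  # assumption: "significantly longer" means >= 2x the team median


def vacation_test(lists, max_items=MAX_ITEMS, outlier_ratio=OUTLIER_RATIO):
    """Flag people whose absence would break or stall too many things."""
    counts = {person: len(items) for person, items in lists.items()}
    typical = median(counts.values())
    report = {}
    for person, n in counts.items():
        over_threshold = n > max_items
        report[person] = {
            "items": n,
            "bus_factor_risk": over_threshold,
            # Only call someone a single point of failure if they are both
            # over the threshold and far above the typical list length.
            "single_point_of_failure": over_threshold and n >= outlier_ratio * typical,
        }
    return report


for person, result in vacation_test(vacation_lists).items():
    print(person, result)
```

The counting is the easy part. The value is in the list itself, which tells you which conversations to have first.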
Most teams find that one or two people hold a disproportionate amount of operational knowledge. Not because they're hoarding it — because nobody ever asked them to share it in a structured way.
Why Documentation Drives Don't Fix This
Most teams have tried the "let's document everything" sprint. It produces a flurry of wiki pages that are outdated within a month. The problem isn't motivation. It's format.
Asking someone to write down everything they know is like asking them to write an autobiography. Where do you start? How detailed should it be? What's worth including? The task is so open-ended that people either write too little (high-level overview that doesn't help) or burn out trying to write too much.
The people with the most knowledge are also the busiest. They're the ones handling escalations, mentoring juniors, managing vendor relationships. "Find time to document your knowledge" goes on the backlog and stays there.
What Actually Reduces Bus Factor
The teams that successfully reduce bus factor do something specific: they break knowledge capture into small, focused conversations.
Instead of "document everything about the deployment pipeline," it's "tell me about the last weird deployment issue you debugged." Instead of "write up vendor management," it's "walk me through how you handle AWS support tickets."
Each conversation takes 15-20 minutes and covers one slice of knowledge. Over a few weeks, you build a picture of what someone knows without asking them to sit down and write a manual.
Understudy automates this — it has focused conversations with your team members, asks follow-up questions about edge cases and judgment calls, and organizes the output into searchable knowledge. Your bus factor goes from 1 to "anyone on the team can handle it."
See how Understudy works for ops teams →