
Technical Debt You Can't See: How Undocumented Systems Compound Risk

Your best backend engineer just gave notice.

In two weeks, they're gone. And with them goes the knowledge of why the payment processing service is architected the way it is, how the workaround for the race condition in the order queue works, and why you can't just "migrate to Postgres" like everyone keeps suggesting.

You scramble. Exit interview. Knowledge transfer sessions. "Document everything!" you say.

They try. They write some README files. They walk the team through the codebase. They answer questions in their last few days.

Then they leave.

Three months later, something breaks in the payment service. The team stares at the code. They kind of understand what it does. They have no idea why it does it that way. They try the "obvious" fix. It works in staging. It causes a cascade failure in production.

This is knowledge debt. And unlike technical debt—which you can see in the codebase—knowledge debt is invisible until it's too late.

The Difference Between Technical Debt and Knowledge Debt

Technical debt is code that works but is hard to maintain. You can grep for it. You can measure it. You can see it in code review.

Knowledge debt is context that exists only in someone's head. You can't measure it. You can't see it. You only notice it when the person who has that context is gone.

Technical debt slows you down.

Knowledge debt is an existential risk.

You can ship a product with messy code. You can't operate a system when nobody knows how it works.

What Knowledge Debt Looks Like

It's not always dramatic. Most of the time, it's mundane:

Engineering:

  • "Why is this caching layer here?" (Because it papers over a database design flaw that would take 3 months to fix properly)
  • "Why don't we use feature flags?" (Because we tried and the override logic caused a production incident)
  • "Can we switch to this new framework?" (We evaluated it 18 months ago, and the trade-offs didn't make sense for our use case)

Architecture:

  • "Why is this microservice separate?" (It's not supposed to be—it was an experiment that accidentally became production)
  • "Why do we run this daily job at 3am?" (Because it conflicts with the reporting pipeline, which someone found out the hard way)
  • "Why is authentication handled this way?" (Compliance requirement from a customer contract two years ago)

Incident Response:

  • "This happened before, what did we do?" (Nobody remembers, no runbook exists)
  • "Which service depends on this?" (Three teams have different answers)
  • "Can we just restart it?" (Maybe? Someone knew the answer but they left)

The common thread: The code doesn't explain itself. The person who built it is gone. The knowledge is lost.

Why "Just Write Documentation" Doesn't Work

Every time someone leaves, the same conversation happens:

"We need better documentation."

Someone gets assigned to write docs. They create a Confluence space or a wiki or a README marathon. They write down what they know. It feels productive.

Then nothing changes.

Here's why:

1. Documentation Is Created During Crisis, Not During Work

The best time to document a decision is when you make it. You know the context, the trade-offs, the alternatives you considered.

The worst time is during knowledge transfer when someone's leaving. You're scrambling, trying to remember why you made decisions months or years ago, guessing at what matters.

But that's when documentation gets created—because that's when the pain is visible.

2. Documentation Rots Faster Than Code

Code has enforcement mechanisms. Tests fail. Builds break. Linters complain.

Documentation has nothing. It gets out of sync with reality and nobody notices until someone follows the docs and things break.

A wrong README is worse than no README because it's confidently incorrect.

3. Nobody's Job Is to Maintain It

You can assign someone to "write documentation." You can't assign someone to "keep documentation accurate forever."

As soon as the crisis passes, documentation stops being a priority. The wiki becomes a graveyard of outdated pages that nobody trusts.

4. The Wrong Things Get Documented

People document how the code works. That's what code is for—code is the documentation of how the system behaves.

What's missing is why:

  • Why did we choose this approach?
  • What did we try that didn't work?
  • What assumptions does this rely on?
  • What's the thing that looks weird but is actually critical?

That context is what disappears when people leave.

What to Document (and What Not To)

Not everything needs documentation. Most code is self-explanatory if it's written well.

Here's what actually matters:

Document Decisions, Not Behavior

Don't document: "The UserService class handles user CRUD operations. It has methods for create, read, update, and delete."

That's what the code says. If someone can't figure that out from reading the code, the code is the problem.

Do document: "We use UUID v7 for user IDs instead of auto-incrementing integers. We switched in Q3 2024 because sequential IDs were leaking growth metrics to competitors who scraped the API. Old users still have integer IDs—see migration plan in [link]."

That's context the code can't capture.

Document Non-Obvious Trade-Offs

Don't document: "This function validates email addresses."

Do document: "Email validation is intentionally lenient—we only check for '@' and a dot. Strict RFC 5322 validation broke signups for international users with valid but uncommon email formats. We decided false positives (bad email gets through) are better than false negatives (real user can't sign up). See incident post-mortem: [link]."

Now when someone suggests "let's use a proper email validation library," they understand why you don't.
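
Here is a minimal sketch of how that rationale could sit right next to the check itself. The function name `is_plausible_email` and the exact rule are assumptions made up for illustration, not a recommended validator.

```python
def is_plausible_email(address: str) -> bool:
    """Lenient check: an '@' with text before it and a dot inside the domain.

    Intentionally NOT strict RFC 5322 validation. Rationale (hypothetical,
    mirroring the note above): strict validation rejected real users, and a
    bad address getting through costs less than a lost signup.
    """
    local, sep, domain = address.strip().rpartition("@")
    # Strip leading/trailing dots so ".com" or "example." do not count.
    return bool(sep and local and "." in domain.strip("."))


print(is_plausible_email("user@example.com"))
print(is_plausible_email("user@localhost"))
```

The docstring does the real work here: anyone tempted to swap in a stricter library sees the trade-off before they touch the code.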

Document Workarounds and Hacks

If there's something in the codebase that looks wrong but is actually critical, document why it exists.

"This sleep(500ms) looks like a race condition fix, but it's actually rate-limit handling for the external API. Removing it will cause 429 errors. The right fix is to implement exponential backoff, but that's blocked on [issue #1234]."

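For context, the "right fix" that comment refers to, exponential backoff, could look roughly like the sketch below. `call_api` and `RateLimitError` are hypothetical stand-ins for whatever your real API client exposes when it receives a 429, and the delays are assumptions chosen for illustration.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical: raised by the client when the API answers HTTP 429."""


def call_with_backoff(call_api, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry call_api on RateLimitError, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return call_api()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries; let the caller see the failure
            # 0.5s, 1s, 2s, ... plus up to 10% jitter so that clients
            # do not all retry at the same moment.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))
```

The `sleep` parameter exists only so the waiting can be replaced in tests. In production code you would leave it at its default.
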
Document Dependencies and Assumptions

"This service assumes the cache is always available. If Redis goes down, requests will fail—there's no fallback. This is intentional; see the architecture decision record about consistency vs. availability trade-offs."

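An assumption like this can also be made visible in the code path itself. The sketch below is one possible shape, assuming a hypothetical `cache` object whose `get` raises the built-in `ConnectionError` when the cache is down. Real clients often raise their own exception types, so adapt the `except` clause accordingly.

```python
class CacheUnavailableError(RuntimeError):
    """Raised when the cache cannot be reached; the request should fail."""


def load_session(session_id: str, cache) -> dict:
    """Fetch a session from the cache.

    ASSUMPTION (documented on purpose): the cache is always available.
    There is deliberately no fallback here.
    """
    try:
        return cache.get(session_id)
    except ConnectionError as exc:
        # Fail loudly rather than quietly adding a fallback nobody agreed to.
        raise CacheUnavailableError(
            "cache unreachable; no fallback by design"
        ) from exc
```

A named exception plus a docstring makes the missing fallback look like a choice rather than an oversight.
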
Document Things That Failed

"We tried implementing feature flags using [library X]. It looked great in theory but caused a production incident when the override logic interacted badly with our multi-region setup. We rolled back and decided to build a simpler version ourselves. See: [incident report]."

This stops the next person from making the same mistake.

When to Document

The right time to document is when you make the decision, not when someone's about to leave.

Here's the habit:

After Every Significant Decision

When you:

  • Choose an architecture approach
  • Decide not to use a technology
  • Implement a workaround
  • Make a trade-off between competing concerns

Write down:

  • What you decided
  • Why you decided it
  • What alternatives you considered
  • What would make you revisit this decision

This takes about five minutes. It can save weeks of confusion later.

After Every Production Incident

Incident post-mortems should answer:

  • What broke
  • Why it broke
  • What we did to fix it
  • What we're changing to prevent it

But also: What we learned about how the system actually works.

Incidents reveal assumptions. Document them.

After Every "Wait, Why Is This Here?" Moment

When someone asks "why do we do it this way?" and the answer isn't obvious from the code, write it down.

Right then. Not later.

Where to Document

The best place to document something is as close to the code as possible.

In-Code Comments for Context

Use comments for why, not what:

# We hash passwords with bcrypt instead of argon2 because
# argon2 memory requirements caused OOM errors under peak load.
# Revisit when we move to larger instances. See: INFRA-456
hash_password(password, algorithm='bcrypt')

Architecture Decision Records (ADRs)

For significant decisions, use ADRs—lightweight docs that capture:

  • Context: What's the situation?
  • Decision: What did we decide?
  • Consequences: What are the trade-offs?

Keep them in the repo as markdown files. When someone asks "why did we build it this way?" point them to the ADR.
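
If a blank page is what stops people from writing ADRs, a tiny generator can help. The sketch below is one possible shape, not a standard: the `ADR` class, the numbered filename scheme, and the sample date are all assumptions for illustration. The sample content reuses the bcrypt comment from earlier.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ADR:
    """One architecture decision record with the three sections listed above."""
    number: int
    title: str
    context: str
    decision: str
    consequences: str
    decided_on: date

    def filename(self) -> str:
        # Zero-padded number plus a slug of the title (assumed naming scheme).
        slug = "-".join(self.title.lower().split())
        return f"{self.number:04d}-{slug}.md"

    def to_markdown(self) -> str:
        return (
            f"# {self.number}. {self.title}\n\n"
            f"Date: {self.decided_on.isoformat()}\n\n"
            f"## Context\n\n{self.context}\n\n"
            f"## Decision\n\n{self.decision}\n\n"
            f"## Consequences\n\n{self.consequences}\n"
        )


adr = ADR(
    number=7,  # hypothetical sequence number
    title="Hash passwords with bcrypt",
    context="argon2 memory requirements caused OOM errors under peak load.",
    decision="Use bcrypt for password hashing.",
    consequences="Revisit when we move to larger instances. See: INFRA-456",
    decided_on=date(2024, 1, 15),  # placeholder date
)
print(adr.filename())
print(adr.to_markdown())
```

Write the resulting text to a file in the repo and commit it alongside the change it explains.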

Runbooks for Operational Knowledge

For "what to do when X happens," create runbooks:

  • Service won't start → check Y, then Z
  • Performance degradation → check these metrics, try these fixes
  • Data inconsistency → here's the recovery process

Store them with your infrastructure-as-code or in your incident management tool.

README Files for High-Level Context

Use README files to answer:

  • What is this service/repo/component?
  • Why does it exist?
  • What does it depend on?
  • What depends on it?
  • Where do I go for more information?

Keep it short. If it's more than 200 lines, break it into separate docs.
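
A length guideline like this is easy to check automatically. The following is a rough sketch, assuming READMEs are files whose names start with `README`; the function name `oversized_readmes` is made up for this example.

```python
from pathlib import Path

MAX_README_LINES = 200  # threshold taken from the guideline above


def oversized_readmes(root: str, limit: int = MAX_README_LINES) -> list[tuple[str, int]]:
    """Return (path, line_count) for every README under root longer than limit."""
    results = []
    for path in sorted(Path(root).rglob("README*")):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            count = len(text.splitlines())
            if count > limit:
                results.append((str(path), count))
    return results


if __name__ == "__main__":
    for path, count in oversized_readmes("."):
        print(f"{path}: {count} lines, consider splitting it up")
```

Run it from the repository root, or wire it into CI as a warning rather than a hard failure.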

Making Documentation a Habit, Not a Project

Documentation fails when it's treated as a separate phase of work. "We'll document it later." "Let's do a docs sprint."

It works when it's part of the process:

Include It in Code Review

Ask: "Is the context here obvious? If I read this code in six months, will I understand why it's written this way?"

If the answer is no, request documentation.

Include It in Definition of Done

A feature isn't done until:

  • The code is written
  • Tests pass
  • Relevant decisions are documented

Not "fully documented." Just: "Why did we build it this way?" is captured somewhere.

Assign Ownership

Someone needs to be responsible for:

  • Maintaining ADRs
  • Keeping runbooks up to date
  • Flagging when docs go stale

This doesn't need to be full-time. But it needs to be someone's explicit responsibility.
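
Flagging stale docs does not have to be manual. One crude heuristic, sketched below, is to list which source files each doc describes and flag any doc that is older than the code it covers. The mapping and the function name `stale_docs` are assumptions for illustration. The sketch compares file modification times, which are not reliable on a fresh checkout, so a real check would more likely compare commit dates.

```python
import os


def stale_docs(doc_to_sources: dict[str, list[str]]) -> list[str]:
    """Return docs that were last modified before any source file they describe."""
    stale = []
    for doc, sources in doc_to_sources.items():
        doc_mtime = os.path.getmtime(doc)
        # A doc is suspect if any file it describes changed after it did.
        if any(os.path.getmtime(src) > doc_mtime for src in sources):
            stale.append(doc)
    return stale
```

A flagged doc is not necessarily wrong. It is a prompt for the owner to take a look.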

The Hidden Cost of Knowledge Debt

Here's what happens when knowledge lives only in people's heads:

You can't onboard new engineers effectively. They ask questions, get partial answers, piece together understanding slowly.

You make the same mistakes twice. Someone tries something that failed before, but nobody remembers it failed.

You can't make good decisions. You don't know the constraints, the history, the reasons things are the way they are.

You're terrified of people leaving. When someone gives notice, you panic—because they're not just leaving, they're taking critical knowledge with them.

You accumulate complexity. New code gets layered on top of old code that nobody fully understands. The system becomes increasingly fragile.

This is the compounding cost of knowledge debt. Like financial debt, a little bit is manageable. But the longer you leave it unpaid, the faster it grows.

Documentation as Risk Mitigation

Here's the shift in mindset:

Documentation isn't busywork. It's not a nice-to-have. It's not something you do when you have time.

Documentation is insurance against catastrophic knowledge loss.

When someone leaves, gets hit by a bus, or simply forgets, the documentation is what keeps the system running.

When you're making a decision under pressure, the documentation is what tells you what was already tried and why it didn't work.

When you're scaling the team, the documentation is what lets new people contribute without needing six months of oral history.

The question isn't "do we have time to document this?"

The question is "can we afford not to?"

Start Small

You don't need to document everything tomorrow. Start with:

  1. Document the next decision you make. Just one. Write down why.
  2. After the next incident, capture what you learned about how the system works.
  3. Next time someone asks "why is this here?" write down the answer instead of just saying it.

Do this consistently for a month. You'll likely have more useful documentation than many teams create in a year.

Knowledge debt doesn't get paid off in a single sprint. It gets paid off one decision, one incident, one "why?" at a time.

The best time to start was when you wrote the first line of code.

The second-best time is now.


Understudy helps engineering teams capture critical knowledge during the work, not after someone leaves. See how it works or explore pricing.
