Incident Response · 6 min read

How We Trace Production Incidents Back to the Code Change That Caused Them

A real story about a change that was in staging for 6 weeks before crashing production. How Kestrel's Causal PR Search and AI Code Fix agents turn 45-minute investigations into 5-minute resolutions.

March 3, 2026 · By Evan Chopra, Founder & CTO

Before I started Kestrel, I was a software engineer at a genomics platform company called DNAnexus. I had been working on a feature to enable one of their largest customers to run more complex bioinformatics workflows. It required significant changes to how jobs were scheduled and what resources they could access. Big feature with a high-stakes customer. This is the kind of feature that requires careful planning and slow rollouts.

Working together with the QA team, we ran through every test case we could think of: happy paths, edge cases, and failure modes. We deployed it to staging and left it running for extended testing, and the QA team ran through multiple iterations. In all, we let it soak for six whole weeks to make sure platform stability was not affected before rolling it out to production. Upon upgrading, jobs on the platform started failing left and right.

DNAnexus Status Page, May 9, 2023

22:31 UTC  "We identified an issue after our weekly deploy
            causing some new jobs to fail. We have rolled
            back the change and new jobs are succeeding."

01:16 UTC  "New jobs continue to succeed and the platform
            is operating normally."

The status page tells the clean version. Behind the scenes, the platform team spent about 45 minutes going back and forth trying to determine what had caused this. The change had been in staging so long that it was not the obvious "what did we deploy today?" suspect. Nobody immediately thought of the workflow change that had been "proven stable" for a month and a half.

The root cause? Production jobs had fields that staging jobs did not have. Templates that had been in use for years carried data shapes that our staging environment had never seen. The change worked perfectly against staging's clean, recent data. It broke immediately against production's older, messier, real-world data.

And once we identified the problem, we still had to figure out exactly what to roll back to. The change spanned multiple files. The deploy bundle included other changes too. It was not as simple as "revert the last commit."

The Git Blame Detective Game

This investigation followed a pattern every on-call engineer knows. Open the failing service's repository. Run git log --oneline -30 to see recent merges. Squint at PR titles, trying to guess which one touched that file. Click through 5 to 10 PRs, scanning diffs for anything related. Ask in Slack: "Did anyone change anything in the profile service recently?" Get three different answers. Investigate all of them. Finally, find the PR that introduced the bug 12 days ago.

This process takes 20 to 60 minutes on a good day. On a bad day, when the change is old, when it spans multiple repos, when the person who wrote it is on vacation, it takes hours. And the whole time production is down or degraded.

The problem is not that engineers are slow. The problem is that the investigation is manual, sequential, and requires context that no single person has. You need to know which files appear in the stack trace and which PRs touched those files recently. Then you need to figure out what those PRs actually changed, not just what their titles claim, and whether the change could explain the failure mode you are seeing.

That is a correlation problem.
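The first half of that correlation, pulling candidate files out of a stack trace, is mechanical enough to sketch. This is an illustrative snippet, not Kestrel's implementation; `files_from_traceback` and the sample trace are hypothetical, and it assumes CPython-style `File "...", line N` frames:

```python
import re

# Matches the 'File "<path>", line N' frames that CPython tracebacks emit.
FRAME_RE = re.compile(r'File "([^"]+)", line (\d+)')

def files_from_traceback(trace: str) -> list[str]:
    """Return unique file paths in the order they appear in the trace."""
    seen: dict[str, None] = {}
    for path, _line in FRAME_RE.findall(trace):
        seen.setdefault(path, None)
    return list(seen)

# Hypothetical trace, loosely modeled on the incident in this post.
trace = '''Traceback (most recent call last):
  File "/app/services/profile_service.py", line 88, in handle
    template = load_template(job)
  File "/app/services/templates.py", line 41, in load_template
    return schema[job["template_version"]]
KeyError: 'v1-legacy'
'''

print(files_from_traceback(trace))
# → ['/app/services/profile_service.py', '/app/services/templates.py']
```

The second half, knowing which PRs touched those files, is the part no regex can do for you, which is where the agents below come in.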

How Kestrel Traces Incidents Back to Code Changes

When Kestrel detects an incident, a pod in CrashLoopBackOff, a spike in 500 errors, a deployment that will not roll out, it kicks off an AI-powered root cause analysis. The RCA agent investigates iteratively: checking pod status, reading logs, querying metrics, drilling deeper with each turn.

When the RCA agent determines that the failure is an application-level problem (a code bug, not an infrastructure problem), two things happen in parallel:

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│   RCA Agent     │────▶│  Kestrel Server  │────▶│  In Parallel:       │
│                 │     │                  │     │                     │
│ Detects app     │     │ Auto triggers    │     │  1. Code Fix Agent  │
│ level failure   │     │ both agents      │     │     Reads source    │
│                 │     │                  │     │     Generates fix   │
│ Extracts:       │     │                  │     │     Creates PR      │
│  Stack trace    │     │                  │     │                     │
│  Error context  │     │                  │     │  2. Causal PR Agent │
│  Source files   │     │                  │     │     Lists PRs/MRs   │
│                 │     │                  │     │     Fetches diffs   │
│                 │     │                  │     │     Identifies      │
│                 │     │                  │     │     cause of failure│
└─────────────────┘     └──────────────────┘     └─────────────────────┘
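The fan-out above can be sketched in a few lines. This is a minimal illustration of the parallel dispatch pattern, not Kestrel's server code; the two `run_*` functions are stand-ins for the real agents:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in agents: the real ones read source, call an LLM, open PRs.
def run_code_fix_agent(ctx: dict) -> dict:
    return {"agent": "code_fix", "files": ctx["source_files"]}

def run_causal_pr_agent(ctx: dict) -> dict:
    return {"agent": "causal_pr", "window_days": ctx["window_days"]}

def dispatch(ctx: dict) -> list[dict]:
    # The two investigations are independent, so they run concurrently;
    # the incident page can render whichever result lands first.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(run_code_fix_agent, ctx),
                   pool.submit(run_causal_pr_agent, ctx)]
        return [f.result() for f in futures]

results = dispatch({"source_files": ["services/profile_service.py"],
                    "window_days": 90})
```

The point of running both at once is that neither blocks the other: identifying the causal PR and drafting the fix are separate questions with separate inputs.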

The Causal PR Search Agent

The Causal PR Search Agent does what you do manually. But across all connected repositories. In seconds.

Phase 1, Discovery: The agent lists all merged PRs and direct commits to the default branch within a configurable time window. It works with both GitHub and GitLab.

Phase 2, Filtering: For each PR, the agent checks which files were changed and cross-references them against the suspected source files from the stack trace. PRs that touch relevant files get promoted for deeper analysis.

Phase 3, Analysis: The agent fetches the full unified diff of suspicious PRs and commits. The LLM reads the actual code changes and correlates them with the error context. Not just filenames, but the specific logic changes that could explain the failure.

Phase 4, Conclusion: Each candidate is returned with an explanation of why this change likely caused the incident.
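Condensed into code, the four phases look roughly like this. This is a sketch under stated assumptions, not Kestrel's implementation: the PR dict shape (`number`, `files`, `diff`) is hypothetical, discovery is assumed already done, and `analyze` stands in for the LLM call:

```python
def find_causal_prs(prs, suspect_files, analyze):
    # Phase 1 (discovery) is assumed done: `prs` is the merged-PR list
    # for the time window. Phase 2: keep only PRs touching suspect files.
    candidates = [pr for pr in prs
                  if set(pr["files"]) & set(suspect_files)]
    # Phases 3 and 4: read each candidate's diff and keep the ones the
    # model can connect to the failure, with an explanation of why.
    return [{"pr": pr["number"], "why": verdict}
            for pr in candidates
            if (verdict := analyze(pr["diff"])) is not None]

# Hypothetical data: a docs-only PR and the one that dropped a fallback.
prs = [
    {"number": 101, "files": ["README.md"], "diff": "docs only"},
    {"number": 97,  "files": ["services/templates.py"],
     "diff": "- schema.get(version, DEFAULT)\n+ schema[version]"},
]
hits = find_causal_prs(
    prs,
    suspect_files=["services/templates.py"],
    analyze=lambda d: ("removed fallback for old template versions"
                       if "schema[version]" in d else None),
)
```

Filtering before analysis is the load-bearing design choice: reading every diff with an LLM would be slow and noisy, but the file cross-reference cuts the candidate set down to a handful.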

The Code Fix Agent

While the Causal PR Agent identifies what broke, the Code Fix Agent figures out how to fix it. Running in parallel, it:

  1. Searches across all connected repositories (not just one) to find the source files referenced in the stack trace
  2. Reads the relevant source code via GitHub or GitLab API
  3. Analyzes the error context, stack trace, and application logs
  4. Generates a multi-file code fix with explanations
  5. Presents a diff view in the incident detail page
  6. Lets you create a PR or MR directly from the Kestrel UI with one click

The cross-repository search is critical. In microservice architectures, the container path in a stack trace (/app/services/profile_service.py) often does not match the repository structure. The agent searches all connected repos to find the actual file, regardless of how the Docker build rearranged things.
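One way to do that path reconciliation, offered here as a hypothetical sketch rather than Kestrel's actual algorithm, is to score every file in every connected repo by how many trailing path components it shares with the container path:

```python
def best_repo_match(container_path: str, repo_files: dict) -> tuple:
    """repo_files maps repo name -> list of file paths in that repo.
    Returns the (repo, path) pair with the longest shared path suffix."""
    target = container_path.strip("/").split("/")
    best, best_len = None, 0
    for repo, paths in repo_files.items():
        for path in paths:
            parts = path.split("/")
            # Count how many trailing components agree.
            n = 0
            while (n < min(len(parts), len(target))
                   and parts[-1 - n] == target[-1 - n]):
                n += 1
            if n > best_len:
                best, best_len = (repo, path), n
    return best

# Hypothetical repos: the Docker build moved src/ to /app/.
repos = {
    "profile-service": ["src/services/profile_service.py", "src/main.py"],
    "billing":         ["app/services/handlers.py"],
}
match = best_repo_match("/app/services/profile_service.py", repos)
# → ('profile-service', 'src/services/profile_service.py')
```

Suffix matching works because Docker builds rewrite the prefix of a path (where the code was copied to) far more often than its tail (the package structure the code actually imports by).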

What This Would Have Looked Like at DNAnexus

Let's replay my DNAnexus incident with Kestrel in the loop.

| Timeline | Without Kestrel | With Kestrel |
| --- | --- | --- |
| T+0 min | Jobs start failing after deploy | Jobs start failing after deploy |
| T+2 min | Monitoring alerts fire, on-call paged | Kestrel detects incident, RCA agent begins investigation |
| T+5 min | Team checks dashboards, confirms jobs failing | RCA identifies app-level failure from logs. Causal PR and Code Fix agents launch. |
| T+8 min | "What changed? Check the deploy manifest..." | Causal PR agent identifies the workflow change PR (merged 6 weeks ago). Code Fix agent shows the backwards-incompatible field access. |
| T+10 min | "Staging worked for 6 weeks, can't be that PR..." | Team reviews fix, clicks "Create PR", and fixes forward. Incident resolved. |
| T+20 min | Still investigating. "Let's check if the data shapes differ..." | |
| T+45 min | Root cause identified. Rollback initiated. | |

The key insight: Kestrel would not have been confused by the 6 week gap between merge and deploy. The Causal PR agent searches and correlates code changes against the actual failure.

Why Staging Did Not Catch It (And Never Will)

We did everything right:

  1. Comprehensive test cases with the QA team
  2. Extended staging soak. Six weeks of stability.
  3. Happy path and edge case coverage
  4. Manual review of the change by multiple engineers

But staging will never catch data shape incompatibilities with production. Staging has clean, recent data. Production has years of accumulated templates, deprecated fields, and edge cases that no test fixture can replicate. This is not a DNAnexus-specific problem. It is a fundamental gap in how our industry tests changes.

The question is not "how do we prevent this?" We cannot, not completely. The question is: when it happens, how fast can you identify what went wrong and fix it?

The difference between a 45-minute outage and a 10-minute one is not prevention. It is detection and resolution speed. Every minute you spend playing git blame detective is a minute your users are affected.

Why This Matters

The pattern behind my DNAnexus incident is incredibly common:

The Long Staging Soak The longer a change is "stable" in staging, the less likely anyone is to suspect it when production breaks. Six weeks of green builds become an alibi, not evidence.

The Backwards Compatibility Trap Adding new complexity to existing systems means interacting with data shapes you have never seen. Legacy templates, deprecated fields, customer-specific configurations. These exist in production but rarely in test environments.

The Multi-Repo Blind Spot In microservice architectures, a change in Service A can cause failures in Service B. Manual investigation naturally starts with the failing service's repo. The causal change might be three repos away.

The "It Can't Be That" Bias When a change has been in staging for 6 weeks, your brain immediately rules it out. "We tested that. It is proven. It must be something else." An AI agent does not have this bias. It evaluates every PR against the evidence, regardless of how long ago it was merged.

Try It on Your Own Cluster

If you want to see this on your own infrastructure, connect Kestrel to your Kubernetes clusters and GitHub or GitLab repositories. The next time an application-level incident fires, Kestrel will:

  1. Detect the incident and run root cause analysis
  2. Classify it as an application bug vs infrastructure issue
  3. Search all connected repos for the causal PR
  4. Generate a code fix with one-click PR creation

The change I deployed at DNAnexus was not wrong. The feature worked exactly as designed. The problem was that production data did not look like staging data, and nobody could have known that without deploying to production. That is going to keep happening.

Kestrel does not prevent code bugs. It catches them faster. Not by replacing your team's judgment, but by doing the tedious correlation work that takes humans hours in about 5 minutes.

See It on Your Own Infrastructure

We offer a 2-week free trial. Connect your clusters and repositories, and the next production incident will come with a root cause and a fix.

Start Free Trial