How Kestrel Detects & Fixes Cloud Networking Incidents in Real Time
A deep dive into how Kestrel automatically identifies and remediates complex networking misconfigurations - before they become outages.
Cloud Networking Incidents: The Silent Killers
Cloud networking incidents are among the most difficult infrastructure problems to debug. Unlike application errors that produce stack traces, networking misconfigurations often fail silently - traffic simply disappears, connections time out, and engineers spend hours chasing ghosts.
Kestrel handles the full spectrum of cloud networking incidents, including:
- VPC peering blackholes - Asymmetric routes causing one-way communication
- Security group misconfigurations - Overly permissive rules or blocked legitimate traffic
- Transit gateway routing issues - Complex multi-VPC topologies with missing or conflicting routes
- Network ACL conflicts - Stateless rules blocking return traffic
- NAT gateway failures - Private subnets losing internet connectivity
- DNS resolution issues - Route 53 resolver rules, private hosted zones, and VPC DNS settings
In this post, we'll walk through a specific example - the VPC peering blackhole - to show how Kestrel detects and fixes these issues in real time.
Example: The VPC Peering Blackhole
VPC peering is a common pattern for connecting isolated network environments in AWS. However, a subtle misconfiguration can create what we call a "blackhole" - where traffic flows in one direction but responses silently disappear. These issues are notoriously difficult to debug because:
- The forward path works perfectly (requests arrive at the destination)
- No error messages are generated - packets are simply dropped
- Symptoms appear as random timeouts, making it look like application issues
- Traditional monitoring tools don't catch configuration drift
Demo Setup
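We've created an open-source Terraform configuration that demonstrates this exact scenario. You can find it on GitHub in the KestrelAI/Demos repository (https://github.com/KestrelAI/Demos.git), under vpc-peering-blackhole-demo.
Clone this repo and follow along with your own AWS account.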
Architecture Overview
The demo creates two VPCs with a peering connection:
- VPC A contains two client instances. They use the same AMI, the same security group, and the same IAM role. The only difference is that they live in different subnets.
- VPC B contains a single EC2 instance running a simple HTTP server. Security groups are intentionally permissive so traffic is not blocked at that layer.
┌─────────────────────────────────────┐              ┌─────────────────────────┐
│        VPC A (10.10.0.0/16)         │              │  VPC B (10.20.0.0/16)   │
│                                     │              │                         │
│ ┌───────────────┐ ┌──────────────┐  │              │  ┌───────────────────┐  │
│ │ Subnet A1     │ │ Subnet A2    │  │              │  │ Subnet B          │  │
│ │ 10.10.1.0/24  │ │ 10.10.2.0/24 │  │              │  │ 10.20.1.0/24      │  │
│ │               │ │              │  │<------------>│  │                   │  │
│ │ Client-OK     │ │ Client-BAD   │  │              │  │ Server            │  │
│ │               │ │              │  │              │  │                   │  │
│ └───────────────┘ └──────────────┘  │              │  └───────────────────┘  │
│                                     │              │                         │
└─────────────────────────────────────┘              └─────────────────────────┘

VPC B Route Table (THE MISCONFIGURATION):
┌────────────────────┬─────────────────────┐
│ Destination        │ Target              │
├────────────────────┼─────────────────────┤
│ 10.10.1.0/24       │ pcx-xxx (peering)   │ ✅ Only routes to Subnet A1!
│ 10.10.2.0/24       │ (missing!)          │ ❌ No route to Subnet A2
└────────────────────┴─────────────────────┘
The Misconfiguration
The key issue is in this Terraform resource (from main.tf):
# B -> A routes (INTENTIONALLY MISCONFIGURED!)
# Only route back to A subnet 10.10.1.0/24 (a1), MISSING 10.10.2.0/24 (a2)
resource "aws_route" "b_to_a_only_a1" {
  route_table_id            = aws_route_table.b_rt.id
  destination_cidr_block    = "10.10.1.0/24" # Only subnet A1!
  vpc_peering_connection_id = aws_vpc_peering_connection.peer.id
}
This creates a route to only one of VPC A's subnets. Traffic from 10.10.2.0/24 (Subnet A2) can reach the server, but responses are dropped because VPC B has no route back to that subnet.
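The effect of the partial route can be reproduced offline with a short shell sketch. It is an illustration only: the helper names (ip_to_int, cidr_contains, check_return_route) and the two client addresses are hypothetical, and it models nothing more than matching a destination address against the CIDRs in VPC B's route table.

```shell
#!/usr/bin/env bash
# Sketch: does a list of return routes cover a given client address?
# The CIDRs come from the demo; the client IPs are hypothetical examples.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 if address $2 falls inside CIDR block $1.
cidr_contains() {
  local net=${1%/*} bits=${1#*/}
  local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( ( $(ip_to_int "$net") & mask ) == ( $(ip_to_int "$2") & mask ) ))
}

# Print the first route covering address $1, or report a blackhole.
# The remaining arguments are destination CIDRs from VPC B's route table.
check_return_route() {
  local ip=$1 cidr
  shift
  for cidr in "$@"; do
    if cidr_contains "$cidr" "$ip"; then
      echo "$ip: return route via $cidr"
      return 0
    fi
  done
  echo "$ip: NO return route (blackholed)"
  return 1
}

# VPC B's peering routes as deployed by the demo (Subnet A1 only).
routes_b=("10.10.1.0/24")

check_return_route 10.10.1.10 "${routes_b[@]}"          # client in Subnet A1
check_return_route 10.10.2.10 "${routes_b[@]}" || true  # client in Subnet A2
```

With the demo's single route, the first call reports a return route and the second reports a blackhole. Passing 10.10.0.0/16 instead makes both addresses resolve.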
Testing the Setup
After deploying the Terraform, you can test the misconfiguration using AWS SSM to run commands on the client instances:
1. Get the Server IP
SERVER_IP=$(terraform output -raw server_private_ip)
echo "Server IP: $SERVER_IP"
2. Test Client-OK
CLIENT_OK=$(terraform output -raw client_ok_id)
CMD_ID=$(aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --targets "Key=instanceids,Values=$CLIENT_OK" \
  --parameters "commands=curl -m 5 -s http://$SERVER_IP:8080" \
  --query "Command.CommandId" --output text)
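# Wait for the command to finish before reading its output
aws ssm wait command-executed \
  --command-id "$CMD_ID" \
  --instance-id "$CLIENT_OK"
aws ssm get-command-invocation \
  --command-id "$CMD_ID" \
  --instance-id "$CLIENT_OK" \
  --query "StandardOutputContent" --output text
Expected output: hello from VPC B server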
3. Test Client-BROKEN
CLIENT_BAD=$(terraform output -raw client_broken_id)
CMD_ID=$(aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --targets "Key=instanceids,Values=$CLIENT_BAD" \
  --parameters "commands=curl -m 5 -v http://$SERVER_IP:8080 || echo TIMEOUT" \
  --query "Command.CommandId" --output text)
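# Wait for the command to finish before reading its output
aws ssm wait command-executed \
  --command-id "$CMD_ID" \
  --instance-id "$CLIENT_BAD"
aws ssm get-command-invocation \
  --command-id "$CMD_ID" \
  --instance-id "$CLIENT_BAD" \
  --query "StandardOutputContent" --output text
Expected output: TIMEOUT (connection timeout error)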
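Steps 2 and 3 repeat the same send-then-fetch pattern, so it could be wrapped in one helper. The sketch below is an assumption about how you might do that: run_on_instance is a hypothetical name and is not part of the demo repository. It uses only the send-command, wait command-executed, and get-command-invocation subcommands of aws ssm.

```shell
# Sketch: run one shell command on an instance via SSM and print its stdout.
# run_on_instance is a hypothetical helper name, not part of the demo repo.
run_on_instance() {
  local instance_id=$1 command=$2 cmd_id

  cmd_id=$(aws ssm send-command \
    --document-name "AWS-RunShellScript" \
    --targets "Key=instanceids,Values=$instance_id" \
    --parameters "commands=$command" \
    --query "Command.CommandId" --output text) || return 1

  # Block until the invocation finishes. The waiter exits non-zero when the
  # remote command itself fails, so its status is deliberately ignored here.
  aws ssm wait command-executed \
    --command-id "$cmd_id" --instance-id "$instance_id" || true

  aws ssm get-command-invocation \
    --command-id "$cmd_id" --instance-id "$instance_id" \
    --query "StandardOutputContent" --output text
}
```

A call such as run_on_instance "$CLIENT_OK" "curl -m 5 -s http://$SERVER_IP:8080" would then stand in for the two-command sequence used in each test step.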
How Kestrel Detects and Fixes This
With Kestrel, this misconfiguration is detected as soon as it is introduced.
Kestrel observes traffic patterns across the VPCs and notices that requests are leaving one subnet without corresponding return traffic. It correlates that behavior with recent routing changes and identifies the missing return route in VPC B as the root cause.
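To make the idea of "requests without return traffic" concrete, here is a minimal sketch of pairing forward and reverse flows. It is not Kestrel's implementation. The input format (one "src dst" pair per line, as seen on the client-side network interfaces), the function name find_one_way_flows, and the sample addresses are all assumptions chosen for illustration.

```shell
# Sketch: flag flows that have a request but no matching reply.
# Input format is a simplified assumption: one "src dst" pair per line.
find_one_way_flows() {
  awk '
    { seen[$1 " " $2] = 1 }        # remember every src -> dst pair
    END {
      for (pair in seen) {
        split(pair, p, " ")
        # A request from VPC A (10.10.x.x) with no reverse record is suspect.
        if (p[1] ~ /^10\.10\./ && !((p[2] " " p[1]) in seen))
          print "one-way: " p[1] " -> " p[2]
      }
    }
  ' | sort
}

# Hypothetical sample: 10.20.1.5 is the server, the 10.10.x.10 hosts are clients.
find_one_way_flows <<'EOF'
10.10.1.10 10.20.1.5
10.20.1.5 10.10.1.10
10.10.2.10 10.20.1.5
EOF
```

On this sample, only the flow from 10.10.2.10 is reported, because it is the only request with no reply in the reverse direction. A real detector would also need time windows, ports, and correlation with route changes, all of which this sketch leaves out.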
⚠️ Routing Blackhole Detected
VPC B (vpc-xxx) has incomplete return routes for VPC peering pcx-xxx. Route exists for 10.10.1.0/24 but missing for 10.10.2.0/24. Traffic from Subnet A2 will timeout due to dropped response packets.
Once identified, Kestrel generates the exact fix. That fix can be applied immediately using AWS route table commands, or as a pull request against your Terraform or IaC repository with the correct route added.
AWS CLI
# Delete the partial route
aws ec2 delete-route \
  --route-table-id rtb-xxx \
  --destination-cidr-block 10.10.1.0/24
# Create a route covering the full VPC CIDR
aws ec2 create-route \
  --route-table-id rtb-xxx \
  --destination-cidr-block 10.10.0.0/16 \
  --vpc-peering-connection-id pcx-xxx
Terraform
resource "aws_route" "b_to_a_full" {
  route_table_id            = aws_route_table.b_rt.id
  destination_cidr_block    = aws_vpc.a.cidr_block # Full VPC CIDR
  vpc_peering_connection_id = aws_vpc_peering_connection.peer.id
}
This happens just seconds after the misconfiguration is introduced, while the issue is still limited to flaky behavior and before it escalates into a full outage.
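For a manual check that the return route is in place after either fix, the route table can be listed with the AWS CLI. The wrapper name show_peering_routes below is hypothetical. The underlying call is aws ec2 describe-route-tables, and the route table ID is whatever VPC B's table is in your account.

```shell
# Sketch: list destination, peering target, and state for each route.
# show_peering_routes is a hypothetical helper; pass VPC B's route table ID.
show_peering_routes() {
  aws ec2 describe-route-tables \
    --route-table-ids "$1" \
    --query "RouteTables[].Routes[].[DestinationCidrBlock,VpcPeeringConnectionId,State]" \
    --output text
}
```

Running show_peering_routes rtb-xxx after the fix should include a row for 10.10.0.0/16 that points at the peering connection. Routes that do not use peering, such as the local route, show None in the second column.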
Once a fix is applied, you can roll back any changes with a single click. Kestrel also lets you launch AI agents to investigate your cloud infrastructure and confirm the incident was successfully resolved.
Kestrel automatically identifies all resources affected by the incident, making it easy to understand the blast radius and scope of impact across your infrastructure.
Using Your Team's Tribal Knowledge
Kestrel doesn't just rely on AWS signals - it ingests your team's tribal knowledge from runbooks, documentation, and past incidents. By connecting to Slack, Confluence, Jira, Linear, Glean, and other tools, Kestrel learns how your team has solved similar problems before.
Why This Matters
VPC peering misconfigurations like this are incredibly common in production environments:
- Multi-account architectures - Teams manage routes across accounts
- Infrastructure as Code drift - Manual console changes create inconsistencies
- Scaling - New subnets added without updating peering routes
Without automated detection, these issues can go unnoticed for weeks, causing intermittent failures that are extremely difficult to debug.
Try It Yourself
Want to see Kestrel resolve this in real time? Here's how:
- Clone the demo repository:
git clone https://github.com/KestrelAI/Demos.git
- Sign up for a free trial and connect your AWS account
- Apply the Terraform configuration:
cd Demos/vpc-peering-blackhole-demo && terraform init && terraform apply
- Watch Kestrel detect and resolve the misconfiguration in real time
Start Your Free Trial
Get 2 weeks free to test Kestrel with your own cloud infrastructure. Resolve misconfigurations before they become outages.