Nov 25, 2025

Cloud Incident Response

Real-time detection and remediation for cloud infrastructure incidents before they become outages.

Raman Varma

Today, we're launching Cloud Incident Response - automated detection and remediation for cloud infrastructure incidents across AWS, GCP, Azure, and OCI. Kestrel monitors your cloud resources in real time, detects misconfigurations and failures, performs root cause analysis, and generates exact fixes as CLI commands, Terraform, or Pulumi code.

Multi-Cloud Integration

Kestrel ingests signals from your cloud providers' native monitoring, audit, and security systems. Some examples:

AWS - CloudTrail, CloudWatch, EventBridge, Config, Security Hub, GuardDuty, Health Dashboard
GCP - Cloud Audit Logs, Cloud Logging, Cloud Monitoring, Error Reporting, Security Command Center, Recommender
Azure - Activity Logs, Azure Monitor, Log Analytics, Service Health, Microsoft Defender for Cloud, Advisor
OCI - Audit, Logging, Monitoring, Events, Cloud Guard, Ops Insights
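
To illustrate what ingesting signals from several providers can involve, here is a minimal sketch that maps an AWS CloudTrail record and a GCP Cloud Audit Logs entry onto one provider-neutral shape. The Signal class and the function names are hypothetical and are not Kestrel's actual schema; only the input field names (eventSource, eventName, eventTime, protoPayload, timestamp) come from the providers' log formats.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Signal:
    """Provider-neutral event record (hypothetical schema for illustration)."""
    provider: str
    service: str
    action: str
    timestamp: datetime


def _parse_ts(value: str) -> datetime:
    # Both formats use RFC 3339 timestamps. Replace a trailing "Z" so that
    # datetime.fromisoformat accepts it on older Python versions.
    # Assumption: no sub-microsecond fractions; real entries may need trimming.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def from_cloudtrail(record: dict) -> Signal:
    # eventSource, eventName, and eventTime are standard CloudTrail record fields.
    return Signal(
        provider="aws",
        service=record["eventSource"],
        action=record["eventName"],
        timestamp=_parse_ts(record["eventTime"]),
    )


def from_gcp_audit(entry: dict) -> Signal:
    # Cloud Audit Logs put the call details in the protoPayload of a LogEntry.
    payload = entry["protoPayload"]
    return Signal(
        provider="gcp",
        service=payload["serviceName"],
        action=payload["methodName"],
        timestamp=_parse_ts(entry["timestamp"]),
    )
```

Once events share one shape, they can be sorted and compared across providers without provider-specific branching downstream.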

Every Cloud Service, Every Incident

Kestrel detects and generates fixes for incidents across every cloud service - compute, networking, storage, databases, streaming, security, IAM, containers, serverless, ML, and more. Here are some examples:

VPC peering, routing, and firewall misconfigurations
Security group, NACL, and network security group issues
Kafka/MSK cluster capacity, under-replicated partitions, and ISR thrashing
Kubernetes cluster connectivity and node failures (EKS, GKE, AKS, OKE)
Database replication lag, failover events, and connection exhaustion (RDS, Cloud SQL, Azure SQL, OCI DB)
Load balancer health check failures and target group issues
Storage quota exhaustion and access permission changes

Root Cause Analysis

When Kestrel detects an incident, it investigates iteratively using read-only tool calls against your cloud account - the same way a human engineer would debug, but faster and more thoroughly. Kestrel builds a timeline of events, correlates signals across services, and identifies the root cause with supporting evidence.
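
As a rough illustration of timeline building and correlation, the sketch below orders events by time and returns the configuration changes that shortly precede the first failure on the same resource. It is a simplified assumption of how such a step could look, not Kestrel's algorithm: the event keys (time, kind, resource) and the 15-minute window are hypothetical, and a real investigation would also follow dependencies between resources rather than match identical resource IDs.

```python
from datetime import datetime, timedelta


def candidate_causes(events, window=timedelta(minutes=15)):
    """Return change events that precede the first failure on the same resource.

    `events` is a list of dicts with hypothetical keys:
    time (datetime), kind ("change" or "failure"), and resource (str).
    """
    # Build the timeline: oldest event first.
    timeline = sorted(events, key=lambda e: e["time"])

    # Anchor the search on the earliest failure in the timeline.
    first_failure = next((e for e in timeline if e["kind"] == "failure"), None)
    if first_failure is None:
        return []

    # Keep changes to the failing resource made within `window` before the failure.
    return [
        e
        for e in timeline
        if e["kind"] == "change"
        and e["resource"] == first_failure["resource"]
        and timedelta(0) <= first_failure["time"] - e["time"] <= window
    ]
```

The returned events are only candidates; each one would still need supporting evidence before being named as the root cause.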

Kestrel also searches your tribal knowledge sources (Slack, Notion, Confluence, Jira, Glean, Linear, GitHub) for relevant past incidents and runbooks to inform the investigation.

Multi-Format Fixes

Kestrel generates exact, ready-to-apply fixes - not generic suggestions. Apply fixes with a single click directly from the Kestrel platform, or raise a PR against your IaC repository. Our fixes achieve 96% remediation accuracy, and that accuracy continues to improve through RLHF on fix approvals and rejections. Teams using Kestrel reduce MTTR by over 90%.

Fixes are generated in your preferred format:

Cloud CLI - Ready-to-execute commands (AWS CLI, gcloud, az, OCI CLI) for immediate remediation
Terraform - HCL code that integrates into your existing IaC workflows
Pulumi - Go/TypeScript/Python/C#/Java code for Pulumi users

For complex operations like Kafka partition reassignment, Kestrel can execute multi-step remediation workflows via SSM on EC2 instances - handling file creation, command execution, and async polling automatically.
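
The command execution and async polling steps can be sketched as follows. This is a minimal illustration under stated assumptions, not Kestrel's implementation: the function name run_and_wait and its defaults are hypothetical, and the ssm argument is expected to behave like a boto3 SSM client, whose send_command and get_command_invocation calls are used here with the AWS-RunShellScript document.

```python
import time

# Invocation statuses after which SSM will not change the result any further.
TERMINAL_STATUSES = {"Success", "Failed", "Cancelled", "TimedOut"}


def run_and_wait(ssm, instance_id, commands, poll_seconds=2.0, timeout_seconds=300.0):
    """Run shell commands on one instance through SSM and poll until they finish.

    `ssm` should behave like boto3.client("ssm"). `commands` is a list of shell
    lines, so a file-creation step can simply be one of the commands.
    """
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": commands},
    )
    command_id = response["Command"]["CommandId"]

    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        # Note: against real AWS, the invocation may not be visible immediately
        # after send_command; production code should tolerate that error.
        invocation = ssm.get_command_invocation(
            CommandId=command_id, InstanceId=instance_id
        )
        if invocation["Status"] in TERMINAL_STATUSES:
            return invocation
        time.sleep(poll_seconds)

    raise TimeoutError(f"SSM command {command_id} did not finish in {timeout_seconds}s")
```

Passing the client in as an argument keeps the polling logic testable with a stub, and a multi-step workflow can call run_and_wait once per step, stopping at the first status other than Success.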

See It In Action

Watch how Kestrel detects and remediates real cloud infrastructure incidents:

Cloud Networking Incident Detection & Remediation

A deep dive into how Kestrel automatically identifies and remediates complex networking misconfigurations - before they become outages.

Kafka/MSK Production Incident Response

Managing Kafka in production requires deep expertise in both cloud infrastructure and Kafka internals. Kestrel brings both, automatically resolving incidents like under-replicated partitions, ISR thrashing, and consumer group rebalance storms.
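
For readers less familiar with Kafka internals: a partition is under-replicated when its in-sync replica (ISR) set is smaller than its assigned replica set. The check below is a minimal sketch of that definition. The input shape (dicts with topic, partition, replicas, and isr keys) is a hypothetical stand-in for partition metadata, not a format Kestrel or Kafka prescribes.

```python
def under_replicated(partitions):
    """Return (topic, partition) pairs whose ISR is smaller than the replica set.

    `partitions` is a list of dicts with hypothetical keys:
    topic (str), partition (int), replicas (list of broker IDs), isr (list of broker IDs).
    """
    return [
        (p["topic"], p["partition"])
        for p in partitions
        if len(p["isr"]) < len(p["replicas"])
    ]
```

Running this check on successive metadata snapshots would also show ISR thrashing, where the same partition repeatedly enters and leaves the result.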

Purpose-Built Models

We trained a model for cloud incident response via supervised fine-tuning on a custom dataset of real-world cloud incidents. The model is optimized for:

Multi-step tool use with cloud CLIs (AWS CLI, gcloud, az, OCI CLI), service-specific APIs, and system-level debugging tools
Structured output constraints for IaC generation (Terraform HCL, CloudFormation YAML, Pulumi code)
Multi-hop reasoning to trace root causes across interconnected cloud resources and services
Cross-service correlation to identify cascading failures and dependency issues

We continuously improve model performance through RLHF on fix approvals and rejections from your team.

Slack Integration

Cloud incidents appear in your Slack workspace alongside Kubernetes incidents. Review the AI-generated investigation, see the timeline of events, approve or reject fixes, and track resolution - all without leaving Slack.

Getting Started

To enable Cloud Incident Response, connect your cloud accounts via Integrations → Cloud. Kestrel uses cross-account roles (AWS IAM, GCP service accounts, Azure service principals, OCI policies) with read-only access for detection and scoped write access for remediation.
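
On AWS, a cross-account role is typically granted through a trust policy that lets an external account call sts:AssumeRole, usually guarded by an external ID. The sketch below builds such a policy document as a generic illustration of the pattern. The function name, the account ID, and the external ID are placeholders; the actual values and the exact permissions Kestrel requests are not specified here and would come from the Integrations setup.

```python
import json


def trust_policy(trusted_account_id: str, external_id: str) -> dict:
    """Build an IAM trust policy allowing another account to assume this role.

    Both arguments are placeholders supplied by the caller; nothing here is
    specific to Kestrel.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                # The external ID guards against the confused-deputy problem.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Example with placeholder values (111122223333 is a documentation-style account ID).
policy_json = json.dumps(trust_policy("111122223333", "example-external-id"), indent=2)
```

Detection and remediation can then use separate roles, so that the read-only role never carries write permissions.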