Cloud Incident Response
Real-time detection and remediation for cloud infrastructure incidents before they become outages.
Today, we're launching Cloud Incident Response - automated detection and remediation for cloud infrastructure incidents across AWS, GCP, Azure, and OCI. Kestrel monitors your cloud resources in real time, detects misconfigurations and failures, performs root cause analysis, and generates exact fixes as cloud CLI commands, Terraform, or Pulumi code.
Multi-Cloud Integration
Kestrel ingests signals from your cloud providers' native monitoring, audit, and security systems. Some examples:
- AWS - CloudTrail, CloudWatch, EventBridge, Config, SecurityHub, GuardDuty, Health Dashboard
- GCP - Cloud Audit Logs, Cloud Logging, Cloud Monitoring, Error Reporting, Security Command Center, Recommender
- Azure - Activity Logs, Azure Monitor, Log Analytics, Service Health, Microsoft Defender for Cloud, Advisor
- OCI - Audit, Logging, Monitoring, Events, Cloud Guard, Ops Insights
Every Cloud Service, Every Incident
Kestrel detects and generates fixes for incidents across every cloud service - compute, networking, storage, databases, streaming, security, IAM, containers, serverless, ML, and more. Here are some examples:
- VPC peering, routing, and firewall misconfigurations
- Security group, NACL, and network security group issues
- Kafka/MSK cluster capacity, under-replicated partitions, and ISR thrashing
- Kubernetes cluster connectivity and node failures (EKS, GKE, AKS, OKE)
- Database replication lag, failover events, and connection exhaustion (RDS, Cloud SQL, Azure SQL, OCI DB)
- Load balancer health check failures and target group issues
- Storage quota exhaustion and access permission changes
Root Cause Analysis
When Kestrel detects an incident, it investigates iteratively using read-only tool calls against your cloud account - the same way a human engineer would debug, but faster and more thoroughly. Kestrel builds a timeline of events, correlates signals across services, and identifies the root cause with supporting evidence.
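To make the timeline and correlation idea concrete, here is a minimal Python sketch. It is not Kestrel's implementation: the `Event` type, the `build_timeline` and `candidate_causes` helpers, the 30-minute window, and the resource IDs are all assumptions made for illustration. It merges events from two sources into one ordered timeline, then lists the configuration changes that landed shortly before the first symptom.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """Hypothetical normalized event; field names are illustrative only."""
    time: datetime
    source: str    # e.g. "CloudTrail" or "CloudWatch"
    resource: str  # resource the event refers to
    kind: str      # "change" (config mutation) or "symptom" (alarm, error)
    detail: str


def build_timeline(*streams):
    """Merge per-source event lists into one chronologically ordered timeline."""
    return sorted((e for stream in streams for e in stream), key=lambda e: e.time)


def candidate_causes(timeline, window=timedelta(minutes=30)):
    """Return changes made within `window` before the first symptom, newest first."""
    first_symptom = next((e for e in timeline if e.kind == "symptom"), None)
    if first_symptom is None:
        return []
    changes = [
        e for e in timeline
        if e.kind == "change"
        and first_symptom.time - window <= e.time <= first_symptom.time
    ]
    return sorted(changes, key=lambda e: e.time, reverse=True)


# Made-up example data: a security group change shortly before a database symptom.
t0 = datetime(2024, 1, 1, 12, 0)
audit = [
    Event(t0, "CloudTrail", "sg-0abc1234", "change",
          "RevokeSecurityGroupIngress tcp/5432"),
    Event(t0 - timedelta(hours=3), "CloudTrail", "rtb-0def5678", "change",
          "ReplaceRoute"),
]
metrics = [
    Event(t0 + timedelta(minutes=2), "CloudWatch", "db-prod", "symptom",
          "DatabaseConnections dropped to 0"),
]

timeline = build_timeline(audit, metrics)
causes = candidate_causes(timeline)
for event in causes:
    print(event.time.isoformat(), event.resource, event.detail)
```

Time proximity alone only produces candidates. A real investigation would still need to confirm each one against the resource dependency graph, which this sketch leaves out.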
Kestrel also searches your tribal knowledge sources (Slack, Notion, Confluence, Jira, Glean, Linear, GitHub) for relevant past incidents and runbooks to inform the investigation.
Multi-Format Fixes
Kestrel generates exact, ready-to-apply fixes - not generic suggestions. Apply fixes with a single click directly from the Kestrel platform, or raise a PR against your IaC repository. Our fixes achieve 96% remediation accuracy and keep improving through reinforcement learning from human feedback (RLHF) on fix approvals and rejections. Teams using Kestrel reduce MTTR by over 90%.
Fixes are generated in your preferred format:
- Cloud CLI - Ready-to-execute commands (AWS CLI, gcloud, az, OCI CLI) for immediate remediation
- Terraform - HCL code that integrates into your existing IaC workflows
- Pulumi - Go/TypeScript/Python/C#/Java code for Pulumi users
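As an illustration of one fix expressed in more than one format, the Python sketch below renders a single hypothetical remediation (re-allowing a TCP port on a security group) as both an AWS CLI command and a Terraform resource. The `IngressFix` type, the helper names, and the group ID, port, and CIDR values are assumptions for this example, not Kestrel's output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressFix:
    """Hypothetical fix description: allow one TCP port from one CIDR range."""
    group_id: str
    port: int
    cidr: str


def to_aws_cli(fix):
    """Render the fix as an AWS CLI command."""
    return (
        "aws ec2 authorize-security-group-ingress "
        f"--group-id {fix.group_id} --protocol tcp "
        f"--port {fix.port} --cidr {fix.cidr}"
    )


def to_terraform(fix, name="restore_ingress"):
    """Render the same fix as a Terraform aws_security_group_rule resource."""
    return "\n".join([
        f'resource "aws_security_group_rule" "{name}" {{',
        '  type              = "ingress"',
        f'  security_group_id = "{fix.group_id}"',
        '  protocol          = "tcp"',
        f'  from_port         = {fix.port}',
        f'  to_port           = {fix.port}',
        f'  cidr_blocks       = ["{fix.cidr}"]',
        "}",
    ])


# Made-up values for illustration.
fix = IngressFix(group_id="sg-0abc1234", port=5432, cidr="10.0.0.0/16")
cli = to_aws_cli(fix)
hcl = to_terraform(fix)
print(cli)
print(hcl)
```

Keeping the fix as structured data and rendering it last means the CLI and IaC versions cannot drift apart, which matters when one is applied immediately and the other is raised as a PR.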
For complex operations like Kafka partition reassignment, Kestrel can execute multi-step remediation workflows via SSM on EC2 instances - handling file creation, command execution, and async polling automatically.
See It In Action
Watch how Kestrel detects and remediates real cloud infrastructure incidents:
Cloud Networking Incident Detection & Remediation
A deep dive into how Kestrel automatically identifies and remediates complex networking misconfigurations - before they become outages.
Kafka/MSK Production Incident Response
Managing Kafka in production requires deep expertise in both cloud infrastructure and Kafka internals. Kestrel brings both, automatically resolving incidents like under-replicated partitions, ISR thrashing, and consumer group rebalance storms.
Purpose-Built Models
We trained a model for cloud incident response via supervised fine-tuning on a custom dataset of real-world cloud incidents. The model is optimized for:
- Multi-step tool use with cloud CLIs (AWS CLI, gcloud, az, OCI CLI), service-specific APIs, and system-level debugging tools
- Structured output constraints for IaC generation (Terraform HCL, CloudFormation YAML, Pulumi code)
- Multi-hop reasoning to trace root causes across interconnected cloud resources and services
- Cross-service correlation to identify cascading failures and dependency issues
We continuously improve model performance through RLHF on fix approvals and rejections from your team.
Slack Integration
Cloud incidents appear in your Slack workspace alongside Kubernetes incidents. Review the AI-generated investigation, see the timeline of events, approve or reject fixes, and track resolution - all without leaving Slack.
Getting Started
To enable Cloud Incident Response, connect your cloud accounts via Integrations → Cloud. Kestrel uses cross-account roles (AWS IAM, GCP service accounts, Azure service principals, OCI policies) with read-only access for detection and scoped write access for remediation.
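For AWS, a cross-account role starts with a trust policy that lets the vendor's account assume it. The Python sketch below builds one in the standard IAM policy format. The account ID is a placeholder, the `trust_policy` helper is hypothetical, and the external ID condition is an assumption based on common AWS practice for third-party access; the setup flow under Integrations → Cloud defines the actual values.

```python
import json


def trust_policy(vendor_account_id, external_id):
    """Build an IAM trust policy allowing another account to assume this role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{vendor_account_id}:root"},
                "Action": "sts:AssumeRole",
                # Assumed here: an external ID guards against confused-deputy misuse.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder values for illustration only.
policy = trust_policy("111122223333", "example-external-id")
print(json.dumps(policy, indent=2))
```

The trust policy only controls who may assume the role. What the role can do is set separately by its permission policies, which is where the split between read-only detection and scoped write access for remediation would be enforced.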