Kubernetes Incident Response
Real-time detection and remediation for Kubernetes incidents before they become outages.
Kestrel now provides full incident response for Kubernetes clusters. We detect incidents in real time by ingesting events, pod logs, node conditions, network telemetry, and more, then generate ready-to-apply fixes that require explicit approval before deployment.

Native Integration
Kestrel connects to the Kubernetes API Server via kubernetes/client-go informers, watching workloads, services, namespaces, network policies, and events in real time. For network visibility, we integrate with CNIs including Cilium (via Hubble Relay), GKE Dataplane V2, and AWS VPC CNI. For L7 traffic visibility, we support the Istio service mesh via Envoy's Access Log Service or Ztunnel.
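To make the informer idea concrete, here is a minimal sketch of the cache-update step that a watch-based client performs. It is written in Python for brevity and is not client-go's API; real informers also handle resyncs, resource versions, and indexing. The function name `apply_watch_event` and the sample objects are illustrative assumptions, while the event types `ADDED`, `MODIFIED`, and `DELETED` are the ones the Kubernetes watch API emits.

```python
def apply_watch_event(cache, event):
    """Apply one Kubernetes watch event to an in-memory cache.

    The cache is keyed by (namespace, name). Cluster-scoped objects have no
    namespace, so an empty string is used for them.
    """
    obj = event["object"]
    meta = obj["metadata"]
    key = (meta.get("namespace", ""), meta["name"])
    if event["type"] in ("ADDED", "MODIFIED"):
        cache[key] = obj          # store or overwrite the latest version
    elif event["type"] == "DELETED":
        cache.pop(key, None)      # tolerate deletes for objects never seen
    return cache


# Hypothetical event stream for a single pod.
events = [
    {"type": "ADDED", "object": {"metadata": {"namespace": "shop", "name": "api-0"}, "phase": "Pending"}},
    {"type": "MODIFIED", "object": {"metadata": {"namespace": "shop", "name": "api-0"}, "phase": "Running"}},
    {"type": "DELETED", "object": {"metadata": {"namespace": "shop", "name": "api-0"}, "phase": "Running"}},
]

cache = {}
snapshots = []
for event in events:
    apply_watch_event(cache, event)
    snapshots.append(dict(cache))  # copy so each step can be inspected later
```

Keeping a local cache like this is what lets a watcher answer "what does the cluster look like right now?" without re-listing every resource.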
Kestrel automatically discovers and connects to available data sources in your cluster, so you don't have to configure anything. If Cilium is running, we connect to Hubble. If Istio is deployed, we tap into Envoy/Ztunnel.
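The discovery step can be pictured as a lookup from workloads found in the cluster to the telemetry source they expose. The sketch below is an assumption about how such a mapping might look, not Kestrel's actual logic. The workload names (`hubble-relay`, `istiod`, `ztunnel`) are the usual defaults for Cilium and Istio installs, and real clusters may rename them.

```python
# Assumed default workload names mapped to the data source they imply.
# Real installs may rename or relocate these workloads.
KNOWN_SOURCES = {
    "hubble-relay": "Cilium flows via Hubble Relay",
    "istiod": "Istio L7 traffic via Envoy Access Log Service",
    "ztunnel": "Istio ambient traffic via Ztunnel",
}


def discover_sources(workload_names):
    """Return descriptions of the data sources implied by the given workloads.

    Results follow the declaration order of KNOWN_SOURCES so the output is
    stable regardless of the order in which workloads were listed.
    """
    found = set(workload_names)
    return [desc for name, desc in KNOWN_SOURCES.items() if name in found]
```

For example, a cluster running `coredns`, `ztunnel`, and `hubble-relay` would yield the Hubble Relay and Ztunnel sources, and unrelated workloads are ignored.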
Kestrel Operator <-> Kestrel Cloud
The open-source Kestrel Operator runs in your cluster and establishes a bidirectional gRPC stream over mTLS to Kestrel Cloud. It continuously streams Kubernetes resource metadata, events, logs, and network flows to Kestrel Cloud.
The bidirectional stream also lets Kestrel Cloud perform real-time, read-only investigations inside your cluster, the same way a human engineer would use kubectl, iptables, conntrack, tcpdump, and eBPF tooling to debug issues, but faster and more thoroughly. By default, the operator runs with a read-only ClusterRole, requesting only the permissions needed for observability. Write permissions are opt-in and only required if you want the ability to apply approved fixes from the Kestrel platform or Slack app.
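One way to check the read-only claim for any ClusterRole is to confirm that its rules grant nothing beyond `get`, `list`, and `watch`. The sketch below does that over rules expressed as plain dictionaries. The rules shown are hypothetical and are not copied from the operator's chart; only the RBAC field names (`apiGroups`, `resources`, `verbs`) come from Kubernetes itself.

```python
READ_ONLY_VERBS = {"get", "list", "watch"}


def write_verbs(rules):
    """Return the sorted verbs in a list of RBAC rules that are not read-only.

    A wildcard verb ("*") is reported too, since it grants write access.
    """
    verbs = {verb for rule in rules for verb in rule.get("verbs", [])}
    return sorted(verbs - READ_ONLY_VERBS)


# Hypothetical read-only rules covering the kinds of resources named above.
read_only_rules = [
    {
        "apiGroups": [""],
        "resources": ["pods", "pods/log", "events", "nodes", "services", "namespaces"],
        "verbs": ["get", "list", "watch"],
    },
    {
        "apiGroups": ["networking.k8s.io"],
        "resources": ["networkpolicies"],
        "verbs": ["get", "list", "watch"],
    },
]

# Hypothetical opt-in rule that would allow applying fixes.
opt_in_rules = read_only_rules + [
    {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": ["get", "patch"]},
]
```

Here `write_verbs(read_only_rules)` returns an empty list, while `write_verbs(opt_in_rules)` returns `["patch"]`, which makes the extra write permission easy to spot in review.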

Kestrel detects all Kubernetes infrastructure incidents. Here are a few examples:
- OOMKilled pods and memory pressure
- NetworkPolicy misconfigurations causing traffic drops
- DNS resolution failures and CoreDNS issues
- Pod scheduling failures and resource constraints
- CNI reconciliation degradation
- Service mesh misconfigurations and mTLS failures
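As an illustration of the first item, an OOM kill is visible in a pod's status: the affected container's `state` or `lastState` holds a `terminated` entry whose `reason` is `OOMKilled`. The sketch below is a minimal detector over a pod object represented as a dictionary. The function name and the sample pod are illustrative assumptions, and this is not a description of Kestrel's detection pipeline.

```python
def oom_killed_containers(pod):
    """Return names of containers whose current or previous termination
    reason is OOMKilled, based on status.containerStatuses."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break  # count each container once
    return names


# Hypothetical pod: "api" was OOM-killed and restarted, "sidecar" is healthy.
example_pod = {
    "metadata": {"namespace": "shop", "name": "api-0"},
    "status": {
        "containerStatuses": [
            {
                "name": "api",
                "restartCount": 3,
                "state": {"running": {}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
            {"name": "sidecar", "restartCount": 0, "state": {"running": {}}, "lastState": {}},
        ]
    },
}
```

Checking `lastState` as well as `state` matters because a restarted container is usually running again by the time anyone looks.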
Purpose-Built Models
We trained our own model for Kubernetes incident response via supervised fine-tuning on a custom dataset of real-world incidents. The model is optimized for:
- multi-step tool use with the Kubernetes API (kubectl, client-go), CNI debugging interfaces, and system-level tools like conntrack, iptables, and eBPF introspection
- structured output constraints for YAML manifest generation
- multi-hop reasoning to trace root causes across resource dependencies
We're also continuously improving model performance through RLHF on fix approvals and rejections.
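The multi-hop reasoning mentioned above can be pictured as a walk over a resource dependency graph. The sketch below is a conventional breadth-first search, offered only as an analogy for what "tracing root causes across resource dependencies" means. It is not how the model works internally, and the resource names and graph are hypothetical.

```python
from collections import deque


def trace_root_causes(start, depends_on, unhealthy):
    """Walk dependencies breadth-first from `start`.

    Returns (resource, hops) pairs for unhealthy resources none of whose own
    dependencies are unhealthy, i.e. the deepest points of failure reachable
    from the symptom.
    """
    seen = {start}
    queue = deque([(start, 0)])
    roots = []
    while queue:
        node, hops = queue.popleft()
        deps = depends_on.get(node, [])
        if node in unhealthy and not any(dep in unhealthy for dep in deps):
            roots.append((node, hops))
        for dep in deps:
            if dep not in seen:
                seen.add(dep)
                queue.append((dep, hops + 1))
    return roots


# Hypothetical dependency graph: a failing checkout pod three hops away
# from a misconfigured NetworkPolicy.
depends_on = {
    "Pod/checkout": ["Service/payments"],
    "Service/payments": ["Pod/payments"],
    "Pod/payments": ["NetworkPolicy/payments-ingress", "Node/worker-2"],
}
unhealthy = {
    "Pod/checkout",
    "Service/payments",
    "Pod/payments",
    "NetworkPolicy/payments-ingress",
}
```

With this input, `trace_root_causes("Pod/checkout", depends_on, unhealthy)` points at the NetworkPolicy three hops from the symptom instead of the pod that first raised the alert.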
Human-in-the-Loop
By default, all fixes require explicit approval before being applied - either via the Kestrel dashboard or Slack. You can apply fixes directly to your cluster via the Kestrel Operator, or create pull requests via the Kestrel GitHub integration for GitOps workflows (ArgoCD, Flux).
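The approval requirement amounts to a small state machine: a proposed fix starts as pending and can only be applied once someone has approved it. The sketch below shows one way to enforce that ordering. The class and method names are hypothetical and do not reflect the Kestrel Operator's code.

```python
from dataclasses import dataclass
from enum import Enum


class FixState(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    APPLIED = "applied"


@dataclass
class ProposedFix:
    manifest: str
    state: FixState = FixState.PENDING
    approver: str = ""

    def approve(self, approver):
        """Record an approval; only pending fixes can be approved."""
        if self.state is not FixState.PENDING:
            raise ValueError(f"cannot approve a fix in state {self.state.value}")
        self.state = FixState.APPROVED
        self.approver = approver

    def reject(self):
        """Record a rejection; only pending fixes can be rejected."""
        if self.state is not FixState.PENDING:
            raise ValueError(f"cannot reject a fix in state {self.state.value}")
        self.state = FixState.REJECTED

    def apply(self, apply_fn):
        """Run apply_fn(manifest), but only after explicit approval."""
        if self.state is not FixState.APPROVED:
            raise PermissionError("fix must be approved before it is applied")
        apply_fn(self.manifest)
        self.state = FixState.APPLIED
```

The `apply_fn` callback stands in for either delivery path described above: applying to the cluster directly or opening a pull request for a GitOps workflow.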
Installation
Deploy the Kestrel Operator to connect your cluster with a single Helm command:
helm install kestrel-operator \
  oci://ghcr.io/kestrelai/charts/kestrel-operator \
  --namespace kestrel-ai --create-namespace \
  -f kestrel-operator-values.yaml

See the quickstart guide or check out the open-source operator on GitHub.