Oct 11, 2025

Kubernetes Incident Response

Real-time detection and remediation for Kubernetes incidents before they become outages.

Raman Varma

Kestrel now provides full incident response for Kubernetes clusters. We detect incidents in real time by ingesting events, pod logs, node conditions, network telemetry, and more, then generate ready-to-apply fixes that require explicit approval before deployment.

Native Integration

Kestrel connects to the Kubernetes API Server via kubernetes/client-go informers, watching workloads, services, namespaces, network policies, and events in real time. For network visibility, we integrate with CNIs including Cilium (via Hubble Relay), GKE Dataplane V2, and AWS VPC CNI. For L7 traffic visibility, we support the Istio service mesh via Envoy's Access Log Service or Ztunnel.

Kestrel automatically discovers and connects to available data sources in your cluster, so you don't have to configure anything. If Cilium is running, we connect to Hubble. If Istio is deployed, we tap into Envoy/Ztunnel.
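To make the discovery rule concrete, here is a minimal sketch of the "connect to whatever is present" logic. It is an illustration under stated assumptions, not the operator's actual code: the component labels, the source names, and the `discover_sources` function are all hypothetical.

```python
# Hypothetical mapping from a detected cluster component to the data source
# it unlocks. These labels are illustrative, not identifiers used by Kestrel.
SOURCES_BY_COMPONENT = {
    "cilium": "hubble-relay",
    "istio-sidecar": "envoy-access-log-service",
    "istio-ambient": "ztunnel",
}


def discover_sources(detected_components):
    """Return the data sources to connect to, given what was found in the cluster."""
    # The Kubernetes API is always available, so it is always the first source.
    sources = ["kubernetes-api"]
    # Sort for a deterministic order; unknown components are simply ignored.
    for component in sorted(detected_components):
        source = SOURCES_BY_COMPONENT.get(component)
        if source is not None:
            sources.append(source)
    return sources
```

A cluster running only Cilium would yield the Kubernetes API plus Hubble Relay, and a cluster with none of the optional components would fall back to the Kubernetes API alone.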

Kestrel Operator <-> Kestrel Cloud

The open-source Kestrel Operator runs in your cluster and establishes a bidirectional gRPC stream over mTLS to Kestrel Cloud. It continuously streams Kubernetes resource metadata, events, logs, and network flows to Kestrel Cloud.

The bidirectional stream also lets Kestrel Cloud perform real-time, read-only investigations inside your cluster, the same way a human engineer would use kubectl, iptables, conntrack, tcpdump, and eBPF tooling to debug issues, but faster and more thoroughly. By default, the operator runs with a read-only ClusterRole, requesting only the permissions needed for observability. Write permissions are opt-in and only required if you want the ability to apply approved fixes from the Kestrel platform or Slack app.
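If you want to verify the read-only claim for any ClusterRole yourself, one simple approach is to check that every rule grants only the `get`, `list`, and `watch` verbs. The sketch below assumes the rules have been loaded into Python dicts in the same shape as the `rules` field of a ClusterRole manifest; `is_read_only` is a hypothetical helper, and the example rules are illustrative rather than the operator's real permission set.

```python
# Verbs that read state without modifying it.
READ_ONLY_VERBS = {"get", "list", "watch"}


def is_read_only(rules):
    """Return True if every RBAC rule grants only get/list/watch."""
    # A wildcard verb ("*") is not in READ_ONLY_VERBS, so it fails the check.
    return all(set(rule.get("verbs", [])) <= READ_ONLY_VERBS for rule in rules)


# Illustrative rules only, not the operator's actual ClusterRole.
observability_rules = [
    {"apiGroups": [""], "resources": ["pods", "events"], "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["networking.k8s.io"], "resources": ["networkpolicies"], "verbs": ["list", "watch"]},
]
```

Adding a rule with `patch` or `*` in its verbs would make the check return `False`, which is the signal that write access has been opted into.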

Kestrel detects a wide range of Kubernetes infrastructure incidents. Here are a few examples:

OOMKilled pods and memory pressure
NetworkPolicy misconfigurations causing traffic drops
DNS resolution failures and CoreDNS issues
Pod scheduling failures and resource constraints
CNI reconciliation degradation
Service mesh misconfigurations and mTLS failures
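To give a feel for the simplest of these signals, the sketch below flags OOMKilled containers from a Pod object as returned by the Kubernetes API in JSON form. It relies on the standard `status.containerStatuses[].state` and `lastState` fields, where a terminated container carries `reason: OOMKilled`. The function name is hypothetical, and this is a toy check, not a description of how Kestrel's detection works.

```python
def oom_killed_containers(pod):
    """Return names of containers whose current or previous termination was an OOM kill."""
    names = []
    statuses = (pod.get("status") or {}).get("containerStatuses") or []
    for cs in statuses:
        # A restarted container reports the kill under lastState; a container
        # that has not restarted yet reports it under state.
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                names.append(cs["name"])
                break
    return names
```

Real detection would correlate a signal like this with memory limits, node memory pressure, and recent events. The point here is only that the raw indicator is available in the Pod status.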

Purpose-Built Models

We trained our own model for Kubernetes incident response via supervised fine-tuning on a custom dataset of real-world incidents. The model is optimized for:

multi-step tool use with the Kubernetes API (kubectl, client-go), CNI debugging interfaces, and system-level tools like conntrack, iptables, and eBPF introspection
structured output constraints for YAML manifest generation
multi-hop reasoning to trace root causes across resource dependencies

We're also continuously improving model performance through RLHF on fix approvals and rejections.

Human-in-the-Loop

By default, all fixes require explicit approval before being applied - either via the Kestrel dashboard or Slack. You can apply fixes directly to your cluster via the Kestrel Operator, or create pull requests via the Kestrel GitHub integration for GitOps workflows (ArgoCD, Flux).
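The approval requirement can be pictured as a small state machine in which a fix cannot reach the cluster unless it has passed through an approved state. The sketch below is a hypothetical model of that gate: `FixState`, `ProposedFix`, and the `apply_fn` callback are names invented for illustration and do not reflect Kestrel's internal API.

```python
from dataclasses import dataclass
from enum import Enum


class FixState(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    APPLIED = "applied"


@dataclass
class ProposedFix:
    manifest: str
    state: FixState = FixState.PENDING

    def approve(self):
        # Only a pending fix can be approved.
        if self.state is not FixState.PENDING:
            raise ValueError(f"cannot approve a fix in state {self.state.value}")
        self.state = FixState.APPROVED

    def reject(self):
        # Only a pending fix can be rejected.
        if self.state is not FixState.PENDING:
            raise ValueError(f"cannot reject a fix in state {self.state.value}")
        self.state = FixState.REJECTED

    def apply(self, apply_fn):
        # The gate: nothing is applied without an explicit approval first.
        if self.state is not FixState.APPROVED:
            raise PermissionError("fix must be approved before it is applied")
        apply_fn(self.manifest)
        self.state = FixState.APPLIED
```

In this model, `apply_fn` stands in for either path described above: applying the manifest through the operator or opening a pull request for a GitOps workflow.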

Installation

Deploy the Kestrel Operator to connect your cluster with a single Helm command:

helm install kestrel-operator \
  oci://ghcr.io/kestrelai/charts/kestrel-operator \
  --namespace kestrel-ai --create-namespace \
  -f kestrel-operator-values.yaml

See the quickstart guide or check out the open-source operator on GitHub.