Oct 11, 2025

Kubernetes Incident Response

Real-time detection and remediation for Kubernetes incidents before they become outages.

Raman Varma

Kestrel now provides end-to-end incident response for Kubernetes clusters. We detect incidents in real time by ingesting events, pod logs, node conditions, network telemetry, and more, then generate ready-to-apply fixes that require explicit approval before deployment.

Kubernetes Incident Response Demo

Native Integration

Kestrel connects to the Kubernetes API server via kubernetes/client-go informers, watching workloads, services, namespaces, network policies, and events in real time. For network visibility, we integrate with CNIs including Cilium (via Hubble Relay), GKE Dataplane V2, and AWS VPC CNI. For L7 traffic visibility, we support the Istio service mesh via Envoy's Access Log Service or Ztunnel.

Kestrel automatically discovers and connects to available data sources in your cluster, so you don't have to configure anything. If Cilium is running, we connect to Hubble. If Istio is deployed, we tap into Envoy/Ztunnel.
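As a rough illustration of the discovery behavior described above, the sketch below maps components detected in a cluster to the data sources mentioned in this post. The `discover_sources` function, the component keys, and the source identifiers are hypothetical and are not taken from the Kestrel Operator. The split between sidecar mode (Envoy) and ambient mode (Ztunnel) is our assumption about how the two Istio options would be chosen.

```python
from typing import Dict, List, Set

# Hypothetical mapping from a detected component to the data source used for
# it. The pairings follow the prose above; the names are illustrative only.
SOURCE_FOR_COMPONENT: Dict[str, str] = {
    "cilium": "hubble-relay",
    "istio-sidecar": "envoy-access-log-service",  # assumption: sidecar mode
    "istio-ambient": "ztunnel",                   # assumption: ambient mode
}


def discover_sources(detected: Set[str]) -> List[str]:
    """Return data sources to connect to; the API server is always included."""
    sources = ["kubernetes-api"]
    # Sort so the result is deterministic regardless of set ordering.
    for component in sorted(detected):
        source = SOURCE_FOR_COMPONENT.get(component)
        if source is not None:  # unknown components are simply skipped
            sources.append(source)
    return sources
```

Calling `discover_sources({"cilium"})` in this sketch yields the API server plus Hubble Relay, and a cluster with no recognized components falls back to the API server alone.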

Diagram: Kestrel Operator <-> Kestrel Cloud

Inside the Kubernetes cluster, the Kestrel Operator (namespace: kestrel-ai; ingestion and fix execution) collects events and resources (workloads, services) from the API server, network flows from the CNI / service mesh (L3/L4 from the CNI, L7 from the mesh), and container logs, events, and conditions from pods and nodes. It sends metadata, flows, and logs over a bidirectional, mTLS-encrypted gRPC stream to the Kestrel Cloud data plane, which runs (1) Signal Correlation, (2) Root Cause Analysis, and (3) the Remediation Engine.

The open-source Kestrel Operator runs in your cluster and establishes a bidirectional gRPC stream over mTLS to Kestrel Cloud. It continuously streams Kubernetes resource metadata, events, logs, and network flows to Kestrel Cloud.

The bidirectional stream also lets Kestrel Cloud perform real-time, read-only investigations inside your cluster, the same way a human engineer would use kubectl, iptables, conntrack, tcpdump, and eBPF tooling to debug issues, but faster and more thoroughly. By default, the operator runs with a read-only ClusterRole, requesting only the permissions needed for observability. Write permissions are opt-in and only required if you want to apply approved fixes from the Kestrel platform or Slack app.
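To make the read-only default concrete, here is a minimal sketch that checks whether a set of RBAC rules grants only the `get`, `list`, and `watch` verbs. The rule dictionaries mirror the shape of Kubernetes RBAC `rules` entries, but the specific rules shown are illustrative assumptions, not the operator's actual ClusterRole.

```python
from typing import Dict, List

# Verbs that only read cluster state.
READ_ONLY_VERBS = {"get", "list", "watch"}


def is_read_only(rules: List[Dict[str, List[str]]]) -> bool:
    """True if every rule grants only get/list/watch (no '*' and no writes)."""
    return all(set(rule.get("verbs", [])) <= READ_ONLY_VERBS for rule in rules)


# Illustrative observability-only rules (assumed, not the shipped ClusterRole).
observability_rules = [
    {"apiGroups": [""], "resources": ["pods", "services", "events"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["networking.k8s.io"], "resources": ["networkpolicies"],
     "verbs": ["get", "list", "watch"]},
]

# Opting in to fix execution would add write verbs such as patch.
remediation_rules = observability_rules + [
    {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": ["patch"]},
]
```

A check like this could run in CI against the rendered chart to confirm that write access appears only when it was deliberately enabled.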

Kubernetes Incident Detection Demo

Kestrel detects a wide range of Kubernetes infrastructure incidents. Here are a few examples:

  • OOMKilled pods and memory pressure
  • NetworkPolicy misconfigurations causing traffic drops
  • DNS resolution failures and CoreDNS issues
  • Pod scheduling failures and resource constraints
  • CNI reconciliation degradation
  • Service mesh misconfigurations and mTLS failures
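As one small example of the kind of signal involved, the sketch below finds OOMKilled containers in a pod object shaped like the JSON returned by the Kubernetes API (`status.containerStatuses[].state` and `lastState`). It illustrates the first item in the list only and is not Kestrel's detection logic; the function name is hypothetical.

```python
from typing import Any, Dict, List


def oom_killed_containers(pod: Dict[str, Any]) -> List[str]:
    """Names of containers whose current or last termination was OOMKilled."""
    names: List[str] = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # A container may be terminated now (state) or may have been restarted
        # after an OOM kill (lastState), so both fields are checked.
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break  # report each container at most once
    return names
```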

Purpose-Built Models

We trained our own model for Kubernetes incident response via supervised fine-tuning on a custom dataset of real-world incidents. The model is optimized for:

  • multi-step tool use with the Kubernetes API (kubectl, client-go), CNI debugging interfaces, and system-level tools like conntrack, iptables, and eBPF introspection
  • structured output constraints for YAML manifest generation
  • multi-hop reasoning to trace root causes across resource dependencies

We're also continuously improving model performance through RLHF on fix approvals and rejections.

Human-in-the-Loop

By default, all fixes require explicit approval before being applied, either via the Kestrel dashboard or via Slack. You can apply fixes directly to your cluster via the Kestrel Operator, or create pull requests via the Kestrel GitHub integration for GitOps workflows (ArgoCD, Flux).
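The approval requirement can be pictured as a small state machine in which a fix cannot be applied until it has been approved. The sketch below is a simplified illustration under that assumption; `ProposedFix`, `FixState`, and `apply_fix` are hypothetical names, not part of the Kestrel API.

```python
from dataclasses import dataclass
from enum import Enum


class FixState(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    APPLIED = "applied"


@dataclass
class ProposedFix:
    manifest: str                       # the generated YAML manifest
    state: FixState = FixState.PENDING  # every fix starts unapproved

    def approve(self) -> None:
        if self.state is not FixState.PENDING:
            raise ValueError(f"cannot approve a fix in state {self.state.value}")
        self.state = FixState.APPROVED

    def reject(self) -> None:
        if self.state is not FixState.PENDING:
            raise ValueError(f"cannot reject a fix in state {self.state.value}")
        self.state = FixState.REJECTED


def apply_fix(fix: ProposedFix) -> None:
    """Mark a fix as applied, refusing anything not explicitly approved."""
    if fix.state is not FixState.APPROVED:
        raise PermissionError("fix must be approved before it is applied")
    # A real implementation would apply the manifest or open a pull request here.
    fix.state = FixState.APPLIED
```

In this sketch, a rejected or still-pending fix raises `PermissionError` on `apply_fix`, so the default path never changes the cluster without a recorded approval.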

Installation

Deploy the Kestrel Operator to connect your cluster with a single Helm command:

helm install kestrel-operator \
  oci://ghcr.io/kestrelai/charts/kestrel-operator \
  --namespace kestrel-ai --create-namespace \
  -f kestrel-operator-values.yaml

See the quickstart guide or check out the open-source operator on GitHub.