Kafka · 9 min read

Managing Kafka in Production Is Hard. Kestrel Makes It Easy.

Managing Kafka in production requires deep expertise in both cloud infrastructure and Kafka internals. Kestrel brings both, automatically resolving incidents like under-replicated partitions, ISR thrashing, and consumer group rebalance storms.

January 7, 2025 · By Raman Varma, Co-founder & CEO

The Challenge of Running Kafka in Production

AWS MSK abstracts away some of the infrastructure complexity of running Kafka - provisioning, patching, and broker replacement are handled automatically. But operational complexity remains. Resolving incidents often requires deep Kafka expertise: understanding partition distribution, replication dynamics, consumer group coordination, and the nuances of tools like kafka-reassign-partitions.sh. Most organizations still need dedicated Kafka engineers or SREs with specialized knowledge to keep clusters healthy.

The challenge is that most incidents require action at two levels: the AWS infrastructure layer (scaling brokers, modifying storage, updating configurations) and the Kafka protocol layer (rebalancing partitions, managing consumer groups, inspecting topic state).

Kestrel bridges this gap with native access to both AWS APIs and Kafka's internal protocols - enabling it to detect incidents, correlate signals across layers, and generate complete fixes. In this post, we walk through a specific example: scaling an undersized cluster with under-replicated partitions.

Why Scaling MSK Requires More Than Adding Brokers

AWS MSK supports autoscaling for storage - when brokers run low on disk, MSK can automatically provision additional capacity. However, broker count scaling is a manual operation that requires careful coordination. When you add brokers to an MSK cluster, Kafka does not automatically redistribute existing partitions. The new brokers remain idle while the original brokers continue handling the full workload. To actually distribute load, you must explicitly run Kafka's kafka-reassign-partitions.sh tool - a multi-step process that generates a reassignment plan, executes it, and verifies completion.

After running: aws kafka update-broker-count --target-number-of-broker-nodes 4

BEFORE REBALANCING:
┌─────────────────────────────────────────────────────────────────┐
│  Broker 1: CPU [████████████] 95%   Partitions: P0, P2, P4, P6  │
│  Broker 2: CPU [███████████░] 90%   Partitions: P1, P3, P5, P7  │
│  Broker 3: CPU [░░░░░░░░░░░░] 0%    Partitions: (none)          │
│  Broker 4: CPU [░░░░░░░░░░░░] 0%    Partitions: (none)          │
└─────────────────────────────────────────────────────────────────┘

The new brokers have no partitions! Original brokers still overloaded.

AFTER kafka-reassign-partitions.sh:
┌─────────────────────────────────────────────────────────────────┐
│  Broker 1: CPU [██████░░░░░░] 48%   Partitions: P0, P4          │
│  Broker 2: CPU [██████░░░░░░] 45%   Partitions: P1, P5          │
│  Broker 3: CPU [██████░░░░░░] 48%   Partitions: P2, P6          │
│  Broker 4: CPU [██████░░░░░░] 45%   Partitions: P3, P7          │
└─────────────────────────────────────────────────────────────────┘

Now the load is balanced - but getting there requires running Kafka CLI commands, not just an AWS API call.

Demo Setup

We've created an open-source Terraform configuration that demonstrates this scenario. The demo deploys a minimal MSK cluster and a stress testing tool written in Go that generates enough load to trigger under-replication.

Clone this repo and follow along with your own AWS account.

Architecture Overview

The demo creates a minimal MSK cluster designed to buckle under moderate load. We intentionally use kafka.t3.small instances - the smallest broker type MSK offers - with only 2 brokers.

The Terraform configuration provisions:

  • MSK Cluster: A 2-broker cluster using kafka.t3.small instances across two availability zones. Each broker has 150GB of EBS storage and runs Kafka 3.5.1.

main.tf - MSK Configuration

resource "aws_msk_configuration" "demo" {
  name           = "msk-capacity-demo-config"
  kafka_versions = ["3.5.1"]

  # Replication factor of 2 with only 2 brokers means EVERY partition
  # must have replicas on BOTH brokers - any broker slowdown causes
  # under-replicated partitions
  server_properties = <<PROPERTIES
auto.create.topics.enable=true
default.replication.factor=2
min.insync.replicas=2
num.partitions=12
replica.lag.time.max.ms=10000
PROPERTIES
}

resource "aws_msk_cluster" "demo" {
  cluster_name           = "msk-capacity-demo"
  kafka_version          = "3.5.1"
  number_of_broker_nodes = 2  # Minimum - easily overwhelmed

  broker_node_group_info {
    instance_type   = "kafka.t3.small"  # Smallest instance type
    client_subnets  = [aws_subnet.msk_1.id, aws_subnet.msk_2.id]
    security_groups = [aws_security_group.msk.id]

    storage_info {
      ebs_storage_info {
        volume_size = 150
      }
    }
  }

  enhanced_monitoring = "PER_TOPIC_PER_BROKER"
}
  • VPC Infrastructure: A dedicated VPC with private subnets for the brokers and a public subnet for the client instance. Security groups allow Kafka traffic (port 9092) between the client and brokers.
  • EC2 Client Instance: A c5.4xlarge instance pre-configured with the Kafka CLI tools and our Go-based stress testing producer. SSM Agent is installed for remote command execution.
┌──────────────────────────────────────────────────────────────────────────┐
│                              AWS Account                                 │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                        VPC (10.0.0.0/16)                           │  │
│  │                                                                    │  │
│  │  ┌────────────────────────────┐  ┌──────────────────────────────┐  │  │
│  │  │      Private Subnets       │  │          EC2 Client          │  │  │
│  │  │                            │  │                              │  │  │
│  │  │  ┌────────┐   ┌────────┐   │  │   ┌──────────────────────┐   │  │  │
│  │  │  │Broker 1│   │Broker 2│   │  │   │  msk-capacity-demo   │   │  │  │
│  │  │  │t3.small│   │t3.small│   │  │   │      -client         │   │  │  │
│  │  │  └────────┘   └────────┘   │  │   │                      │   │  │  │
│  │  │                            │  │   │  - Kafka CLI         │   │  │  │
│  │  │        MSK Cluster         │  │   │  - SSM Agent         │   │  │  │
│  │  │     (kafka.t3.small)       │  │   │  - Go Producer       │   │  │  │
│  │  │              │             │  │   └──────────────────────┘   │  │  │
│  │  └──────────────┼─────────────┘  └───────────────┼──────────────┘  │  │
│  │                 │                                │                 │  │
│  │                 └───── Bootstrap connection ─────┘                 │  │
│  │                                │                                   │  │
│  │                                │ Metrics                           │  │
│  │                                ▼                                   │  │
│  │   ┌────────────────────────────────────────────────────────────┐   │  │
│  │   │                     CloudWatch Alarm                       │   │  │
│  │   │               UnderReplicatedPartitions > 0                │   │  │
│  │   └────────────────────────────────────────────────────────────┘   │  │
│  └────────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────┘
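With the infrastructure in place, the broker bootstrap addresses used by the Kafka CLI commands later in this post can be looked up with the AWS CLI. A quick example (the cluster ARN below is a placeholder - use the one from your own deployment):

# Returns the plaintext bootstrap broker string (port 9092)
aws kafka get-bootstrap-brokers \
  --cluster-arn arn:aws:kafka:us-east-2:123456789:cluster/msk-capacity-demo/abc123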

When the stress test runs, the sequence of events is:

1. The Go producer spawns 64 goroutines to send high-throughput messages with acks=1, maximizing write speed by only waiting for the leader:

stress_producer.go - Worker goroutines

// Start 64 producer goroutines - each one flooding the cluster
for i := 0; i < *workers; i++ {
    producerWg.Add(1)
    go func(workerID int, payload []byte) {
        defer producerWg.Done()

        keyBuf := make([]byte, 16)
        for {
            select {
            case <-ctx.Done():
                return
            default:
                rand.Read(keyBuf)  // Unique key for each message

                producer.Input() <- &sarama.ProducerMessage{
                    Topic: *topic,
                    Key:   sarama.ByteEncoder(keyBuf),
                    Value: sarama.ByteEncoder(payload),
                }

                atomic.AddInt64(&totalMessages, 1)
                atomic.AddInt64(&totalBytes, int64(*msgSize))
            }
        }
    }(i, messages[i])
}
2. The small brokers can't keep up with replication - CPU spikes to 90%+ as they struggle to replicate data to each other
3. The UnderReplicatedPartitions metric rises above zero as replicas fall behind
4. Kestrel detects the incident and begins investigation

Under-Replication Cascade

With a minimal 2-broker cluster and replication.factor=2, every partition must replicate to both brokers. Under heavy load:

  1. Brokers receive writes faster than they can replicate to followers
  2. Replication can't keep up - replicas fall out of the ISR (in-sync replica set)
  3. UnderReplicatedPartitions metric goes above zero
  4. Producers start timing out waiting for acknowledgments
  5. If a broker fails now, you could lose data
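You can watch this cascade from any host with the Kafka CLI. For example, kafka-topics.sh can list every partition whose in-sync replica set has shrunk below the replication factor (the bootstrap address below is a placeholder):

# List partitions whose ISR is smaller than the full replica set
/opt/kafka/bin/kafka-topics.sh \
  --bootstrap-server b-1.mycluster.xxx.kafka.us-east-2.amazonaws.com:9092 \
  --describe --under-replicated-partitions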

The Two-Step Fix

Resolving capacity issues in Kafka requires two distinct operations:

Step 1: Add Brokers

First, scale the cluster to add capacity. Kestrel generates fixes in multiple formats - AWS CLI for quick execution, or Terraform/Pulumi for teams managing infrastructure as code:

AWS CLI:

aws kafka update-broker-count \
  --cluster-arn arn:aws:kafka:us-east-2:123456789:cluster/my-cluster/abc123 \
  --current-version K2EUQ1WTGCTBG2 \
  --target-number-of-broker-nodes 4
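The --current-version argument is the cluster's current metadata version; if you don't have it handy, it can be looked up first (same placeholder ARN as above):

# Fetch the current cluster version required by update-broker-count
aws kafka describe-cluster \
  --cluster-arn arn:aws:kafka:us-east-2:123456789:cluster/my-cluster/abc123 \
  --query 'ClusterInfo.CurrentVersion' \
  --output text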

Terraform:

Scaling from 2 to 4 brokers requires additional subnets (MSK distributes brokers across subnets in different AZs):

# Add subnets for the new brokers
resource "aws_subnet" "msk_3" {
  vpc_id            = aws_vpc.msk.id
  cidr_block        = "10.0.3.0/24"
  availability_zone = data.aws_availability_zones.available.names[2]
  tags              = { Name = "msk-subnet-3" }
}

resource "aws_subnet" "msk_4" {
  vpc_id            = aws_vpc.msk.id
  cidr_block        = "10.0.4.0/24"
  availability_zone = data.aws_availability_zones.available.names[3]
  tags              = { Name = "msk-subnet-4" }
}

# Add route table associations
resource "aws_route_table_association" "msk_3" {
  subnet_id      = aws_subnet.msk_3.id
  route_table_id = aws_route_table.msk.id
}

resource "aws_route_table_association" "msk_4" {
  subnet_id      = aws_subnet.msk_4.id
  route_table_id = aws_route_table.msk.id
}

# Update the MSK cluster
resource "aws_msk_cluster" "my_cluster" {
  cluster_name           = "my-cluster"
  kafka_version          = "3.5.1"
- number_of_broker_nodes = 2
+ number_of_broker_nodes = 4

  broker_node_group_info {
    instance_type   = "kafka.t3.small"
-   client_subnets  = [aws_subnet.msk_1.id, aws_subnet.msk_2.id]
+   client_subnets  = [
+     aws_subnet.msk_1.id, aws_subnet.msk_2.id,
+     aws_subnet.msk_3.id, aws_subnet.msk_4.id
+   ]
    security_groups = [aws_security_group.msk.id]
    
    storage_info {
      ebs_storage_info {
        volume_size = 100
      }
    }
  }
}

This takes about 20-30 minutes as MSK provisions new broker nodes. But the job isn't done.

Step 2: Rebalance Partitions (Kafka CLI)

Kafka's kafka-reassign-partitions.sh tool moves partitions to the new brokers. This requires running commands on a machine with the Kafka CLI installed and network access to the cluster:

# 1. Create a topics.json file listing topics to rebalance
echo '{"topics": [{"topic": "stress-test"}], "version": 1}' > /tmp/topics.json

# 2. Generate the reassignment plan
/opt/kafka/bin/kafka-reassign-partitions.sh \
  --bootstrap-server b-1.mycluster.xxx.kafka.us-east-2.amazonaws.com:9092 \
  --topics-to-move-json-file /tmp/topics.json \
  --broker-list 1,2,3,4 \
  --generate > /tmp/reassignment-plan.txt

# 3. Extract and save the proposed reassignment
grep -A1 'Proposed partition' /tmp/reassignment-plan.txt | tail -1 > /tmp/reassignment.json

# 4. Execute the reassignment
/opt/kafka/bin/kafka-reassign-partitions.sh \
  --bootstrap-server b-1.mycluster.xxx.kafka.us-east-2.amazonaws.com:9092 \
  --reassignment-json-file /tmp/reassignment.json \
  --execute

# 5. Verify completion
/opt/kafka/bin/kafka-reassign-partitions.sh \
  --bootstrap-server b-1.mycluster.xxx.kafka.us-east-2.amazonaws.com:9092 \
  --reassignment-json-file /tmp/reassignment.json \
  --verify
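For reference, the generated plan is plain JSON mapping each partition to a proposed replica set. It looks roughly like this - the exact assignments will differ on your cluster:

# Inspect the plan before executing it
cat /tmp/reassignment.json
# {"version":1,"partitions":[
#   {"topic":"stress-test","partition":0,"replicas":[3,1],"log_dirs":["any","any"]},
#   {"topic":"stress-test","partition":1,"replicas":[4,2],"log_dirs":["any","any"]},
#   ...
# ]}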

How Kestrel Detects and Fixes This

Kestrel understands that MSK capacity issues require both AWS-level changes and Kafka-level operations. When it detects under-replicated partitions correlated with high CPU across all brokers, it generates a two-part fix: first adding brokers via the AWS API or IaC, then executing partition reassignment via Kafka CLI commands.

AWS-Level Fix: Add Brokers

Kestrel generates the AWS CLI command to scale the cluster, adding new brokers to handle the increased load - the same update-broker-count call shown in Step 1 above.

Kafka-Level Fix: Rebalance Partitions

After the brokers are added, Kestrel generates Kafka CLI commands to redistribute partitions across all brokers - the reassignment workflow from Step 2 - ensuring the new capacity is actually utilized.

Executing Kafka CLI Commands via SSM

The Kafka CLI commands are executed via AWS Systems Manager (SSM) on an EC2 instance in the same VPC as the cluster. Kestrel identifies these client instances during investigation and uses them to run the necessary commands.

┌─────────────────────────────────────────────────────────────────────────────┐
│                        SSM Command Execution Flow                           │
│                                                                             │
│   ┌─────────────┐         ┌─────────────┐         ┌─────────────────────┐   │
│   │   Kestrel   │         │   AWS SSM   │         │     EC2 Instance    │   │
│   │    Cloud    │         │   Service   │         │  (Kafka CLI client) │   │
│   └──────┬──────┘         └──────┬──────┘         └──────────┬──────────┘   │
│          │                       │                           │              │
│          │  1. SendCommand       │                           │              │
│          │  (via assumed role)   │                           │              │
│          │──────────────────────>│                           │              │
│          │                       │                           │              │
│          │                       │  2. Deliver command       │              │
│          │                       │  to SSM Agent             │              │
│          │                       │──────────────────────────>│              │
│          │                       │                           │              │
│          │                       │           3. Execute:     │              │
│          │                       │           - Create files  │              │
│          │                       │           - Run Kafka CLI │              │
│          │                       │           - Reassign      │              │
│          │                       │             partitions    │              │
│          │                       │                           │              │
│          │                       │  4. Return output         │              │
│          │                       │<──────────────────────────│              │
│          │                       │                           │              │
│          │  5. GetCommandInvocation                          │              │
│          │  (poll for results)   │                           │              │
│          │<──────────────────────│                           │              │
│          │                       │                           │              │
│          ▼                       ▼                           ▼              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  6. Update incident with command output and completion status       │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘
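Under the hood this maps to standard SSM API calls. A minimal sketch of steps 1 and 5 using the AWS CLI - the instance ID, command ID, and broker endpoint are placeholders, and Kestrel assembles the real values during investigation:

# 1. Send the rebalance command to the Kafka client instance
aws ssm send-command \
  --instance-ids i-0abc123example \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["/opt/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server b-1.mycluster.xxx.kafka.us-east-2.amazonaws.com:9092 --reassignment-json-file /tmp/reassignment.json --execute"]' \
  --query 'Command.CommandId' --output text

# 5. Poll for the command output using the returned command ID
aws ssm get-command-invocation \
  --command-id 11111111-2222-3333-4444-555555555555 \
  --instance-id i-0abc123example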

Coordinating Async Operations

Adding brokers to an MSK cluster is an asynchronous operation that can take 20-30 minutes. Kestrel doesn't just fire and forget - it actively monitors the operation's progress by polling the cluster state. Once the new brokers are fully provisioned and online, Kestrel immediately triggers the partition rebalancing via Kafka CLI commands through SSM, so no manual intervention is needed between the two steps.
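A rough sketch of that wait-then-rebalance coordination, expressed as a shell loop against the MSK API (the ARN is a placeholder; Kestrel performs the equivalent polling internally):

# Wait for the broker-count update to finish before starting the reassignment
CLUSTER_ARN=arn:aws:kafka:us-east-2:123456789:cluster/my-cluster/abc123

until [ "$(aws kafka describe-cluster --cluster-arn "$CLUSTER_ARN" \
           --query 'ClusterInfo.State' --output text)" = "ACTIVE" ]; do
  echo "Cluster still updating, waiting..."
  sleep 60
done
echo "New brokers are online - safe to run kafka-reassign-partitions.sh"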

Learning from Past Incidents

Kafka expertise is often siloed - that one engineer who knows exactly how to tune replica.fetch.max.bytes or why the last rebalancing took 6 hours. Kestrel captures this knowledge by connecting to your existing tools: Slack threads where the team debugged ISR shrinkage at 2am, Confluence runbooks for partition reassignment, Jira tickets documenting that time you upgraded broker instances.

When Kestrel investigates an MSK incident, it searches this institutional memory for relevant context. If your team has solved a similar under-replication issue before, Kestrel surfaces those learnings alongside its automated analysis - combining real-time cluster diagnostics with your organization's hard-won operational knowledge.

When Adding Brokers is the Right Fix

Not every under-replication issue calls for more brokers. Adding brokers + rebalancing helps when:

  • The cluster is genuinely capacity-bound (high CPU/network across all brokers)
  • You have partition hot spots and need to spread load
  • Traffic has grown beyond original capacity planning

Adding brokers doesn't help when:

  • A specific broker is down (fix that broker)
  • Disk is full (add storage or clean up data)
  • Configuration mismatch like min.insync.replicas > replication.factor

Kestrel correlates broker-level metrics with the replication health signal to determine which scenario applies before recommending a fix.
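One way to perform that correlation by hand is to compare the cluster's CloudWatch metrics - MSK publishes both signals under the AWS/Kafka namespace. For example (cluster name and broker ID match the demo; repeat per broker):

# Under-replicated partitions over the last 30 minutes for broker 1
# (swap the metric name for CpuUser to see the matching CPU pressure)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Kafka \
  --metric-name UnderReplicatedPartitions \
  --dimensions Name="Cluster Name",Value=msk-capacity-demo Name="Broker ID",Value=1 \
  --statistics Maximum \
  --start-time "$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time   "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300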

Try It Yourself

Want to see Kestrel detect and fix this in real time? Here's how:

  1. Clone the demo repository:
    git clone https://github.com/KestrelAI/Demos.git
  2. Sign up for a free trial and connect your AWS account
  3. Deploy the MSK cluster (~20 minutes for MSK provisioning):
    cd Demos/msk-capacity-demo && terraform init && terraform apply
  4. Run the stress test to trigger under-replication:
    ./scripts/run_demo.sh

    The stress test uses a high-performance Go producer with 64 goroutines, 1KB messages, and acks=1 for maximum throughput. This pushes 70-80 MB/sec per broker - enough to stress the t3.small brokers into under-replication.

  5. Watch Kestrel detect the incident and generate the complete two-step fix

Start Your Free Trial

Get 2 weeks free to test Kestrel with your own cloud infrastructure. Resolve misconfigurations before they become outages.

Register for Free Trial