Your A100s Show 40GB Free, But Training Jobs Won't Schedule. Here's Why.
MIG fragmentation is the silent killer of GPU cluster utilization. You have capacity on paper, but your training jobs are stuck in Pending. Kestrel detects this invisible problem and shows you exactly how to fix it.
The Hidden Cost of GPU Sharing
NVIDIA's Multi-Instance GPU (MIG) technology transformed how organizations share expensive A100 and H100 GPUs. Instead of dedicating an entire 40GB A100 to a small inference workload, you can slice it into up to 7 isolated instances - each with its own compute, memory, and cache.
But MIG introduces a problem that doesn't exist with traditional GPU allocation: fragmentation. Unlike CPU or memory, MIG partitions follow strict geometric constraints. Not every combination of slices is valid, and the slices you create affect which slices you can create next.
The result? Your monitoring dashboard shows 40GB of GPU memory "available," but your training job requesting a 2g.10gb MIG instance sits in Pending indefinitely. The memory exists - it's just in the wrong shape.
Understanding MIG Geometry
An A100 40GB GPU can be partitioned into MIG instances following specific profiles. Each profile name indicates compute and memory: 1g.5gb means 1 GPU slice with ~5GB memory, 2g.10gb means 2 GPU slices with ~10GB memory.
A100 40GB MIG Profiles
Profile     GPU Slices    Memory    Max Instances per GPU
─────────────────────────────────────────────────────────────
1g.5gb      1/7           5GB       7
2g.10gb     2/7           10GB      3
3g.20gb     3/7           20GB      2
4g.20gb     4/7           20GB      1
7g.40gb     7/7           40GB      1
The critical constraint: MIG instances must be created from contiguous GPU slices. You can't combine slices 0-1 with slices 5-6 to make a 4g.20gb. This is where fragmentation happens.
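You can see these geometry rules straight from the driver on a MIG-enabled node - nvidia-smi lists the supported GPU instance profiles and the slice positions each profile is allowed to occupy (output varies by GPU model and driver version):
# List the GPU instance profiles this GPU supports (profile IDs, slice counts, memory)
nvidia-smi mig -lgip
# List the possible placements for each profile, i.e. which contiguous
# slice positions a new instance may occupy
nvidia-smi mig -lgipp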
Example: How fragmentation occurs on a single A100
INITIAL STATE: Full 40GB A100 available
┌─────────────────────────────────────────────────────────────────────────┐
│ [ Slice 0-6: Unpartitioned 7g.40gb available ] │
└─────────────────────────────────────────────────────────────────────────┘
AFTER: Two inference pods request 1g.5gb each
┌─────────────────────────────────────────────────────────────────────────┐
│ [1g.5gb] [1g.5gb] [ 5 slices FREE ] │
│ Pod A Pod B (~25GB available) │
└─────────────────────────────────────────────────────────────────────────┘
AFTER: Training pod requests 3g.20gb - SUCCEEDS (contiguous slices 2-4)
┌─────────────────────────────────────────────────────────────────────────┐
│ [1g.5gb] [1g.5gb] [ 3g.20gb ] [ 2 slices FREE ] │
│ Pod A Pod B Training C (10GB available) │
└─────────────────────────────────────────────────────────────────────────┘
AFTER: Another inference pod requests 1g.5gb
┌─────────────────────────────────────────────────────────────────────────┐
│ [1g.5gb] [1g.5gb] [ 3g.20gb ] [1g.5gb] [1 slice] │
│ Pod A Pod B Training C Pod D FREE │
└─────────────────────────────────────────────────────────────────────────┘
NOW: New training job requests 2g.10gb - FAILS!
┌─────────────────────────────────────────────────────────────────────────┐
│ [1g.5gb] [1g.5gb] [ 3g.20gb ] [1g.5gb] [1 slice] │
│ Pod A Pod B Training C Pod D FREE │
│ ↑ │
│ Only 1 contiguous slice! │
│ Need 2 for 2g.10gb │
└─────────────────────────────────────────────────────────────────────────┘
RESULT: 5GB "free" but 2g.10gb (10GB) cannot be allocated
The lone free slice has no free neighbor, so nothing larger than a 1g.5gb can ever be created from it.
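On a real node you can confirm this kind of fragmentation from the driver itself - the placement column shows where each existing instance starts and how many slices it spans (assumes shell access to the GPU node):
# List existing GPU instances with their placement (start slice and size);
# the gaps between placements are the fragmentation
nvidia-smi mig -lgi
# List the compute instances running inside each GPU instance
nvidia-smi mig -lci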
Demo Setup
We've created a Helm chart that demonstrates MIG fragmentation on any Kubernetes cluster with NVIDIA GPUs and the GPU Operator installed. The demo deploys a mix of inference and training workloads that create fragmentation, then shows how Kestrel detects and resolves it.
Clone this repo and follow along with your own GPU cluster.
Prerequisites
This demo requires NVIDIA A100 or H100 GPUs with MIG (Multi-Instance GPU) enabled. MIG is only supported on certain data-center GPUs (Ampere and newer, such as the A100, A30, and H100) - GPUs like the T4, V100, or L4 won't work.
Setting Up a GKE Cluster with MIG
To demonstrate MIG fragmentation, we need the NVIDIA GPU Operator with mixed MIG strategy. This allows different pods to request different MIG profiles, with the operator dynamically creating instances on-demand.
Important: GKE requires specific configuration for the GPU Operator due to Container-Optimized OS (COS) filesystem constraints. The configuration below has been tested and verified to work.
Step 1: Create GKE cluster with A100 GPUs
# Create a GKE Standard cluster with A100 GPUs
# Do NOT use --gpu-partition-size (we want dynamic MIG via GPU Operator)
gcloud container clusters create mig-demo-cluster \
  --zone us-central1-a \
  --machine-type a2-highgpu-1g \
  --accelerator type=nvidia-tesla-a100,count=1 \
  --num-nodes=1 \
  --release-channel rapid
Step 2: Create PriorityClass for GPU Operator
# GKE has ResourceQuota that blocks system-critical PriorityClasses
# Create a custom PriorityClass for GPU Operator components
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: nvidia-gpu-operator
value: 100000
globalDefault: false
description: "Priority for NVIDIA GPU Operator components"
EOF
Step 3: Install NVIDIA GPU Operator with GKE-specific settings
# Add NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Install GPU Operator with GKE-specific configuration
# Key settings:
# - driver.enabled=false: Use GKE's pre-installed NVIDIA drivers
# - toolkit.installDir: Use exec-capable path (COS mounts /var with noexec)
# - hostPaths.driverInstallDir: Where GKE installs NVIDIA drivers
# - Custom PriorityClass to avoid GKE ResourceQuota restrictions
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false \
  --set toolkit.installDir=/home/kubernetes/bin/nvidia-toolkit \
  --set hostPaths.driverInstallDir=/home/kubernetes/bin/nvidia \
  --set mig.strategy=mixed \
  --set migManager.enabled=true \
  --set migManager.env[0].name=MIG_PARTED_REBOOT_IF_REQUIRED \
  --set-string migManager.env[0].value=true \
  --set migManager.env[1].name=WITH_REBOOT \
  --set-string migManager.env[1].value=true \
  --set dcgmExporter.enabled=false \
  --set daemonsets.priorityClassName=nvidia-gpu-operator \
  --set operator.priorityClassName=nvidia-gpu-operator \
  --set node-feature-discovery.priorityClassName=nvidia-gpu-operator
Step 4: Enable MIG mode on the GPU node
# Wait for GPU Operator pods to be running
kubectl get pods -n gpu-operator -w
# Once nvidia-mig-manager is running, apply MIG configuration
# This creates 7x 1g.5gb MIG instances on the A100
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl label node $NODE nvidia.com/mig.config=all-1g.5gb --overwrite
# The MIG Manager will reboot the node to enable MIG mode
# Wait for the node to come back up (2-3 minutes)
kubectl get nodes -w
# Verify MIG is configured
kubectl get node $NODE -o jsonpath='{.metadata.labels}' | grep mig.config.state
# Should show: "nvidia.com/mig.config.state":"success"
Why these settings?
- driver.enabled=false - GKE pre-installs NVIDIA drivers on GPU nodes
- toolkit.installDir=/home/kubernetes/bin/nvidia-toolkit - COS mounts /var with noexec; this path is exec-capable
- hostPaths.driverInstallDir - Points to GKE's driver installation location
- MIG_PARTED_REBOOT_IF_REQUIRED=true - Allows the MIG Manager to reboot the node to enable MIG mode
- dcgmExporter.enabled=false - DCGM profiling fails on GKE and isn't needed for MIG
- Custom PriorityClass - GKE's ResourceQuota blocks pods that use the system-node-critical PriorityClass
Wait for all GPU Operator pods to be ready (this may take 2-3 minutes):
# Watch GPU Operator pods come up
kubectl get pods -n gpu-operator -w
# You should see these pods running:
# - gpu-operator (controller)
# - nvidia-container-toolkit-daemonset
# - nvidia-device-plugin-daemonset
# - nvidia-mig-manager
# - gpu-feature-discovery
# - nvidia-operator-validator
The mig.strategy=mixed setting enables mixed MIG partitioning: each profile (1g.5gb, 2g.10gb, etc.) is advertised as its own extended resource, and the GPU Operator creates the corresponding instances from available GPU slices as pods request them.
Cost Note: A100 GPUs are expensive (~$3-4/hour per node on GKE). This demo only needs to run for a few minutes - remember to delete the cluster when you're done.
Verify the GPU Operator is configured correctly:
# Check all GPU Operator pods are running
kubectl get pods -n gpu-operator
# Verify MIG manager is active and configured for mixed strategy
kubectl get nodes -o jsonpath='{.items[0].metadata.labels}' | jq . | grep mig
# Check node GPU capacity (will show nvidia.com/gpu: 1 initially)
kubectl get nodes -o json | jq '.items[].status.allocatable | with_entries(select(.key | contains("nvidia")))'
Finally, deploy the Kestrel Operator to your cluster so it can detect the fragmentation incident.
Architecture Overview
The Helm chart deploys three components that create the fragmentation scenario:
values.yaml - Demo Configuration
# MIG Fragmentation Demo Configuration
# This creates a fragmentation scenario on A100 GPUs
namespace: mig-demo
# Phase 1: Deploy inference workloads that consume small MIG slices
inference:
  enabled: true
  replicas: 4
  resources:
    requests:
      nvidia.com/mig-1g.5gb: 1   # Small slices scattered across GPUs
  image: nvcr.io/nvidia/pytorch:24.01-py3
  command: ["python", "-c", "import torch; torch.cuda.is_available(); import time; time.sleep(86400)"]
# Phase 2: Deploy a training workload that needs a medium MIG slice
# This will get stuck due to fragmentation
training:
  enabled: true
  replicas: 1
  resources:
    requests:
      nvidia.com/mig-2g.10gb: 1  # Medium slice - will fail to schedule
  image: nvcr.io/nvidia/pytorch:24.01-py3
  command: ["python", "-c", "print('Training started'); import time; time.sleep(86400)"]
# Phase 3: Deploy more inference to maximize fragmentation
fragmenter:
  enabled: true
  replicas: 2
  resources:
    requests:
      nvidia.com/mig-1g.5gb: 1
  image: nvcr.io/nvidia/pytorch:24.01-py3
  command: ["sleep", "infinity"]
The chart deploys workloads in phases to create realistic fragmentation:
- Phase 1: Four inference pods claim 1g.5gb MIG slices, scattering small allocations across available GPUs
- Phase 2: A training job requests 2g.10gb - but no GPU has 2 contiguous slices free
- Phase 3: More inference pods arrive, consuming any remaining small slices and worsening fragmentation
┌────────────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster                                                  │
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │ GPU Node Pool                                                 │ │
│  │                                                               │ │
│  │  ┌──────────────────────────┐   ┌──────────────────────────┐ │ │
│  │  │ gpu-node-1 (A100)        │   │ gpu-node-2 (A100)        │ │ │
│  │  │                          │   │                          │ │ │
│  │  │ MIG Configuration:       │   │ MIG Configuration:       │ │ │
│  │  │ ┌───┬───┬───┬───┐        │   │ ┌───┬───┬───┬───┐        │ │ │
│  │  │ │1g │1g │ - │1g │        │   │ │1g │ - │1g │ - │        │ │ │
│  │  │ │5gb│5gb│   │5gb│        │   │ │5gb│   │5gb│   │        │ │ │
│  │  │ └───┴───┴───┴───┘        │   │ └───┴───┴───┴───┘        │ │ │
│  │  │   ↑   ↑       ↑          │   │   ↑       ↑              │ │ │
│  │  │ inf-1 inf-2  inf-3       │   │ inf-4    frag-1          │ │ │
│  │  │                          │   │                          │ │ │
│  │  │ Free: 4 non-contiguous   │   │ Free: 5 non-contiguous   │ │ │
│  │  │       slices             │   │       slices             │ │ │
│  │  └──────────────────────────┘   └──────────────────────────┘ │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │ Pending Pods                                                  │ │
│  │                                                               │ │
│  │  training-job-0: Pending                                      │ │
│  │   └─ Requests: nvidia.com/mig-2g.10gb: 1                      │ │
│  │   └─ Message: "0/2 nodes available: insufficient              │ │
│  │       nvidia.com/mig-2g.10gb"                                 │ │
│  │                                                               │ │
│  │  NOTE: Both nodes have ~25GB GPU memory "free" but no         │ │
│  │        2g.10gb instance can be created due to                 │ │
│  │        non-contiguous slices                                  │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │ Kestrel Operator                                              │ │
│  │                                                               │ │
│  │  Monitors: Pod scheduling failures, MIG resource allocation   │ │
│  │  Streams:  Events + Pod status to Kestrel Cloud               │ │
│  └───────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
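To reproduce this view from kubectl, check where the small slices landed and what each node still advertises as allocatable (assumes the demo namespace and that jq is installed):
# Which node did each 1g.5gb pod land on?
kubectl get pods -n mig-demo -o wide
# What MIG resources does each node still advertise?
kubectl get nodes -o json | jq '.items[] | {node: .metadata.name, mig: (.status.allocatable | with_entries(select(.key | startswith("nvidia.com/mig"))))}'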
The Helm Chart
Let's walk through the key templates that create this scenario:
Inference Deployment (Fragment Creators)
These pods consume small 1g.5gb MIG slices and stay running indefinitely, simulating always-on inference services:
templates/inference-deployment.yaml
{{- if .Values.inference.enabled }}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-inference
  namespace: {{ .Values.namespace }}
  labels:
    app: mig-inference
    demo: mig-fragmentation
spec:
  replicas: {{ .Values.inference.replicas }}
  selector:
    matchLabels:
      app: mig-inference
  template:
    metadata:
      labels:
        app: mig-inference
        workload-type: inference
        priority: low
    spec:
      # Use pod anti-affinity to spread inference pods across GPU nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: mig-inference
                topologyKey: "kubernetes.io/hostname"
      containers:
        - name: inference
          image: {{ .Values.inference.image }}
          command:
            {{- range .Values.inference.command }}
            - {{ . | quote }}
            {{- end }}
          resources:
            requests:
              nvidia.com/mig-1g.5gb: {{ .Values.inference.resources.requests | dig "nvidia.com/mig-1g.5gb" 1 }}
            limits:
              nvidia.com/mig-1g.5gb: {{ .Values.inference.resources.requests | dig "nvidia.com/mig-1g.5gb" 1 }}
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
{{- end }}
Training Job (The Victim)
This Job requests a 2g.10gb MIG instance - the profile that will fail due to fragmentation:
templates/training-job.yaml
{{- if .Values.training.enabled }}
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-training
  namespace: {{ .Values.namespace }}
  labels:
    app: mig-training
    demo: mig-fragmentation
spec:
  backoffLimit: 0  # Don't retry - we want to see the scheduling failure
  template:
    metadata:
      labels:
        app: mig-training
        workload-type: training
        priority: high
    spec:
      restartPolicy: Never
      containers:
        - name: training
          image: {{ .Values.training.image }}
          command:
            {{- range .Values.training.command }}
            - {{ . | quote }}
            {{- end }}
          resources:
            requests:
              # This MIG profile requires 2 contiguous GPU slices
              # Fragmentation from 1g.5gb pods will prevent scheduling
              nvidia.com/mig-2g.10gb: {{ .Values.training.resources.requests | dig "nvidia.com/mig-2g.10gb" 1 }}
            limits:
              nvidia.com/mig-2g.10gb: {{ .Values.training.resources.requests | dig "nvidia.com/mig-2g.10gb" 1 }}
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
{{- end }}
What Kestrel Sees
Within a minute of the training job failing to schedule, Kestrel detects the incident:
- The pod status monitor sees the training job stuck in the Pending state
- The event monitor captures the FailedScheduling event with reason "insufficient nvidia.com/mig-2g.10gb" (you can reproduce this check manually - see the commands after this list)
- The RCA agent investigates by querying node allocatable resources and pod resource claims
- Kestrel correlates: MIG resources exist, but in fragmented, non-contiguous slices
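You can pull up the same raw evidence Kestrel starts from with two kubectl queries (namespace assumes the demo install):
# The training pod stuck in Pending
kubectl get pods -n mig-demo -l app=mig-training
# The FailedScheduling event and its message
kubectl get events -n mig-demo --field-selector reason=FailedScheduling --sort-by=.lastTimestamp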
The Investigation
Kestrel's RCA agent automatically investigates the scheduling failure by gathering evidence from multiple sources:
- Pod events: Captures the FailedScheduling event and extracts the specific resource that's insufficient (nvidia.com/mig-2g.10gb)
- Node capacity: Queries each node's allocatable MIG resources to understand what profiles are theoretically available
- Current allocations: Maps which pods are consuming which MIG slices on which nodes
- Fragmentation analysis: Calculates whether the "available" GPU memory exists as contiguous slices or is fragmented across non-adjacent positions
The RCA agent correlates this data to determine the root cause: while raw GPU memory is available, the existing 1g.5gb allocations have fragmented the GPU in a way that prevents creating a contiguous 2g.10gb instance.
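To spot-check that conclusion yourself, compare what each node advertises for the two profiles involved (the column names are just for readability; adjust the profile names if your cluster uses different ones):
# Allocatable MIG devices per node for the profiles used in this demo
kubectl get nodes -o custom-columns='NODE:.metadata.name,MIG-1g.5gb:.status.allocatable.nvidia\.com/mig-1g\.5gb,MIG-2g.10gb:.status.allocatable.nvidia\.com/mig-2g\.10gb'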
The Fix
Kestrel analyzes the fragmentation scenario and generates targeted fix recommendations. In this case, it identifies the most efficient solution: adjusting the training job's resource request to match available MIG slices.
Recommended: Adjust the Training Job's MIG Profile
Kestrel's recommended fix is to modify the training job's MIG profile request. This is the least disruptive option - it doesn't evict running workloads or require node maintenance. Kestrel recognizes that this training workload fits on a smaller MIG slice, and the change lets it schedule immediately on available resources:
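A minimal sketch of applying that change by hand is below - Job pod templates are immutable, so the pending Job is deleted and re-created with the smaller profile (the Job name assumes the chart was installed with release name mig-demo; in practice you'd change the profile in values.yaml and upgrade the release):
# Remove the stuck Job, then re-create it requesting a 1g.5gb slice instead of 2g.10gb
kubectl delete job mig-demo-training -n mig-demo
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: mig-demo-training
  namespace: mig-demo
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: training
          image: nvcr.io/nvidia/pytorch:24.01-py3
          command: ["python", "-c", "print('Training started'); import time; time.sleep(86400)"]
          resources:
            requests:
              nvidia.com/mig-1g.5gb: 1   # was: nvidia.com/mig-2g.10gb
            limits:
              nvidia.com/mig-1g.5gb: 1
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
EOF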
This fix works because 1g.5gb slices are available (the same profile the inference pods use), and a single-slice allocation doesn't require contiguous space. The training job can start immediately without waiting for other workloads to be evicted.
Alternative Approaches
Depending on your workload requirements, other remediation paths may be appropriate:
Evict Inference Pods
If the training job required a 2g.10gb profile (e.g., the model didn't fit in 5GB), you can evict inference pods to free contiguous slices. However, this disrupts running inference workloads:
Evict inference pods to free contiguous slices
# Evict 2 inference pods to free contiguous slices
kubectl get pods -n mig-demo -l app=mig-inference -o name | head -2 | xargs kubectl delete -n mig-demo
# Watch the training job transition to Running (30-60 seconds)
kubectl get pods -n mig-demo -w
After eviction, the MIG Manager automatically reconfigures the GPU - destroying the freed 1g.5gb instances and creating a 2g.10gb partition from the contiguous slices. The training job schedules within 30-60 seconds.
You can automate this with PriorityClasses so training jobs automatically preempt lower-priority inference:
priorityclass.yaml - Enable automatic preemption
# High priority for training jobs
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-training-high
value: 1000000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "High priority for GPU training jobs - can preempt inference"
---
# Low priority for inference pods
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-inference-low
value: 100000
preemptionPolicy: Never
globalDefault: false
description: "Low priority for inference - can be preempted by training"
Reconfigure MIG Profiles on the Node
For persistent fragmentation issues, you can reconfigure the GPU's MIG partitioning entirely. This is the most heavyweight option - it requires draining the node and causes downtime for all GPU workloads on that node:
MIG reconfiguration steps
# Step 1: Drain the node (evicts ALL pods)
kubectl drain gpu-node-1 --ignore-daemonsets --delete-emptydir-data
# Step 2: SSH to the node and reconfigure MIG (or use the GPU Operator's config)
nvidia-smi mig -dci   # Destroy all compute instances first
nvidia-smi mig -dgi   # Then destroy all GPU instances
# Step 3: Create a balanced MIG configuration
nvidia-smi mig -cgi 9,14,14 -C   # Create 3g.20gb + 2g.10gb + 2g.10gb (and their compute instances)
# Step 4: Uncordon the node
kubectl uncordon gpu-node-1
This approach makes sense when your MIG configuration fundamentally doesn't match your workload mix - for example, if you've been running all 1g.5gb profiles but now need regular access to larger slices. For teams using the NVIDIA GPU Operator, you can manage this declaratively through the MIG Manager's mig-parted configuration and the nvidia.com/mig.config node label instead of running nvidia-smi by hand.
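With the GPU Operator, that declarative route looks roughly like this: add a custom mig-parted config entry and point the node's nvidia.com/mig.config label at it (the config and ConfigMap names here are illustrative, and the exact Helm value for wiring in a custom config can vary between operator versions - check the MIG Manager docs for yours):
# A custom mig-parted configuration with a mixed "balanced" layout
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      balanced:
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.20gb": 1
            "2g.10gb": 2
EOF
# Point the MIG Manager at the custom config, then select it per node
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values \
  --set migManager.config.name=custom-mig-config
kubectl label node gpu-node-1 nvidia.com/mig.config=balanced --overwrite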
Learning from Tribal Knowledge
MIG configuration is notoriously underdocumented within organizations. That Slack thread where someone figured out the optimal MIG split for your workload mix? The Confluence page explaining why you settled on 3g.20gb profiles for training? Kestrel searches your connected knowledge sources to surface this context alongside its automated analysis.
When investigating a MIG fragmentation incident, Kestrel might surface:
- A past incident where the team resolved fragmentation by adjusting the node's MIG profile mix
- Documentation explaining which MIG profiles work best for your specific model sizes
- A Jira ticket noting that inference pods should be scheduled with preemptionPolicy: Never
Best Practices for MIG Clusters
Based on patterns Kestrel has observed across GPU clusters:
- Use PriorityClasses: Define clear priorities between training and inference workloads. Training jobs with deadlines should preempt inference pods, not wait indefinitely.
- Right-size your MIG profiles: If most of your training needs 2g.10gb, don't configure nodes with only 1g.5gb profiles. Match your MIG configuration to your workload mix.
- Consider dedicated node pools: Separate inference-only nodes (many small MIG slices) from training nodes (fewer large slices) to prevent fragmentation conflicts.
- Monitor MIG utilization, not just GPU utilization: Standard GPU metrics don't show fragmentation. Track allocatable vs. allocated MIG resources per profile - see the sketch after this list.
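A quick way to track this without extra tooling is to compare each GPU node's allocatable MIG resources against what its pods currently request (the nvidia.com/gpu.present label is set by GPU Feature Discovery; adjust the selector if your nodes are labeled differently):
for NODE in $(kubectl get nodes -l nvidia.com/gpu.present=true -o name); do
  echo "== ${NODE} =="
  # Allocatable MIG devices per profile
  kubectl get "${NODE}" -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/mig")))'
  # MIG devices requested by pods on this node (plus capacity/allocatable lines)
  kubectl describe "${NODE}" | grep "nvidia.com/mig" || echo "  (no MIG resources reported)"
done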
Try It Yourself
Want to see Kestrel detect and resolve MIG fragmentation in real-time? Here's how:
- Clone the demo repository:
  git clone https://github.com/KestrelAI/Demos.git
- Ensure your cluster has:
  - NVIDIA GPU Operator with MIG enabled
  - At least one A100 or H100 node with MIG mode active
  - Kestrel Operator deployed (installation guide)
- Deploy the demo:
  cd Demos/mig-fragmentation-demo
  helm install mig-demo ./chart --namespace mig-demo --create-namespace
- Watch the training job get stuck:
  kubectl get pods -n mig-demo -w
- See Kestrel detect the incident and generate fixes in your dashboard
Start Your Free Trial
Get 2 weeks free to test Kestrel with your own GPU infrastructure. Detect MIG fragmentation before it blocks your ML pipelines.
Register for Free Trial