Your A100s Show 40GB Free, But Training Jobs Won't Schedule. Here's Why.
MIG fragmentation is the silent killer of GPU cluster utilization. You have capacity on paper, but your training jobs are stuck in Pending. Kestrel detects this invisible problem and shows you exactly how to fix it.
The Hidden Cost of GPU Sharing
NVIDIA's Multi-Instance GPU (MIG) technology transformed how organizations share expensive A100 and H100 GPUs. Instead of dedicating an entire 40GB A100 to a small inference workload, you can slice it into up to 7 isolated instances - each with its own compute, memory, and cache.
But MIG introduces a problem that doesn't exist with traditional GPU allocation: fragmentation. Unlike CPU or memory, MIG partitions follow strict geometric constraints. Not every combination of slices is valid, and the slices you create affect which slices you can create next.
The result? Your monitoring dashboard shows 40GB of GPU memory "available," but your training job requesting a 2g.10gb MIG instance sits in Pending indefinitely. The memory exists - it's just in the wrong shape.
Understanding MIG Geometry
An A100 40GB GPU can be partitioned into MIG instances following specific profiles. Each profile name indicates compute and memory: 1g.5gb means 1 GPU slice with ~5GB memory, 2g.10gb means 2 GPU slices with ~10GB memory.
A100 40GB MIG Profiles
Profile     GPU Slices    Memory    Max Instances per GPU
─────────────────────────────────────────────────────────────
1g.5gb      1/7           5GB       7
2g.10gb     2/7           10GB      3
3g.20gb     3/7           20GB      2
4g.20gb     4/7           20GB      1
7g.40gb     7/7           40GB      1
The critical constraint: MIG instances must be created from contiguous GPU slices. You can't combine slices 0-1 with slices 5-6 to make a 4g.20gb. This is where fragmentation happens.
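You can see these geometry rules straight from the driver on a MIG-enabled node - nvidia-smi lists the supported GPU instance profiles and the slice positions each profile is allowed to occupy (output varies by GPU model and driver version):
# List the GPU instance profiles this GPU supports (profile IDs, slice counts, memory)
nvidia-smi mig -lgip
# List the possible placements for each profile, i.e. which contiguous
# slice positions a new instance may occupy
nvidia-smi mig -lgipp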
Example: How fragmentation occurs on a single A100
INITIAL STATE: Full 40GB A100 available
┌─────────────────────────────────────────────────────────────────────────┐
│ [ Slice 0-6: Unpartitioned 7g.40gb available ] │
└─────────────────────────────────────────────────────────────────────────┘
AFTER: Two inference pods request 1g.5gb each
┌─────────────────────────────────────────────────────────────────────────┐
│ [1g.5gb] [1g.5gb] [ 5 slices FREE ] │
│ Pod A Pod B (~25GB available) │
└─────────────────────────────────────────────────────────────────────────┘
AFTER: Training pod requests 3g.20gb - SUCCEEDS (contiguous slices 2-4)
┌─────────────────────────────────────────────────────────────────────────┐
│ [1g.5gb] [1g.5gb] [ 3g.20gb ] [ 2 slices FREE ] │
│ Pod A Pod B Training C (10GB available) │
└─────────────────────────────────────────────────────────────────────────┘
AFTER: Another inference pod requests 1g.5gb
┌─────────────────────────────────────────────────────────────────────────┐
│ [1g.5gb] [1g.5gb] [ 3g.20gb ] [1g.5gb] [1 slice] │
│ Pod A Pod B Training C Pod D FREE │
└─────────────────────────────────────────────────────────────────────────┘
NOW: New training job requests 2g.10gb - FAILS!
┌─────────────────────────────────────────────────────────────────────────┐
│ [1g.5gb] [1g.5gb] [ 3g.20gb ] [1g.5gb] [1 slice] │
│ Pod A Pod B Training C Pod D FREE │
│ ↑ │
│ Only 1 contiguous slice! │
│ Need 2 for 2g.10gb │
└─────────────────────────────────────────────────────────────────────────┘
RESULT: 5GB "free" but 2g.10gb (10GB) cannot be allocated
The lone free slice has no free neighbor, so nothing larger than a 1g.5gb can ever be created from it.
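On a real node you can confirm this kind of fragmentation from the driver itself - the placement column shows where each existing instance starts and how many slices it spans (assumes shell access to the GPU node):
# List existing GPU instances with their placement (start slice and size);
# the gaps between placements are the fragmentation
nvidia-smi mig -lgi
# List the compute instances running inside each GPU instance
nvidia-smi mig -lci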
Demo Setup
We've created a Helm chart that demonstrates MIG fragmentation on any Kubernetes cluster with NVIDIA GPUs and the GPU Operator installed. The demo deploys a mix of inference and training workloads that create fragmentation, then shows how Kestrel detects and resolves it.
Clone this repo and follow along with your own GPU cluster.
Prerequisites
This demo requires NVIDIA A100 or H100 GPUs with MIG (Multi-Instance GPU) enabled. MIG is only supported on certain data-center GPUs (Ampere and newer, such as the A100, A30, and H100) - GPUs like the T4, V100, or L4 won't work.
Setting Up a GKE Cluster with MIG
To demonstrate MIG fragmentation, we need the NVIDIA GPU Operator with mixed MIG strategy. This allows different pods to request different MIG profiles, with the operator dynamically creating instances on-demand.
Important: GKE requires specific configuration for the GPU Operator due to Container-Optimized OS (COS) filesystem constraints. The configuration below has been tested and verified to work.
Step 1: Create GKE cluster with A100 GPUs
# Create a GKE Standard cluster with A100 GPUs
# Do NOT use --gpu-partition-size (we want dynamic MIG via GPU Operator)
gcloud container clusters create mig-demo-cluster \
  --zone us-central1-a \
  --machine-type a2-highgpu-1g \
  --accelerator type=nvidia-tesla-a100,count=1 \
  --num-nodes=1 \
  --release-channel rapid
Step 2: Create PriorityClass for GPU Operator
# GKE has ResourceQuota that blocks system-critical PriorityClasses
# Create a custom PriorityClass for GPU Operator components
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: nvidia-gpu-operator
value: 100000
globalDefault: false
description: "Priority for NVIDIA GPU Operator components"
EOF
Step 3: Install NVIDIA GPU Operator with GKE-specific settings
# Add NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Install GPU Operator with GKE-specific configuration
# Key settings:
# - driver.enabled=false: Use GKE's pre-installed NVIDIA drivers
# - toolkit.installDir: Use exec-capable path (COS mounts /var with noexec)
# - hostPaths.driverInstallDir: Where GKE installs NVIDIA drivers
# - Custom PriorityClass to avoid GKE ResourceQuota restrictions
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false \
  --set toolkit.installDir=/home/kubernetes/bin/nvidia-toolkit \
  --set hostPaths.driverInstallDir=/home/kubernetes/bin/nvidia \
  --set mig.strategy=mixed \
  --set migManager.enabled=true \
  --set migManager.env[0].name=MIG_PARTED_REBOOT_IF_REQUIRED \
  --set-string migManager.env[0].value=true \
  --set migManager.env[1].name=WITH_REBOOT \
  --set-string migManager.env[1].value=true \
  --set dcgmExporter.enabled=false \
  --set daemonsets.priorityClassName=nvidia-gpu-operator \
  --set operator.priorityClassName=nvidia-gpu-operator \
  --set node-feature-discovery.priorityClassName=nvidia-gpu-operator
Step 4: Enable MIG mode on the GPU node
# Wait for GPU Operator pods to be running
kubectl get pods -n gpu-operator -w
# Once nvidia-mig-manager is running, apply MIG configuration
# This creates 7x 1g.5gb MIG instances on the A100
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl label node $NODE nvidia.com/mig.config=all-1g.5gb --overwrite
# The MIG Manager will reboot the node to enable MIG mode
# Wait for the node to come back up (2-3 minutes)
kubectl get nodes -w
# Verify MIG is configured
kubectl get node $NODE -o jsonpath='{.metadata.labels}' | grep mig.config.state
# Should show: "nvidia.com/mig.config.state":"success"
Why these settings?
- driver.enabled=false - GKE pre-installs NVIDIA drivers on GPU nodes
- toolkit.installDir=/home/kubernetes/bin/nvidia-toolkit - COS mounts /var with noexec; this path is exec-capable
- hostPaths.driverInstallDir - Points to GKE's driver installation location
- MIG_PARTED_REBOOT_IF_REQUIRED=true - Allows the MIG Manager to reboot the node to enable MIG mode
- dcgmExporter.enabled=false - DCGM profiling fails on GKE and isn't needed for MIG
- Custom PriorityClass - GKE's ResourceQuota blocks pods that use the system-node-critical PriorityClass
Wait for all GPU Operator pods to be ready (this may take 2-3 minutes):
# Watch GPU Operator pods come up
kubectl get pods -n gpu-operator -w
# You should see these pods running:
# - gpu-operator (controller)
# - nvidia-container-toolkit-daemonset
# - nvidia-device-plugin-daemonset
# - nvidia-mig-manager
# - gpu-feature-discovery
# - nvidia-operator-validator
The mig.strategy=mixed setting enables mixed MIG partitioning: each profile (1g.5gb, 2g.10gb, etc.) is advertised as its own extended resource, and the GPU Operator creates the corresponding instances from available GPU slices as pods request them.
Cost Note: A100 GPUs are expensive (~$3-4/hour per node on GKE). This demo only needs to run for a few minutes - remember to delete the cluster when you're done.
Verify the GPU Operator is configured correctly:
# Check all GPU Operator pods are running
kubectl get pods -n gpu-operator
# Verify MIG manager is active and configured for mixed strategy
kubectl get nodes -o jsonpath='{.items[0].metadata.labels}' | jq . | grep mig
# Check node GPU capacity (will show nvidia.com/gpu: 1 initially)
kubectl get nodes -o json | jq '.items[].status.allocatable | with_entries(select(.key | contains("nvidia")))'
Finally, deploy the Kestrel Operator to your cluster so it can detect the fragmentation incident.
Architecture Overview
The Helm chart deploys three components that create the fragmentation scenario:
values.yaml - Demo Configuration
# MIG Fragmentation Demo Configuration
# This creates a fragmentation scenario on A100 GPUs
namespace: mig-demo
# Phase 1: Deploy inference workloads that consume small MIG slices
inference:
  enabled: true
  replicas: 4
  resources:
    requests:
      nvidia.com/mig-1g.5gb: 1   # Small slices scattered across GPUs
  image: nvcr.io/nvidia/pytorch:24.01-py3
  command: ["python", "-c", "import torch; torch.cuda.is_available(); import time; time.sleep(86400)"]
# Phase 2: Deploy a training workload that needs a medium MIG slice
# This will get stuck due to fragmentation
training:
  enabled: true
  replicas: 1
  resources:
    requests:
      nvidia.com/mig-2g.10gb: 1  # Medium slice - will fail to schedule
  image: nvcr.io/nvidia/pytorch:24.01-py3
  command: ["python", "-c", "print('Training started'); import time; time.sleep(86400)"]
# Phase 3: Deploy more inference to maximize fragmentation
fragmenter:
  enabled: true
  replicas: 2
  resources:
    requests:
      nvidia.com/mig-1g.5gb: 1
  image: nvcr.io/nvidia/pytorch:24.01-py3
  command: ["sleep", "infinity"]
The chart deploys workloads in phases to create realistic fragmentation:
- Phase 1: Four inference pods claim 1g.5gb MIG slices, scattering small allocations across available GPUs
- Phase 2: A training job requests 2g.10gb - but no GPU has 2 contiguous slices free
- Phase 3: More inference pods arrive, consuming any remaining small slices and worsening fragmentation
┌────────────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster                                                  │
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │ GPU Node Pool                                                 │ │
│  │                                                               │ │
│  │  ┌──────────────────────────┐   ┌──────────────────────────┐ │ │
│  │  │ gpu-node-1 (A100)        │   │ gpu-node-2 (A100)        │ │ │
│  │  │                          │   │                          │ │ │
│  │  │ MIG Configuration:       │   │ MIG Configuration:       │ │ │
│  │  │ ┌───┬───┬───┬───┐        │   │ ┌───┬───┬───┬───┐        │ │ │
│  │  │ │1g │1g │ - │1g │        │   │ │1g │ - │1g │ - │        │ │ │
│  │  │ │5gb│5gb│   │5gb│        │   │ │5gb│   │5gb│   │        │ │ │
│  │  │ └───┴───┴───┴───┘        │   │ └───┴───┴───┴───┘        │ │ │
│  │  │   ↑   ↑       ↑          │   │   ↑       ↑              │ │ │
│  │  │ inf-1 inf-2  inf-3       │   │ inf-4    frag-1          │ │ │
│  │  │                          │   │                          │ │ │
│  │  │ Free: 4 non-contiguous   │   │ Free: 5 non-contiguous   │ │ │
│  │  │       slices             │   │       slices             │ │ │
│  │  └──────────────────────────┘   └──────────────────────────┘ │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │ Pending Pods                                                  │ │
│  │                                                               │ │
│  │  training-job-0: Pending                                      │ │
│  │   └─ Requests: nvidia.com/mig-2g.10gb: 1                      │ │
│  │   └─ Message: "0/2 nodes available: insufficient              │ │
│  │       nvidia.com/mig-2g.10gb"                                 │ │
│  │                                                               │ │
│  │  NOTE: Both nodes have ~25GB GPU memory "free" but no         │ │
│  │        2g.10gb instance can be created due to                 │ │
│  │        non-contiguous slices                                  │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │ Kestrel Operator                                              │ │
│  │                                                               │ │
│  │  Monitors: Pod scheduling failures, MIG resource allocation   │ │
│  │  Streams:  Events + Pod status to Kestrel Cloud               │ │
│  └───────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
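To reproduce this view from kubectl, check where the small slices landed and what each node still advertises as allocatable (assumes the demo namespace and that jq is installed):
# Which node did each 1g.5gb pod land on?
kubectl get pods -n mig-demo -o wide
# What MIG resources does each node still advertise?
kubectl get nodes -o json | jq '.items[] | {node: .metadata.name, mig: (.status.allocatable | with_entries(select(.key | startswith("nvidia.com/mig"))))}'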
The Helm Chart
Let's walk through the key templates that create this scenario:
Inference Deployment (Fragment Creators)
These pods consume small 1g.5gb MIG slices and stay running indefinitely, simulating always-on inference services:
templates/inference-deployment.yaml
{{- if .Values.inference.enabled }}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-inference
  namespace: {{ .Values.namespace }}
  labels:
    app: mig-inference
    demo: mig-fragmentation
spec:
  replicas: {{ .Values.inference.replicas }}
  selector:
    matchLabels:
      app: mig-inference
  template:
    metadata:
      labels:
        app: mig-inference
        workload-type: inference
        priority: low
    spec:
      # Use pod anti-affinity to spread inference pods across GPU nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: mig-inference
                topologyKey: "kubernetes.io/hostname"
      containers:
        - name: inference
          image: {{ .Values.inference.image }}
          command:
            {{- range .Values.inference.command }}
            - {{ . | quote }}
            {{- end }}
          resources:
            requests:
              nvidia.com/mig-1g.5gb: {{ .Values.inference.resources.requests | dig "nvidia.com/mig-1g.5gb" 1 }}
            limits:
              nvidia.com/mig-1g.5gb: {{ .Values.inference.resources.requests | dig "nvidia.com/mig-1g.5gb" 1 }}
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
{{- end }}
Training Job (The Victim)
This Job requests a 2g.10gb MIG instance - the profile that will fail due to fragmentation:
templates/training-job.yaml
{{- if .Values.training.enabled }}
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-training
  namespace: {{ .Values.namespace }}
  labels:
    app: mig-training
    demo: mig-fragmentation
spec:
  backoffLimit: 0  # Don't retry - we want to see the scheduling failure
  template:
    metadata:
      labels:
        app: mig-training
        workload-type: training
        priority: high
    spec:
      restartPolicy: Never
      containers:
        - name: training
          image: {{ .Values.training.image }}
          command:
            {{- range .Values.training.command }}
            - {{ . | quote }}
            {{- end }}
          resources:
            requests:
              # This MIG profile requires 2 contiguous GPU slices
              # Fragmentation from 1g.5gb pods will prevent scheduling
              nvidia.com/mig-2g.10gb: {{ .Values.training.resources.requests | dig "nvidia.com/mig-2g.10gb" 1 }}
            limits:
              nvidia.com/mig-2g.10gb: {{ .Values.training.resources.requests | dig "nvidia.com/mig-2g.10gb" 1 }}
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
{{- end }}
What Kestrel Sees
Within a minute of the training job failing to schedule, Kestrel detects the incident:
- The pod status monitor sees the training job stuck in the Pending state
- The event monitor captures the FailedScheduling event with reason "insufficient nvidia.com/mig-2g.10gb" (you can reproduce this check manually - see the commands after this list)
- The RCA agent investigates by querying node allocatable resources and pod resource claims
- Kestrel correlates: MIG resources exist, but in fragmented, non-contiguous slices
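You can pull up the same raw evidence Kestrel starts from with two kubectl queries (namespace assumes the demo install):
# The training pod stuck in Pending
kubectl get pods -n mig-demo -l app=mig-training
# The FailedScheduling event and its message
kubectl get events -n mig-demo --field-selector reason=FailedScheduling --sort-by=.lastTimestamp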
The Investigation
Kestrel's RCA agent automatically investigates the scheduling failure by gathering evidence from multiple sources:
- Pod events: Captures the FailedScheduling event and extracts the specific resource that's insufficient (nvidia.com/mig-2g.10gb)
- Node capacity: Queries each node's allocatable MIG resources to understand what profiles are theoretically available
- Current allocations: Maps which pods are consuming which MIG slices on which nodes
- Fragmentation analysis: Calculates whether the "available" GPU memory exists as contiguous slices or is fragmented across non-adjacent positions
The RCA agent correlates this data to determine the root cause: while raw GPU memory is available, the existing 1g.5gb allocations have fragmented the GPU in a way that prevents creating a contiguous 2g.10gb instance.
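To spot-check that conclusion yourself, compare what each node advertises for the two profiles involved (the column names are just for readability; adjust the profile names if your cluster uses different ones):
# Allocatable MIG devices per node for the profiles used in this demo
kubectl get nodes -o custom-columns='NODE:.metadata.name,MIG-1g.5gb:.status.allocatable.nvidia\.com/mig-1g\.5gb,MIG-2g.10gb:.status.allocatable.nvidia\.com/mig-2g\.10gb'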
The Fix
Kestrel analyzes the fragmentation scenario and generates targeted fix recommendations. In this case, it identifies the most efficient solution: adjusting the training job's resource request to match available MIG slices.
Recommended: Adjust the Training Job's MIG Profile
Kestrel's recommended fix is to modify the training job's MIG profile request. This is the least disruptive option - it doesn't evict running workloads or require node maintenance. Kestrel recognizes that this training workload fits on a smaller MIG slice, and the change lets it schedule immediately on available resources:
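A minimal sketch of applying that change by hand is below - Job pod templates are immutable, so the pending Job is deleted and re-created with the smaller profile (the Job name assumes the chart was installed with release name mig-demo; in practice you'd change the profile in values.yaml and upgrade the release):
# Remove the stuck Job, then re-create it requesting a 1g.5gb slice instead of 2g.10gb
kubectl delete job mig-demo-training -n mig-demo
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: mig-demo-training
  namespace: mig-demo
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: training
          image: nvcr.io/nvidia/pytorch:24.01-py3
          command: ["python", "-c", "print('Training started'); import time; time.sleep(86400)"]
          resources:
            requests:
              nvidia.com/mig-1g.5gb: 1   # was: nvidia.com/mig-2g.10gb
            limits:
              nvidia.com/mig-1g.5gb: 1
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
EOF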
This fix works because 1g.5gb slices are available (the same profile the inference pods use), and a single-slice allocation doesn't require contiguous space. The training job can start immediately without waiting for other workloads to be evicted.
Alternative Approaches
Depending on your workload requirements, other remediation paths may be appropriate:
Evict Inference Pods
If the training job required a 2g.10gb profile (e.g., the model didn't fit in 5GB), you can evict inference pods to free contiguous slices. However, this disrupts running inference workloads:
Evict inference pods to free contiguous slices
# Evict 2 inference pods to free contiguous slices
kubectl get pods -n mig-demo -l app=mig-inference -o name | head -2 | xargs kubectl delete -n mig-demo
# Watch the training job transition to Running (30-60 seconds)
kubectl get pods -n mig-demo -w
After eviction, the MIG Manager automatically reconfigures the GPU - destroying the freed 1g.5gb instances and creating a 2g.10gb partition from the contiguous slices. The training job schedules within 30-60 seconds.
You can automate this with PriorityClasses so training jobs automatically preempt lower-priority inference:
priorityclass.yaml - Enable automatic preemption
# High priority for training jobs
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-training-high
value: 1000000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "High priority for GPU training jobs - can preempt inference"
---
# Low priority for inference pods
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-inference-low
value: 100000
preemptionPolicy: Never
globalDefault: false
description: "Low priority for inference - can be preempted by training"
Reconfigure MIG Profiles on the Node
For persistent fragmentation issues, you can reconfigure the GPU's MIG partitioning entirely. This is the most heavyweight option - it requires draining the node and causes downtime for all GPU workloads on that node:
MIG reconfiguration steps
# Step 1: Drain the node (evicts ALL pods)
kubectl drain gpu-node-1 --ignore-daemonsets --delete-emptydir-data
# Step 2: SSH to the node and reconfigure MIG (or use the GPU Operator's config)
nvidia-smi mig -dci   # Destroy all compute instances first
nvidia-smi mig -dgi   # Then destroy all GPU instances
# Step 3: Create a balanced MIG configuration
nvidia-smi mig -cgi 9,14,14 -C   # Create 3g.20gb + 2g.10gb + 2g.10gb (and their compute instances)
# Step 4: Uncordon the node
kubectl uncordon gpu-node-1
This approach makes sense when your MIG configuration fundamentally doesn't match your workload mix - for example, if you've been running all 1g.5gb profiles but now need regular access to larger slices. For teams using the NVIDIA GPU Operator, you can manage this declaratively through the MIG Manager's mig-parted configuration and the nvidia.com/mig.config node label instead of running nvidia-smi by hand.
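With the GPU Operator, that declarative route looks roughly like this: add a custom mig-parted config entry and point the node's nvidia.com/mig.config label at it (the config and ConfigMap names here are illustrative, and the exact Helm value for wiring in a custom config can vary between operator versions - check the MIG Manager docs for yours):
# A custom mig-parted configuration with a mixed "balanced" layout
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      balanced:
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.20gb": 1
            "2g.10gb": 2
EOF
# Point the MIG Manager at the custom config, then select it per node
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values \
  --set migManager.config.name=custom-mig-config
kubectl label node gpu-node-1 nvidia.com/mig.config=balanced --overwrite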
Learning from Tribal Knowledge
MIG configuration is notoriously underdocumented within organizations. That Slack thread where someone figured out the optimal MIG split for your workload mix? The Confluence page explaining why you settled on 3g.20gb profiles for training? Kestrel searches your connected knowledge sources to surface this context alongside its automated analysis.
When investigating a MIG fragmentation incident, Kestrel might surface:
- A past incident where the team resolved fragmentation by adjusting the node's MIG profile mix
- Documentation explaining which MIG profiles work best for your specific model sizes
- A Jira ticket noting that inference pods should be scheduled with preemptionPolicy: Never
Best Practices for MIG Clusters
Based on patterns Kestrel has observed across GPU clusters:
- Use PriorityClasses: Define clear priorities between training and inference workloads. Training jobs with deadlines should preempt inference pods, not wait indefinitely.
- Right-size your MIG profiles: If most of your training needs 2g.10gb, don't configure nodes with only 1g.5gb profiles. Match your MIG configuration to your workload mix.
- Consider dedicated node pools: Separate inference-only nodes (many small MIG slices) from training nodes (fewer large slices) to prevent fragmentation conflicts.
- Monitor MIG utilization, not just GPU utilization: Standard GPU metrics don't show fragmentation. Track allocatable vs. allocated MIG resources per profile - see the sketch after this list.
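A quick way to track this without extra tooling is to compare each GPU node's allocatable MIG resources against what its pods currently request (the nvidia.com/gpu.present label is set by GPU Feature Discovery; adjust the selector if your nodes are labeled differently):
for NODE in $(kubectl get nodes -l nvidia.com/gpu.present=true -o name); do
  echo "== ${NODE} =="
  # Allocatable MIG devices per profile
  kubectl get "${NODE}" -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/mig")))'
  # MIG devices requested by pods on this node (plus capacity/allocatable lines)
  kubectl describe "${NODE}" | grep "nvidia.com/mig" || echo "  (no MIG resources reported)"
done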
Try It Yourself
Want to see Kestrel detect and resolve MIG fragmentation in real-time? Here's how:
- Clone the demo repository:
  git clone https://github.com/KestrelAI/Demos.git
- Ensure your cluster has:
  - NVIDIA GPU Operator with MIG enabled
  - At least one A100 or H100 node with MIG mode active
  - Kestrel Operator deployed (installation guide)
- Deploy the demo:
  cd Demos/mig-fragmentation-demo
  helm install mig-demo ./chart --namespace mig-demo --create-namespace
- Watch the training job get stuck:
  kubectl get pods -n mig-demo -w
- See Kestrel detect the incident and generate fixes in your dashboard
Start Your Free Trial
Get 2 weeks free to test Kestrel with your own GPU infrastructure. Detect MIG fragmentation before it blocks your ML pipelines.
Register for Free Trial