From Silicon to Container: The Complete Journey of GPU Provisioning in Kubernetes

A deep dive into how Kubernetes makes GPUs accessible to containers, from bare metal to CUDA applications

Introduction

When an application container requests a GPU in Kubernetes with a simple nvidia.com/gpu: 1 resource request/limit, an intricate dance of kernel drivers, container runtimes, device plugins, and orchestration layers springs into action. This journey from physical hardware to a running CUDA application involves multiple abstraction layers working in concert.

In this comprehensive guide, we’ll explore every layer of this stack—from PCIe device files to the emerging Container Device Interface (CDI) standard—revealing the elegant complexity that makes GPU-accelerated containerized workloads possible.

Layer 1: Hardware & Kernel Foundation
Layer 2: Container Runtime GPU Access
Layer 3: CUDA in Containers
Layer 4: Kubernetes GPU Scheduling
Layer 5: GPU Isolation & Visibility
Layer 6: Complete Flow Example
Layer 7: Advanced GPU Sharing
The Container Device Interface (CDI) Revolution
Dynamic Resource Allocation (DRA): Next-Generation GPU Scheduling
Conclusion

Layer 1: Hardware & Kernel Foundation

Physical GPU Access

At the most fundamental level, a GPU is a PCIe device connected to the host system. The Linux kernel communicates with it through a sophisticated driver stack.

GPU Driver Architecture

The NVIDIA driver (similar concepts apply to AMD and Intel) consists of several kernel modules:

nvidia.ko              # Core driver module
nvidia-uvm.ko          # Unified Memory module
nvidia-modeset.ko      # Display mode setting
nvidia-drm.ko          # Direct Rendering Manager

When loaded, these modules create device files in /dev/:

/dev/nvidia0           # First GPU device
/dev/nvidia1           # Second GPU device
/dev/nvidiactl         # Control device for driver management
/dev/nvidia-uvm        # Unified Virtual Memory device
/dev/nvidia-uvm-tools  # UVM debugging and profiling
/dev/nvidia-modeset    # Mode setting operations

These character devices provide the fundamental interface between userspace applications and GPU hardware.

Device File Permissions

Device files have specific ownership and permissions:

# ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Oct 23 09:00 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Oct 23 09:00 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Oct 23 09:00 /dev/nvidiactl
crw-rw-rw- 1 root root 509,   0 Oct 23 09:00 /dev/nvidia-uvm

The major number (195 for NVIDIA GPU devices; 509 here for UVM, whose major is assigned dynamically and can differ between hosts) tells the kernel which driver owns the device. The minor number (0 or 1 above) tells that driver which device instance an operation targets.
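To make the major/minor split concrete, here is a minimal Python sketch that decodes the device numbers of a character device. The helper names (device_numbers, describe_char_device) are made up for this illustration, and the paths in the demo loop are simply the ones listed above; on a machine without an NVIDIA driver the helper returns None.

```python
import os
import stat


def device_numbers(rdev):
    """Split a raw device ID (st_rdev) into a (major, minor) pair."""
    return os.major(rdev), os.minor(rdev)


def describe_char_device(path):
    """Return (major, minor) for a character device, or None if the
    path is missing or is not a character device."""
    try:
        st = os.stat(path)
    except OSError:
        return None
    if not stat.S_ISCHR(st.st_mode):
        return None
    return device_numbers(st.st_rdev)


if __name__ == "__main__":
    # Paths taken from the listing above; they exist only on GPU hosts.
    for path in ("/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"):
        print(path, describe_char_device(path))
```

Running this on a GPU node should print the same numbers that ls -l shows in the major and minor columns.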

Layer 2: Container Runtime GPU Access

The Container Isolation Challenge

Containers use Linux namespaces and cgroups to create isolated environments. By default, a container cannot access the host’s GPU devices because:

Private /dev: The container gets its own minimal /dev, populated by the runtime, rather than the host’s (Linux has no dedicated device namespace)
cgroups device controller: Controls which devices a process can access
Mount namespace: Container filesystem doesn’t include host device files

NVIDIA Container Toolkit: Bridging the Gap

The NVIDIA Container Toolkit (formerly nvidia-docker2) solves this problem by modifying the container creation process.

Component Architecture

┌─────────────────────────────────────────┐
│   Container Runtime (Docker/containerd) │
└──────────────┬──────────────────────────┘
               │
               ↓
┌──────────────────────────────────────────┐
│   nvidia-container-runtime               │
│   (OCI-compliant runtime wrapper)        │
└──────────────┬───────────────────────────┘
               │
               ↓
┌──────────────────────────────────────────┐
│   nvidia-container-runtime-hook          │
│   (Prestart hook)                        │
└──────────────┬───────────────────────────┘
               │
               ↓
┌──────────────────────────────────────────┐
│   nvidia-container-cli                   │
│   (Performs actual GPU provisioning)     │
└──────────────────────────────────────────┘

What Gets Mounted Into the Container

When a container requests GPU access, the NVIDIA Container Toolkit mounts:

Device Files:

/dev/nvidia0              # GPU device
/dev/nvidia1              # Additional GPUs
/dev/nvidiactl            # Control device
/dev/nvidia-uvm           # Unified Memory device
/dev/nvidia-uvm-tools     # UVM tools
/dev/nvidia-modeset       # Mode setting

Driver Libraries (from host):

/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.104.05
/usr/lib/x86_64-linux-gnu/libcuda.so.535.104.05
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.535.104.05
# ... and a few more

Utilities:

/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced

cgroups Device Permissions

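The toolkit also configures cgroups to allow device access. The entries below use the cgroup v1 devices.allow syntax; on cgroup v2 the same rules are enforced by an eBPF program attached to the container’s cgroup: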

# In the container's cgroup
devices.allow: c 195:* rwm    # Allow all NVIDIA devices (major 195)
devices.allow: c 195:255 rwm  # Allow nvidiactl
devices.allow: c 509:* rwm    # Allow nvidia-uvm devices (major 509)

The format c 195:* rwm means:

c: Character device
195: Major number (NVIDIA devices)
*: All minor numbers (all GPUs)
rwm: Read, write, and mknod permissions
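The rule format above is simple enough to parse by hand. The following Python sketch is one way to do it, written only to illustrate the syntax; parse_rule and rule_allows are hypothetical helpers, not part of any NVIDIA or kernel tooling.

```python
def parse_rule(rule):
    """Parse a cgroup v1 device rule such as 'c 195:* rwm' into a dict.
    A '*' major or minor is stored as None, meaning "any"."""
    dev_type, numbers, access = rule.split()
    major, minor = numbers.split(":")
    return {
        "type": dev_type,
        "major": None if major == "*" else int(major),
        "minor": None if minor == "*" else int(minor),
        "access": set(access),
    }


def rule_allows(rule, dev_type, major, minor, access):
    """True if one parsed rule covers the requested device and every
    requested access letter (r, w, m)."""
    return (
        rule["type"] in ("a", dev_type)
        and rule["major"] in (None, major)
        and rule["minor"] in (None, minor)
        and set(access) <= rule["access"]
    )


if __name__ == "__main__":
    rule = parse_rule("c 195:* rwm")
    print(rule_allows(rule, "c", 195, 0, "rw"))  # /dev/nvidia0
    print(rule_allows(rule, "c", 509, 0, "rw"))  # /dev/nvidia-uvm
```

The wildcard rule covers every minor under major 195, which is why the separate 195:255 entry for nvidiactl is redundant next to c 195:* but harmless.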

Layer 3: CUDA in Containers

Understanding the CUDA Stack

CUDA applications communicate with GPUs through a layered software stack:

┌──────────────────────────────┐
│   Your CUDA Application      │
│   (compiled with nvcc)       │
└─────────────┬────────────────┘
              │
              ↓
┌──────────────────────────────┐
│   CUDA Runtime API           │
│   (libcudart.so)             │
│   - cudaMalloc()             │
│   - cudaMemcpy()             │
│   - kernel<<<>>>()           │
└─────────────┬────────────────┘
              │
              ↓
┌──────────────────────────────┐
│   CUDA Driver API            │
│   (libcuda.so)               │
│   - cuMemAlloc()             │
│   - cuLaunchKernel()         │
└─────────────┬────────────────┘
              │
              ↓
┌──────────────────────────────┐
│   Kernel Driver              │
│   (nvidia.ko)                │
└─────────────┬────────────────┘
              │
              ↓
┌──────────────────────────────┐
│   Physical GPU Hardware      │
└──────────────────────────────┘

CUDA in a Containerized Environment

When a user runs a CUDA application inside a container, the call stack looks like:

[Container] CUDA Application
                ↓
[Container] libcudart.so (CUDA Runtime)
                ↓
[Mounted from Host] libcuda.so (CUDA Driver Library)
                ↓
[ioctl() system calls]
                ↓
[Mounted Device] /dev/nvidia0
                ↓
[Host Kernel] nvidia.ko driver
                ↓
[Physical Hardware] GPU

The Critical Driver Compatibility Requirement

Key Point: The libcuda.so driver library version must match the host kernel driver version. That is why it’s preferred to mount the driver library from the host rather than package it in the container image.

Example compatibility matrix:

Host Driver Version    Compatible CUDA Toolkit Versions
-------------------    --------------------------------
535.104.05            CUDA 11.0 - 12.2
525.85.12             CUDA 11.0 - 12.1
515.65.01             CUDA 11.0 - 11.8

The CUDA toolkit in the container must be compatible with the host’s driver version, but it doesn’t need to match exactly: newer drivers support older CUDA toolkits.
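The "newer drivers support older toolkits" rule boils down to a version comparison. The Python sketch below shows that check under a simplifying assumption: the driver's maximum supported CUDA version is known (for example, the "CUDA Version" field that nvidia-smi prints), and forward-compatibility packages and minor-version compatibility are ignored. The function names are hypothetical.

```python
def parse_version(text):
    """Turn '12.2' into (12, 2) so versions compare numerically,
    not as strings ('11.10' must be newer than '11.8')."""
    return tuple(int(part) for part in text.split("."))


def toolkit_runs_on_driver(toolkit_cuda, driver_max_cuda):
    """True if the container's CUDA toolkit is not newer than the
    highest CUDA version the host driver supports."""
    return parse_version(toolkit_cuda) <= parse_version(driver_max_cuda)


if __name__ == "__main__":
    # A CUDA 11.8 image on a host whose driver reports CUDA 12.2
    print(toolkit_runs_on_driver("11.8", "12.2"))
    # A CUDA 12.4 image on the same host
    print(toolkit_runs_on_driver("12.4", "12.2"))
```

A check like this could run in an init container or CI step to catch an image that is too new for the node pool before the workload fails at cuInit().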

A Simple CUDA Example

Here’s what happens when you run a basic CUDA program:

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    float *d_data;
    size_t size = 1024 * sizeof(float);
    
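    // This triggers the entire stack
    cudaError_t err = cudaMalloc(&d_data, size);

    if (err != cudaSuccess) {
        // Report why the allocation failed instead of exiting silently
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Successfully allocated %zu bytes on GPU\n", size);
    cudaFree(d_data);

    return 0;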
}

Behind the scenes:

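cudaMalloc() initializes the CUDA context if needed, then calls cuMemAlloc() in libcuda.so
libcuda.so opens /dev/nvidiactl, /dev/nvidia-uvm, and /dev/nvidia0
On each open(), the kernel’s cgroups device controller checks: “Is this process allowed to access this device (for example 195:0)?”
If allowed, libcuda.so issues ioctl() system calls on the open file descriptors (the request codes are private to the driver; a name such as NVIDIA_IOCTL_ALLOC_MEM is illustrative, not a real constant)
Kernel driver nvidia.ko receives the request
Driver allocates GPU memory
Returns device memory pointer to application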

Layer 4: Kubernetes GPU Scheduling

The Device Plugin Framework

Kubernetes uses an extensible Device Plugin system to manage specialized hardware like GPUs, FPGAs, and InfiniBand adapters.

Architecture Overview

┌────────────────────────────────────────┐
│   kube-apiserver                       │
│   (Node status: nvidia.com/gpu: 4)     │
└───────────────┬────────────────────────┘
                │
                ↓
┌────────────────────────────────────────┐
│   kube-scheduler                       │
│   (Finds nodes with requested GPUs)    │
└───────────────┬────────────────────────┘
                │
                ↓
┌────────────────────────────────────────┐
│   kubelet (on GPU node)                │
│   - Discovers device plugins           │
│   - Tracks GPU allocation              │
│   - Calls Allocate() for pods          │
└───────────────┬────────────────────────┘
                │
                ↓
┌────────────────────────────────────────┐
│   NVIDIA Device Plugin (DaemonSet)     │
│   - Discovers GPUs (NVML)              │
│   - Registers with kubelet             │
│   - Allocates GPUs to containers       │
└────────────────────────────────────────┘

Device Plugin Discovery and Registration

The NVIDIA Device Plugin runs as a DaemonSet on every GPU node:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
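  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds  # Must match spec.selector.matchLabels
    spec:
      containers:
      - name: nvidia-device-plugin
        image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin  # Host directory holding kubelet.sock and the plugin sockets
        hostPath:
          path: /var/lib/kubelet/device-plugins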

The Registration Process

Device Plugin Starts

nvidia-device-plugin container starts
           ↓
Enumerates GPUs through NVML (equivalent to: nvidia-smi --query-gpu=uuid --format=csv)
           ↓
Discovers: GPU-a4f8c2d1, GPU-b3e9d4f2, GPU-c8f1a5b3, GPU-d2c7e9a4

Registration with kubelet

Device plugin connects to: unix:///var/lib/kubelet/device-plugins/kubelet.sock
           ↓
Sends Register() gRPC call:
{
  "version": "v1beta1",
  "endpoint": "nvidia.sock",
  "resourceName": "nvidia.com/gpu"
}

Advertising Resources

kubelet calls ListAndWatch() on device plugin
           ↓
Device plugin responds:
{
  "devices": [
    {"id": "GPU-a4f8c2d1", "health": "Healthy"},
    {"id": "GPU-b3e9d4f2", "health": "Healthy"},
    {"id": "GPU-c8f1a5b3", "health": "Healthy"},
    {"id": "GPU-d2c7e9a4", "health": "Healthy"}
  ]
}
           ↓
kubelet updates node status:
status.capacity.nvidia.com/gpu: "4"
status.allocatable.nvidia.com/gpu: "4"
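How the device list turns into node status can be sketched in a few lines of Python. This is a simplified model, not kubelet source: it assumes capacity counts every device the plugin reports, while allocatable counts only the ones marked "Healthy". The function name count_devices is hypothetical.

```python
def count_devices(devices):
    """Return (capacity, allocatable) for a ListAndWatch-style device list.
    Assumption: capacity counts all reported devices, allocatable only
    the healthy ones."""
    capacity = len(devices)
    allocatable = sum(1 for device in devices if device["health"] == "Healthy")
    return capacity, allocatable


DEVICES = [
    {"id": "GPU-a4f8c2d1", "health": "Healthy"},
    {"id": "GPU-b3e9d4f2", "health": "Healthy"},
    {"id": "GPU-c8f1a5b3", "health": "Healthy"},
    {"id": "GPU-d2c7e9a4", "health": "Healthy"},
]

if __name__ == "__main__":
    capacity, allocatable = count_devices(DEVICES)
    print(f'status.capacity.nvidia.com/gpu: "{capacity}"')
    print(f'status.allocatable.nvidia.com/gpu: "{allocatable}"')
```

If the plugin later streams an update that marks one GPU "Unhealthy", the same function yields one fewer allocatable GPU, which is how a failing device drops out of scheduling.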

Pod Scheduling Flow

Let’s trace a complete pod scheduling workflow:

Step 1: User Creates Pod

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 2  # Request 2 GPUs

Step 2: Scheduler Filters and Scores

kube-scheduler receives unscheduled pod
         ↓
Filtering Phase:
  - node-1: cpu OK, memory OK, nvidia.com/gpu=0 ✗ (no GPUs)
  - node-2: cpu OK, memory OK, nvidia.com/gpu=2 ✓
  - node-3: cpu OK, memory OK, nvidia.com/gpu=4 ✓
  - node-4: cpu ✗ (insufficient CPU)
         ↓
Scoring Phase:
  - node-2: score 85 (2 GPUs available, high utilization)
  - node-3: score 92 (4 GPUs available, moderate utilization)
         ↓
Selected: node-3
         ↓
Binding: pod assigned to node-3
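The filter-then-score shape of this step can be sketched in Python. Everything here is illustrative: the real scheduler runs a chain of filter and score plugins, the score values are just the numbers from the trace above, and the free GPU count for node-4 is an assumption because the trace does not give one.

```python
def feasible_nodes(nodes, gpus_requested):
    """Filtering phase: keep nodes with enough CPU and enough free GPUs."""
    return [
        node for node in nodes
        if node["cpu_ok"] and node["free_gpus"] >= gpus_requested
    ]


def select_node(nodes, gpus_requested):
    """Scoring phase: pick the feasible node with the highest score.
    Returns None when no node passes filtering (the pod stays Pending)."""
    candidates = feasible_nodes(nodes, gpus_requested)
    if not candidates:
        return None
    return max(candidates, key=lambda node: node["score"])["name"]


NODES = [
    {"name": "node-1", "cpu_ok": True, "free_gpus": 0, "score": 0},
    {"name": "node-2", "cpu_ok": True, "free_gpus": 2, "score": 85},
    {"name": "node-3", "cpu_ok": True, "free_gpus": 4, "score": 92},
    {"name": "node-4", "cpu_ok": False, "free_gpus": 4, "score": 0},  # GPU count assumed
]

if __name__ == "__main__":
    print(select_node(NODES, gpus_requested=2))
```

Note that the scheduler only counts GPUs; which physical GPUs the pod receives is decided later by the kubelet on the chosen node.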

Step 3: kubelet Allocates GPUs

kubelet on node-3 receives pod assignment
         ↓
For container "cuda-container" requesting 2 GPUs:
         ↓
kubelet calls: DevicePlugin.Allocate(deviceIds=["GPU-a4f8c2d1", "GPU-b3e9d4f2"])
         ↓
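Device plugin responds (illustrative; with its default settings the NVIDIA plugin typically returns only the NVIDIA_VISIBLE_DEVICES variable and leaves device nodes and library mounts to the NVIDIA Container Toolkit):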
{
  "containerResponses": [{
    "envs": {
      "NVIDIA_VISIBLE_DEVICES": "GPU-a4f8c2d1,GPU-b3e9d4f2"
    },
    "mounts": [{
      "hostPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.535.104.05",
      "containerPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.1",
      "readOnly": true
    }],
    "devices": [{
      "hostPath": "/dev/nvidia0",
      "containerPath": "/dev/nvidia0",
      "permissions": "rwm"
    }, {
      "hostPath": "/dev/nvidia1",
      "containerPath": "/dev/nvidia1",
      "permissions": "rwm"
    }]
  }]
}

Step 4: Container Runtime Provisions GPU

kubelet → containerd: CreateContainer with:
  - Environment: NVIDIA_VISIBLE_DEVICES=GPU-a4f8c2d1,GPU-b3e9d4f2
  - Mounts: driver libraries
  - Devices: /dev/nvidia0, /dev/nvidia1
         ↓
containerd calls: nvidia-container-runtime-hook (prestart)
         ↓
Hook configures:
  - Mounts all required device files
  - Mounts NVIDIA libraries
  - Sets up cgroups device controller
  - Configures environment variables
         ↓
Container starts with GPU access
         ↓
nvidia-smi inside container shows 2 GPUs

Layer 5: GPU Isolation & Visibility

The Magic of NVIDIA_VISIBLE_DEVICES

The NVIDIA_VISIBLE_DEVICES environment variable is the key to GPU isolation in containers. It controls which GPUs the NVIDIA Container Toolkit exposes to a container, and therefore which GPUs CUDA applications inside it can see.

How It Works

Consider a host with 4 GPUs:

# On the host
$ nvidia-smi --query-gpu=index,uuid --format=csv
index, uuid
0, GPU-a4f8c2d1-e5f6-7a8b-9c0d-1e2f3a4b5c6d
1, GPU-b3e9d4f2-f6a7-8b9c-0d1e-2f3a4b5c6d7e
2, GPU-c8f1a5b3-a7b8-9c0d-1e2f-3a4b5c6d7e8f
3, GPU-d2c7e9a4-b8c9-0d1e-2f3a-4b5c6d7e8f9a

Container 1 configuration:

NVIDIA_VISIBLE_DEVICES=GPU-a4f8c2d1-e5f6-7a8b-9c0d-1e2f3a4b5c6d

# Inside container 1
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
+-------------------------------+----------------------+----------------------+

Container 2 configuration:

NVIDIA_VISIBLE_DEVICES=GPU-b3e9d4f2-f6a7-8b9c-0d1e-2f3a4b5c6d7e,GPU-c8f1a5b3-a7b8-9c0d-1e2f-3a4b5c6d7e8f

# Inside container 2
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1F.0 Off |                    0 |
|   1  Tesla V100-SXM2...  Off  | 00000000:00:20.0 Off |                    0 |
+-------------------------------+----------------------+----------------------+

Notice that:

Container 1 sees only 1 GPU (renumbered as GPU 0)
Container 2 sees 2 GPUs (renumbered as GPU 0 and 1)
Each container has its own isolated view of the GPUs
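The renumbering can be modeled as a small mapping from container-local index to host index. The Python sketch below assumes the selected GPUs keep their relative host order inside the container; container_view is a hypothetical helper, and the UUIDs are the example values from the host listing above.

```python
HOST_GPUS = [
    (0, "GPU-a4f8c2d1-e5f6-7a8b-9c0d-1e2f3a4b5c6d"),
    (1, "GPU-b3e9d4f2-f6a7-8b9c-0d1e-2f3a4b5c6d7e"),
    (2, "GPU-c8f1a5b3-a7b8-9c0d-1e2f-3a4b5c6d7e8f"),
    (3, "GPU-d2c7e9a4-b8c9-0d1e-2f3a-4b5c6d7e8f9a"),
]


def container_view(host_gpus, visible_devices):
    """Map container-local GPU index -> host GPU index for a
    comma-separated UUID list. Raises KeyError for an unknown UUID.
    Assumption: selected GPUs keep their relative host order."""
    index_by_uuid = {uuid: index for index, uuid in host_gpus}
    selected = sorted(
        index_by_uuid[uuid] for uuid in visible_devices.split(",") if uuid
    )
    return dict(enumerate(selected))


if __name__ == "__main__":
    container_2 = ",".join(uuid for _, uuid in HOST_GPUS[1:3])
    print(container_view(HOST_GPUS, container_2))
```

For Container 2 the mapping shows local GPU 0 and 1 backed by host GPU 1 and 2, which matches the two bus IDs in its nvidia-smi output.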

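Toolkit-Level Enforcement

NVIDIA_VISIBLE_DEVICES is consumed by the NVIDIA Container Toolkit when the container is created, not by the CUDA driver (the driver itself honors the separate CUDA_VISIBLE_DEVICES variable). At container creation, the toolkit:

Reads the NVIDIA_VISIBLE_DEVICES environment variable
Resolves each index or UUID to a physical GPU
Injects only those GPUs’ device nodes into the container

When a CUDA application then initializes:

cudaError_t err = cudaSetDevice(0);

the driver library enumerates only the GPUs it can reach, so device 0 is the first GPU that was injected. Conceptually (pseudocode, not the toolkit’s actual source):

container_create_hook() {
 visible_devices = container_env("NVIDIA_VISIBLE_DEVICES");

 if (visible_devices) {
     gpus = resolve_devices(visible_devices);  // indices or UUIDs → physical GPUs
     inject_device_nodes(gpus);                // container's "GPU 0" is the first of these
 }
}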

cgroups: Kernel-Level Protection

Environment variables provide application-level isolation, but cgroups enforce it at the kernel level.

For each container, cgroups device controller is configured:

Container 1:

# /sys/fs/cgroup/devices/kubepods/pod<uid>/<container-id>/devices.list
c 195:0 rwm      # Allow /dev/nvidia0 only
c 195:255 rwm    # Allow /dev/nvidiactl
c 509:0 rwm      # Allow /dev/nvidia-uvm

# Implicit deny for:
# c 195:1 (would be /dev/nvidia1)
# c 195:2 (would be /dev/nvidia2)
# c 195:3 (would be /dev/nvidia3)

Even if a malicious process inside Container 1 obtains a /dev/nvidia1 device node and tries to open it, the kernel blocks it:

// Malicious code attempt
int fd = open("/dev/nvidia1", O_RDWR);
// Returns: -1 (EPERM - Operation not permitted)
// Kernel: cgroups device controller denied access

This provides defense-in-depth: visibility filtering when the container is created (only the selected device nodes are injected) and kernel-level enforcement (cgroups).

Layer 6: Complete Flow Example

Let’s trace a complete end-to-end flow from pod creation to CUDA memory allocation.

The Scenario

We’ll deploy a pod requesting 2 GPUs and run a simple CUDA program that allocates GPU memory.

Step 1: Deploy the Pod

apiVersion: v1
kind: Pod
metadata:
  name: cuda-mem-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-app
    image: nvidia/cuda:11.8.0-devel-ubuntu22.04
    command: ["./cuda_malloc_test"]
    resources:
      limits:
        nvidia.com/gpu: 2

$ kubectl apply -f cuda-pod.yaml
pod/cuda-mem-test created

Step 2: Scheduler Assignment

kube-scheduler watches for unscheduled pods
         ↓
Finds cuda-mem-test pod (status.phase: Pending)
         ↓
Queries all nodes for available resources:
  node-gpu-01: nvidia.com/gpu available: 0/4 (fully allocated)
  node-gpu-02: nvidia.com/gpu available: 2/4 ✓
  node-gpu-03: nvidia.com/gpu available: 4/4 ✓
         ↓
Applies scoring algorithms:
  node-gpu-02: score 75 (50% GPU utilization)
  node-gpu-03: score 90 (0% GPU utilization, better choice)
         ↓
Selects node-gpu-03
         ↓
Creates binding: pod cuda-mem-test → node-gpu-03
         ↓
Updates pod: status.nodeName: node-gpu-03

Step 3: kubelet Provisions Container

kubelet on node-gpu-03 receives pod assignment
         ↓
Examines resource requests: nvidia.com/gpu: 2
         ↓
Calls device plugin's Allocate() via gRPC:
{
  "containerRequests": [{
    "devicesIDs": ["GPU-uuid-1234", "GPU-uuid-5678"]
  }]
}
         ↓
Device plugin responds:
{
  "containerResponses": [{
    "devices": [
      {"hostPath": "/dev/nvidia0", "containerPath": "/dev/nvidia0"},
      {"hostPath": "/dev/nvidia1", "containerPath": "/dev/nvidia1"},
      {"hostPath": "/dev/nvidiactl", "containerPath": "/dev/nvidiactl"},
      {"hostPath": "/dev/nvidia-uvm", "containerPath": "/dev/nvidia-uvm"}
    ],
    "mounts": [
      {
        "hostPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.535.104.05",
        "containerPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.1"
      },
      {
        "hostPath": "/usr/bin/nvidia-smi",
        "containerPath": "/usr/bin/nvidia-smi"
      }
      // ... more libraries
    ],
    "envs": {
      "NVIDIA_VISIBLE_DEVICES": "GPU-uuid-1234,GPU-uuid-5678",
      "NVIDIA_DRIVER_CAPABILITIES": "compute,utility"
    }
  }]
}

Step 4: Container Runtime Configuration

kubelet → containerd CRI: CreateContainer
         ↓
containerd creates OCI spec:
{
  "linux": {
    "devices": [
      {"path": "/dev/nvidia0", "type": "c", "major": 195, "minor": 0},
      {"path": "/dev/nvidia1", "type": "c", "major": 195, "minor": 1},
      {"path": "/dev/nvidiactl", "type": "c", "major": 195, "minor": 255},
      {"path": "/dev/nvidia-uvm", "type": "c", "major": 509, "minor": 0}
    ],
    "resources": {
      "devices": [
        {"allow": false, "access": "rwm"},  // Deny all by default
        {"allow": true, "type": "c", "major": 195, "minor": 0, "access": "rwm"},
        {"allow": true, "type": "c", "major": 195, "minor": 1, "access": "rwm"},
        {"allow": true, "type": "c", "major": 195, "minor": 255, "access": "rwm"},
        {"allow": true, "type": "c", "major": 509, "minor": 0, "access": "rwm"}
      ]
    }
  },
  "mounts": [...],
  "process": {
    "env": [
      "NVIDIA_VISIBLE_DEVICES=GPU-uuid-1234,GPU-uuid-5678",
      "NVIDIA_DRIVER_CAPABILITIES=compute,utility"
    ]
  }
}
         ↓
containerd calls runc with nvidia-container-runtime-hook
         ↓
Hook performs final configuration and mounts
         ↓
Container starts
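The deny-by-default device list in the OCI spec above follows a fixed pattern, so it can be generated from a list of (major, minor) pairs. The Python sketch below builds only the linux.resources.devices fragment; device_cgroup_rules is a hypothetical helper and not how containerd is implemented.

```python
import json


def device_cgroup_rules(devices):
    """Build the linux.resources.devices list: one deny-all rule first,
    then one allow rule per (major, minor) character device."""
    rules = [{"allow": False, "access": "rwm"}]
    for major, minor in devices:
        rules.append({
            "allow": True,
            "type": "c",
            "major": major,
            "minor": minor,
            "access": "rwm",
        })
    return rules


if __name__ == "__main__":
    # nvidia0, nvidia1, nvidiactl, nvidia-uvm, numbered as in the spec above
    allowed = [(195, 0), (195, 1), (195, 255), (509, 0)]
    print(json.dumps(device_cgroup_rules(allowed), indent=2))
```

Order matters here: the deny-all entry comes first so that the later allow entries carve out exceptions instead of being overridden.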

Step 5: CUDA Application Runs

Inside the container, our CUDA application executes:

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    printf("Visible GPUs: %d\n", deviceCount);
    
    for (int i = 0; i < deviceCount; i++) {
        cudaSetDevice(i);
        
        float *d_data;
        size_t size = 1024 * 1024 * 1024;  // 1 GB
        
        cudaError_t err = cudaMalloc(&d_data, size);
        if (err == cudaSuccess) {
            printf("GPU %d: Allocated 1 GB\n", i);
            cudaFree(d_data);
        }
    }
    
    return 0;
}

The execution flow:

Application calls: cudaGetDeviceCount(&deviceCount)
         ↓
CUDA Runtime (libcudart.so): cuDeviceGetCount()
         ↓
CUDA Driver (libcuda.so):
  - Enumerates the GPUs whose device nodes were injected into the container
    (selected at container creation from NVIDIA_VISIBLE_DEVICES="GPU-uuid-1234,GPU-uuid-5678")
  - Returns: deviceCount = 2
         ↓
Application prints: "Visible GPUs: 2"
         ↓
Application calls: cudaMalloc(&d_data, 1GB) for GPU 0
         ↓
CUDA Runtime: cuMemAlloc(1073741824)  // 1 GB in bytes
         ↓
CUDA Driver:
  - Maps device ordinal 0 to the first GPU visible in the container
  - Device 0 → Physical GPU-uuid-1234 → /dev/nvidia0
  - Opens file descriptor: fd = open("/dev/nvidia0", O_RDWR)
         ↓
Kernel checks cgroups:
  - Process in cgroup: /kubepods/pod-xyz/container-abc
  - Requested device: major=195, minor=0
  - cgroups device allowlist: c 195:0 rwm ✓ ALLOWED
         ↓
Kernel forwards to nvidia.ko driver
         ↓
nvidia.ko driver:
  - Allocates 1 GB of GPU memory on physical GPU
  - Programs GPU memory controller
  - Returns device memory address: 0x7f8c40000000
         ↓
CUDA Driver returns to application
         ↓
Application prints: "GPU 0: Allocated 1 GB"
         ↓
[Repeat for GPU 1 with /dev/nvidia1]
         ↓
Application prints: "GPU 1: Allocated 1 GB"

System calls involved:

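# Simplified strace-style trace; the NVIDIA_IOC_* names are illustrative placeholders (a real trace shows raw ioctl request codes)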
openat(AT_FDCWD, "/dev/nvidia0", O_RDWR) = 3
ioctl(3, NVIDIA_IOC_QUERY_DEVICE_CLASS, ...) = 0
ioctl(3, NVIDIA_IOC_CARD_INFO, ...) = 0
ioctl(3, NVIDIA_IOC_ALLOC_MEM, {size=1073741824, ...}) = 0
# ... GPU memory now allocated ...
ioctl(3, NVIDIA_IOC_FREE_MEM, ...) = 0
close(3) = 0

Layer 7: Advanced GPU Sharing

Modern GPU workloads often don’t need an entire GPU. Several technologies enable GPU sharing:

Multi-Instance GPU (MIG)

NVIDIA data center GPUs such as the A100 and H100 support hardware-level partitioning of one physical GPU into multiple isolated instances.

MIG Architecture

A single A100 GPU can be divided into up to 7 instances:

Physical A100 (40GB, 7 compute slices)
├─ MIG Instance 0: 3g.20gb (3 compute slices, 20GB memory)
├─ MIG Instance 1: 2g.10gb (2 compute slices, 10GB memory)
├─ MIG Instance 2: 1g.5gb  (1 compute slice, 5GB memory)
└─ MIG Instance 3: 1g.5gb  (1 compute slice, 5GB memory)
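A proposed MIG layout has to fit within the GPU's slice and memory budget. The Python sketch below runs that arithmetic for an A100 40GB (7 compute slices, 40 GB). It is a necessary check only: the driver also enforces placement rules that this sketch ignores, and parse_profile and layout_fits are hypothetical helper names.

```python
import re


def parse_profile(name):
    """Split a MIG profile name like '3g.20gb' into
    (compute_slices, memory_gb)."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", name)
    if match is None:
        raise ValueError(f"not a MIG profile name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def layout_fits(profiles, compute_slices=7, memory_gb=40):
    """True if the profiles' combined compute slices and memory stay
    within the GPU's budget (defaults model an A100 40GB)."""
    parsed = [parse_profile(profile) for profile in profiles]
    total_compute = sum(compute for compute, _ in parsed)
    total_memory = sum(memory for _, memory in parsed)
    return total_compute <= compute_slices and total_memory <= memory_gb


if __name__ == "__main__":
    print(layout_fits(["3g.20gb", "2g.10gb", "1g.5gb", "1g.5gb"]))
    print(layout_fits(["3g.20gb", "3g.20gb", "1g.5gb"]))
```

The second layout stays within 7 compute slices but needs 45 GB, so it does not fit on a 40 GB card.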

Each MIG instance:

Has dedicated compute resources (streaming multiprocessors)
Has dedicated memory partition
Provides hardware-level isolation
Appears as a separate GPU device

MIG Device Files

# Enable MIG mode
$ nvidia-smi -i 0 -mig 1

# Create MIG instances (-C also creates the default compute instance)
$ nvidia-smi mig -cgi 3g.20gb -C
$ nvidia-smi mig -cgi 2g.10gb -C
$ nvidia-smi mig -cgi 1g.5gb -C
$ nvidia-smi mig -cgi 1g.5gb -C

# The parent GPU keeps its usual device files

$ ls -l /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm
crw-rw-rw- 1 root root 195,   0 Oct 23 09:00 /dev/nvidia0          # Parent GPU
crw-rw-rw- 1 root root 195, 255 Oct 23 09:00 /dev/nvidiactl
crw-rw-rw- 1 root root 509,   0 Oct 23 09:00 /dev/nvidia-uvm

# MIG instances do not get /dev/nvidiaN-style nodes of their own. Access to each
# instance is granted through capability device nodes under /dev/nvidia-caps/;
# the minor numbers for every GPU instance and compute instance are listed in
# /proc/driver/nvidia-caps/mig-minors.

MIG in Kubernetes

The NVIDIA Device Plugin discovers MIG instances and, with the mixed MIG strategy, advertises each profile as a separate resource:

apiVersion: v1
kind: Node
status:
  capacity:
    nvidia.com/mig-3g.20gb: "1"
    nvidia.com/mig-2g.10gb: "1"
    nvidia.com/mig-1g.5gb: "2"
  allocatable:
    nvidia.com/mig-3g.20gb: "1"
    nvidia.com/mig-2g.10gb: "1"
    nvidia.com/mig-1g.5gb: "2"

Pods can request specific MIG profiles:

apiVersion: v1
kind: Pod
metadata:
  name: mig-pod
spec:
  containers:
  - name: cuda-app
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/mig-3g.20gb: 1  # Request one 3g.20gb instance

MIG Benefits

True hardware isolation (unlike time-slicing)
Guaranteed memory allocation
Fault isolation (one instance failure doesn’t affect others)
Quality of Service (QoS) guarantees

MIG Trade-offs

Partitions must follow the fixed profiles that the device supports, so there is less control over the GPU partitioning layout

GPU Time-Slicing

For workloads that don’t require full GPU utilization, time-slicing allows multiple containers to share a single GPU.

Device Plugin ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: true
        resources:
          - name: nvidia.com/gpu
            replicas: 4

With this configuration:

Each physical GPU appears as 4 schedulable resources
Kubernetes can schedule 4 pods per GPU
All pods access the same physical GPU
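The effect of replicas on what the scheduler sees is plain multiplication, and failRequestsGreaterThanOne adds one admission rule. The Python sketch below models both; it assumes the flag rejects any container that asks for more than one shared GPU, and the function names are hypothetical.

```python
def advertised_gpus(physical_gpus, replicas):
    """Number of nvidia.com/gpu resources the node advertises when every
    physical GPU is exposed as `replicas` time-sliced copies."""
    return physical_gpus * replicas


def request_admitted(gpus_requested, fail_requests_greater_than_one=True):
    """Assumed behavior: with the flag set, a request for more than one
    shared GPU is rejected, because extra replicas do not add capacity."""
    if fail_requests_greater_than_one and gpus_requested > 1:
        return False
    return True


if __name__ == "__main__":
    # A node with 4 physical GPUs and replicas: 4
    print(advertised_gpus(4, 4))
    print(request_admitted(1), request_admitted(2))
```

The rejection rule exists because two replicas may land on the same physical GPU, so requesting two does not mean getting twice the compute.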

How Time-Slicing Works

Pod 1 Container
     ↓
NVIDIA_VISIBLE_DEVICES=GPU-0
     ↓
cudaMalloc() → /dev/nvidia0

Pod 2 Container
     ↓
NVIDIA_VISIBLE_DEVICES=GPU-0  # Same GPU!
     ↓
cudaMalloc() → /dev/nvidia0

Pod 3 Container
     ↓
NVIDIA_VISIBLE_DEVICES=GPU-0  # Same GPU!
     ↓
cudaMalloc() → /dev/nvidia0

All containers:

See the same GPU device
Create separate CUDA contexts
GPU hardware time-multiplexes between contexts
No memory limits or fault isolation (all pods draw from the same GPU memory pool, so one pod can exhaust it for the others)

Time-slicing characteristics:

Pros:
- Easy to configure
- Works with any GPU
- Higher utilization for bursty workloads
Cons:
- No memory limits or fault isolation between pods (a risk for multi-tenant use)
- No performance (QoS) guarantees
- One container can starve others
- OOM on one container affects all

Best for:

Development/testing environments
Interactive workloads (Jupyter notebooks)
Bursty inference workloads with low duty cycle

vGPU (Virtual GPU)

NVIDIA vGPU technology provides software-defined GPU sharing with:

Hypervisor-level virtualization
Memory isolation between VMs
QoS policies and scheduling
Live migration support

Hypervisor (VMware vSphere / KVM)
├─ VM 1: vGPU (4GB, 1/4 GPU compute)
├─ VM 2: vGPU (4GB, 1/4 GPU compute)
├─ VM 3: vGPU (8GB, 1/2 GPU compute)
└─ Physical GPU (16GB total)

Each vGPU appears as a complete GPU to the guest OS, enabling standard CUDA applications without modification. On Kubernetes, vGPU-backed workloads can be run through a VM-based runtime such as Kata Containers.

Note: vGPU requires an NVIDIA vGPU license.

Comparison Matrix

Technology      Isolation    Memory       Performance    Flexibility    Use Case
------------    ---------    ---------    -----------    -----------    -----------------------
Full GPU        Hardware     Dedicated    100%           Low            Training, HPC
MIG             Hardware     Dedicated    Guaranteed     Medium         Inference, Multi-tenant
Time-Slicing    None         Shared       Variable       High           Dev/Test, Jupyter
vGPU            Software     Isolated     Good           High           VDI, Cloud VMs

The Container Device Interface (CDI) Revolution

In 2023-2024, the container ecosystem began transitioning to the Container Device Interface (CDI) — a standardized specification that fundamentally changes how devices are exposed to containers.

The Problem CDI Solves

The Old Way: Vendor-Specific Runtime Hooks

Before CDI, each hardware vendor needed custom integration:

┌─────────────────────────────────────────┐
│   Container Runtime (containerd)        │
└─────────────┬───────────────────────────┘
              │
              ↓
┌─────────────────────────────────────────┐
│   nvidia-container-runtime (wrapper)    │  ← NVIDIA-specific
└─────────────┬───────────────────────────┘
              │
              ↓
┌─────────────────────────────────────────┐
│   nvidia-container-runtime-hook         │  ← Vendor logic
└─────────────┬───────────────────────────┘
              │
              ↓
┌─────────────────────────────────────────┐
│   nvidia-container-cli                  │  ← Device provisioning
└─────────────────────────────────────────┘

Problems:

Vendor Lock-in: AMD needed rocm-container-runtime, Intel their own
Runtime Coupling: Required wrapping or modifying the container runtime
Complex Integration: Each vendor’s device plugin needed runtime-specific knowledge
No Standardization: Every vendor solved the problem differently

The New Way: Declarative Device Specifications

Instead of runtime hooks, CDI uses a static JSON/YAML file on each node (generated once by the vendor tool) that declaratively describes everything a runtime needs to inject a device into a container: device nodes, library mounts, environment variables, and hooks.

The container runtime reads this file at container creation time and applies the edits directly to the OCI spec — no vendor wrapper required.

CDI Architecture

┌──────────────────────────────────────────┐
│   Container Orchestrator                 │
│   (Kubernetes, Podman, Docker)           │
└─────────────┬────────────────────────────┘
              │ Request: "nvidia.com/gpu=0"
              ↓
┌──────────────────────────────────────────┐
│   Container Runtime                      │
│   (containerd, CRI-O, Docker)            │
│   + Native CDI Support                   │
└─────────────┬────────────────────────────┘
              │ Reads CDI specs from disk
              ↓
┌──────────────────────────────────────────┐
│   CDI Specification Files                │
│   /etc/cdi/*.yaml                        │
│   /var/run/cdi/*.json                    │
└─────────────┬────────────────────────────┘
              │ Describes device configuration
              ↓
┌──────────────────────────────────────────┐
│   Host System Resources                  │
│   - Device nodes (/dev/nvidia*)          │
│   - Libraries (libcuda.so, etc.)         │
│   - Utilities (nvidia-smi)               │
└──────────────────────────────────────────┘

A CDI spec file (/etc/cdi/nvidia.yaml) is generated once by nvidia-ctk and contains three main sections:

# /etc/cdi/nvidia.yaml
cdiVersion: "0.6.0"          # CDI specification version
kind: nvidia.com/gpu          # Fully-qualified device kind (vendor.com/type)
                              # Prevents collisions: nvidia.com/gpu, amd.com/gpu, intel.com/gpu
devices:
  - name: "0"
    containerEdits:           # Everything to inject for this device
      deviceNodes:
        - path: /dev/nvidia0
          type: c
          major: 195
          minor: 0
        - path: /dev/nvidiactl
          type: c
          major: 195
          minor: 255
        - path: /dev/nvidia-uvm
          type: c
          major: 509
          minor: 0
      mounts:
        - hostPath: /usr/lib/x86_64-linux-gnu/libcuda.so.535.104.05
          containerPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
          options: ["ro", "nosuid", "nodev", "bind"]
        - hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.104.05
          containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
          options: ["ro", "nosuid", "nodev", "bind"]
        - hostPath: /usr/bin/nvidia-smi
          containerPath: /usr/bin/nvidia-smi
          options: ["ro", "nosuid", "nodev", "bind"]
      env:
        - "NVIDIA_VISIBLE_DEVICES=0"
        - "NVIDIA_DRIVER_CAPABILITIES=compute,utility"
      hooks:
        - hookName: createContainer
          path: /usr/bin/nvidia-ctk
          args: ["hook", "update-ldcache"]

  - name: "1"
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia1
          type: c
          major: 195
          minor: 1
        - path: /dev/nvidiactl
          type: c
          major: 195
          minor: 255
        - path: /dev/nvidia-uvm
          type: c
          major: 509
          minor: 0
      mounts:
        # ... same libraries as device "0" ...
      env:
        - "NVIDIA_VISIBLE_DEVICES=1"
        - "NVIDIA_DRIVER_CAPABILITIES=compute,utility"

CDI vs Traditional Flow Comparison

Traditional NVIDIA Container Toolkit Flow

1. User runs container:
   docker run --gpus all nvidia/cuda
         ↓
2. Docker daemon calls nvidia-container-runtime
         ↓
3. nvidia-container-runtime wraps runc
         ↓
4. Prestart hook executes: nvidia-container-runtime-hook
         ↓
5. Hook reads --gpus flag and NVIDIA_VISIBLE_DEVICES
         ↓
6. nvidia-container-cli dynamically queries the driver (via NVML)
         ↓
7. Determines required devices, libraries, mounts
         ↓
8. Modifies OCI spec on-the-fly (adds devices, mounts, env)
         ↓
9. runc creates container with GPU access

Characteristics:

Dynamic device discovery at container start
Runtime wrapper required
Vendor-specific magic in environment variables
Black box: hard to inspect what’s being configured

CDI-Based Flow

1. One-time setup (on node):
   nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
         ↓
2. User runs container:
   docker run --device nvidia.com/gpu=0 nvidia/cuda
         ↓
3. Container engine (with native CDI support) receives request
         ↓
4. Parses CDI device name: "nvidia.com/gpu=0"
         ↓
5. Looks up device in /etc/cdi/nvidia.yaml
         ↓
6. Reads containerEdits for device "0"
         ↓
7. Applies edits to OCI spec:
   - Adds device nodes
   - Adds mounts
   - Sets environment variables
   - Registers hooks
         ↓
8. runc creates container with GPU access

Characteristics:

Static device specification (generated once)
No runtime wrapper needed
Standard OCI runtime (runc) works unmodified
Transparent: inspect CDI specs to see exact configuration
Vendor provides only CDI spec generator
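
Steps 4 to 6 of the CDI-based flow amount to a name parse followed by a lookup. The sketch below illustrates that under a few assumptions: the types are a trimmed-down, hypothetical subset of the spec fields shown earlier (not the official CDI Go API), and the spec is given in JSON, which CDI accepts alongside YAML, so that only the standard library is needed.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Hypothetical minimal subset of a CDI spec; real specs carry more fields.
type deviceNode struct {
	Path string `json:"path"`
}
type containerEdits struct {
	DeviceNodes []deviceNode `json:"deviceNodes"`
	Env         []string     `json:"env"`
}
type device struct {
	Name           string         `json:"name"`
	ContainerEdits containerEdits `json:"containerEdits"`
}
type spec struct {
	Kind    string   `json:"kind"`
	Devices []device `json:"devices"`
}

// loadSpec decodes a JSON-encoded CDI spec; unknown fields are ignored.
func loadSpec(data string) (spec, error) {
	var s spec
	err := json.Unmarshal([]byte(data), &s)
	return s, err
}

// parseQualifiedName splits "vendor.com/class=name" into kind and device name.
func parseQualifiedName(q string) (string, string, error) {
	kind, name, ok := strings.Cut(q, "=")
	if !ok || name == "" || !strings.Contains(kind, "/") {
		return "", "", fmt.Errorf("invalid CDI device name %q", q)
	}
	return kind, name, nil
}

// lookup returns the containerEdits for a fully-qualified device name.
func lookup(s spec, qualified string) (containerEdits, error) {
	kind, name, err := parseQualifiedName(qualified)
	if err != nil {
		return containerEdits{}, err
	}
	if kind != s.Kind {
		return containerEdits{}, fmt.Errorf("spec kind %q does not match %q", s.Kind, kind)
	}
	for _, d := range s.Devices {
		if d.Name == name {
			return d.ContainerEdits, nil
		}
	}
	return containerEdits{}, fmt.Errorf("device %q not found", qualified)
}

const specJSON = `{
  "cdiVersion": "0.6.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {"name": "0", "containerEdits": {
      "deviceNodes": [{"path": "/dev/nvidia0"}, {"path": "/dev/nvidiactl"}],
      "env": ["NVIDIA_VISIBLE_DEVICES=0"]}}
  ]
}`

func main() {
	s, err := loadSpec(specJSON)
	if err != nil {
		panic(err)
	}
	edits, err := lookup(s, "nvidia.com/gpu=0")
	if err != nil {
		panic(err)
	}
	for _, n := range edits.DeviceNodes {
		fmt.Println("inject device node:", n.Path)
	}
}
```

Because the lookup is pure data, the same result can be reproduced by reading the spec file by hand, which is the transparency benefit listed above.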

CDI in Kubernetes

With CDI, the device plugin is only responsible for returning CDI device names to the kubelet.

Pre-CDI Device Plugin

import (
    "context"

    pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

func (m *NvidiaDevicePlugin) Allocate(
    ctx context.Context,
    req *pluginapi.AllocateRequest,
) (*pluginapi.AllocateResponse, error) {
    responses := pluginapi.AllocateResponse{}

    // One response per container request. The values below are hard-coded
    // for illustration; a real plugin derives them from the requested device IDs.
    for range req.ContainerRequests {
        // Device plugin must know HOW to provision GPU
        response := pluginapi.ContainerAllocateResponse{
            Envs: map[string]string{
                "NVIDIA_VISIBLE_DEVICES": "GPU-uuid-1234",
            },
            Mounts: []*pluginapi.Mount{
                {
                    HostPath:      "/usr/lib/x86_64-linux-gnu/libcuda.so",
                    ContainerPath: "/usr/lib/x86_64-linux-gnu/libcuda.so",
                    ReadOnly:      true,
                },
                // ... many more mounts ...
            },
            Devices: []*pluginapi.DeviceSpec{
                {
                    HostPath:      "/dev/nvidia0",
                    ContainerPath: "/dev/nvidia0",
                    Permissions:   "rwm",
                },
                {
                    HostPath:      "/dev/nvidiactl",
                    ContainerPath: "/dev/nvidiactl",
                    Permissions:   "rwm",
                },
                // ... more devices ...
            },
        }
        responses.ContainerResponses = append(
            responses.ContainerResponses,
            &response,
        )
    }

    return &responses, nil
}

Post-CDI Device Plugin

import (
    "context"
    "fmt"

    pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

func (m *NvidiaDevicePlugin) Allocate(
    ctx context.Context,
    req *pluginapi.AllocateRequest,
) (*pluginapi.AllocateResponse, error) {
    responses := pluginapi.AllocateResponse{}

    for _, request := range req.ContainerRequests {
        // Device plugin just returns CDI device names!
        var cdiDevices []*pluginapi.CDIDevice
        for _, deviceID := range request.DevicesIDs {
            // Each entry carries a fully-qualified CDI name, e.g. "nvidia.com/gpu=0"
            cdiDevices = append(cdiDevices, &pluginapi.CDIDevice{
                Name: fmt.Sprintf("nvidia.com/gpu=%s", deviceID),
            })
        }

        response := pluginapi.ContainerAllocateResponse{
            CDIDevices: cdiDevices, // That's it!
        }
        responses.ContainerResponses = append(
            responses.ContainerResponses,
            &response,
        )
    }

    return &responses, nil
}

Key simplification: The device plugin no longer needs vendor-specific knowledge about mounts, device nodes, or environment variables. It simply returns CDI device identifiers.

Container Runtime Integration

When kubelet creates a container with CDI devices:

kubelet receives CDI device names from device plugin:
  ["nvidia.com/gpu=0", "nvidia.com/gpu=1"]
         ↓
kubelet adds CDI annotation to container config:
  annotations: {
    "cdi.k8s.io/devices": "nvidia.com/gpu=0,nvidia.com/gpu=1"
  }
         ↓
kubelet → containerd CRI: CreateContainer
         ↓
containerd reads CDI annotation
         ↓
containerd loads CDI registry from /etc/cdi/*.yaml
         ↓
For each CDI device:
  registry.GetDevice("nvidia.com/gpu=0")
  registry.GetDevice("nvidia.com/gpu=1")
         ↓
Applies container edits to OCI spec:
  - Merges all device nodes
  - Merges all mounts
  - Merges all environment variables
  - Collects all hooks
         ↓
Creates final OCI spec and calls runc
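
The merge step in the flow above matters because devices "0" and "1" both list shared nodes such as /dev/nvidiactl and /dev/nvidia-uvm. A minimal sketch of such a merge follows; the edits type and the dedupe-by-exact-string policy are assumptions made for illustration, and the sketch does not claim to mirror how containerd or the CDI library resolve duplicates or conflicting environment variables.

```go
package main

import "fmt"

// edits is a hypothetical, trimmed-down stand-in for a CDI containerEdits entry.
type edits struct {
	DeviceNodes []string // device node paths
	Env         []string // "KEY=value" entries
}

// appendUnique appends items to dst, skipping strings already recorded in seen.
func appendUnique(dst []string, seen map[string]bool, items []string) []string {
	for _, it := range items {
		if !seen[it] {
			seen[it] = true
			dst = append(dst, it)
		}
	}
	return dst
}

// mergeEdits combines the edits of several devices, preserving first-seen order.
// Identical device nodes and identical env entries are kept only once.
func mergeEdits(all ...edits) edits {
	var out edits
	seenNodes, seenEnv := map[string]bool{}, map[string]bool{}
	for _, e := range all {
		out.DeviceNodes = appendUnique(out.DeviceNodes, seenNodes, e.DeviceNodes)
		out.Env = appendUnique(out.Env, seenEnv, e.Env)
	}
	return out
}

func main() {
	gpu0 := edits{
		DeviceNodes: []string{"/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"},
		Env:         []string{"NVIDIA_DRIVER_CAPABILITIES=compute,utility"},
	}
	gpu1 := edits{
		DeviceNodes: []string{"/dev/nvidia1", "/dev/nvidiactl", "/dev/nvidia-uvm"},
		Env:         []string{"NVIDIA_DRIVER_CAPABILITIES=compute,utility"},
	}
	merged := mergeEdits(gpu0, gpu1)
	fmt.Println(merged.DeviceNodes)
	fmt.Println(merged.Env)
}
```

Note that entries with the same key but different values (for example two different NVIDIA_VISIBLE_DEVICES settings) are not reconciled by this sketch; both would be kept in order.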

Generating CDI Specifications

NVIDIA Container Toolkit

# Basic generation
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# With custom options
nvidia-ctk cdi generate \
  --output=/etc/cdi/nvidia.yaml \
  --format=yaml \
  --device-name-strategy=index \
  --driver-root=/ \
  --nvidia-ctk-path=/usr/bin/nvidia-ctk \
  --ldconfig-path=/sbin/ldconfig

AMD ROCm

rocm-smi --showdriverversion
amd-ctk cdi generate --output=/etc/cdi/amd.json

Dynamic Resource Allocation (DRA): Next-Generation GPU Scheduling

The Device Plugin framework (Layer 4) works well for simple whole-GPU assignment, but it has fundamental limitations when workloads need fine-grained control: specific MIG profiles, multi-node NVLink topology, shared resources, or per-claim lifecycle management. Kubernetes Dynamic Resource Allocation (DRA), which graduated to GA as resource.k8s.io/v1 in Kubernetes 1.34 after reaching beta in 1.32, addresses these limitations by replacing the opaque device plugin gRPC API with a structured, declarative model visible to the scheduler.

The official DRA driver for NVIDIA GPUs is maintained at github.com/kubernetes-sigs/dra-driver-nvidia-gpu under the kubernetes-sigs organisation.

Why Device Plugin Falls Short

Limitation	Device Plugin Behaviour
Resource granularity	Allocates whole devices; MIG is bolted on via separate resource names
Topology awareness	Scheduler has no visibility into NVLink or NUMA topology
Shared resources	No first-class concept; time-slicing is a plugin-level workaround
Lifecycle	GPU bound to pod at creation; cannot be pre-allocated or shared across pods
Introspection	Allocation decisions are a black box to the control plane

DRA Core Concepts

DRA replaces the device plugin gRPC interface with three Kubernetes API objects.

ResourceSlice — Driver Advertises Devices

A DRA driver publishes ResourceSlice objects (one per node) instead of calling ListAndWatch(). Each slice describes the devices on that node with structured, queryable attributes:

apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: node-gpu-01-nvidia-gpus
spec:
  driver: gpu.nvidia.com
  pool:
    name: node-gpu-01
    resourceSliceCount: 1
  nodeName: node-gpu-01
  devices:
  - name: gpu-0
    attributes:                 # in resource.k8s.io/v1 these sit directly on the device
      uuid:        { string: "GPU-a4f8c2d1-e5f6-7a8b-9c0d-1e2f3a4b5c6d" }
      model:       { string: "NVIDIA H100 SXM5 80GB" }
      profile:     { string: "3g.20gb" }        # populated for MIG slices
      parentUUID:  { string: "GPU-a4f8c2d1..." } # used for co-location constraints
    capacity:
      memory:
        value: 80Gi

DeviceClass — Cluster Policy for a Device Type

DeviceClass is a cluster-scoped object set by administrators. The NVIDIA DRA driver registers two device classes out of the box:

gpu.nvidia.com — whole GPU devices
mig.nvidia.com — MIG (Multi-Instance GPU) slices

ResourceClaim — User Requests Devices

Instead of resources.limits.nvidia.com/gpu: 1, a workload creates a ResourceClaim. The exactly: stanza names the device class and can optionally set how many devices are required and which CEL selectors they must match:

# A pod that gets its own GPU through a per-pod claim generated from the template
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate       # per-pod claims for Jobs / Deployments
metadata:
  namespace: gpu-test1
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test1
  name: pod1
spec:
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

Two containers in the same pod can share one GPU claim by both referencing the same entry:

# One pod, two containers sharing one GPU
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test2
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test2
  name: shared-gpu-pod
spec:
  resourceClaims:
  - name: shared-gpu
    resourceClaimTemplateName: single-gpu
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu   # both containers reference the same claim
  - name: ctr1
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
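
Sharing works purely by name: each containers[].resources.claims[].name must match an entry in spec.resourceClaims, and containers that name the same entry receive the same claim. The following sketch shows that resolution with hypothetical, trimmed-down types (not the Kubernetes API structs); claimUsers is an illustrative helper, not an existing function.

```go
package main

import "fmt"

// Hypothetical, trimmed-down shapes of the Pod fields used in the manifests above.
type container struct {
	Name   string
	Claims []string // resources.claims[].name
}

type pod struct {
	ResourceClaims []string // spec.resourceClaims[].name
	Containers     []container
}

// claimUsers maps each pod-level claim name to the containers referencing it
// and reports references that do not resolve to a pod-level entry.
func claimUsers(p pod) (map[string][]string, []error) {
	users := map[string][]string{}
	for _, c := range p.ResourceClaims {
		users[c] = nil // register the claim even if no container uses it
	}
	var errs []error
	for _, ctr := range p.Containers {
		for _, ref := range ctr.Claims {
			if _, ok := users[ref]; !ok {
				errs = append(errs, fmt.Errorf(
					"container %q references unknown claim %q", ctr.Name, ref))
				continue
			}
			users[ref] = append(users[ref], ctr.Name)
		}
	}
	return users, errs
}

func main() {
	p := pod{
		ResourceClaims: []string{"shared-gpu"},
		Containers: []container{
			{Name: "ctr0", Claims: []string{"shared-gpu"}},
			{Name: "ctr1", Claims: []string{"shared-gpu"}},
		},
	}
	users, errs := claimUsers(p)
	fmt.Println(users["shared-gpu"], len(errs)) // prints: [ctr0 ctr1] 0
}
```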

NVIDIA DRA Driver (`dra-driver-nvidia-gpu`)

The driver is maintained at github.com/kubernetes-sigs/dra-driver-nvidia-gpu and ships two kubelet plugins:

Plugin	Status	Purpose
`gpu-kubelet-plugin`	Experimental	Whole-GPU and MIG device allocation
`compute-domain-kubelet-plugin`	Officially supported	Multi-Node NVLink / ComputeDomain orchestration

Architecture

┌────────────────────────────────────────────────────┐
│   kube-apiserver                                   │
│   ResourceSlice, ResourceClaim, DeviceClass        │
└─────────────┬──────────────────────────────────────┘
              │
              ↓
┌────────────────────────────────────────────────────┐
│   kube-scheduler (DRA-aware)                       │
│   Reads ResourceSlice attributes via CEL           │
│   Writes allocation into ResourceClaim.status      │
└─────────────┬──────────────────────────────────────┘
              │
              ↓
┌────────────────────────────────────────────────────┐
│   kubelet                                          │
│   Calls DRA plugin NodePrepareResources() gRPC     │
└─────────────┬──────────────────────────────────────┘
              │
              ↓
┌────────────────────────────────────────────────────────────┐
│  dra-driver-nvidia-gpu (DaemonSet on every GPU node)       │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  gpu-kubelet-plugin  (experimental)                 │   │
│  │  - NodePrepareResources / NodeUnprepareResources    │   │
│  │  - Writes CDI spec for the allocated GPU/MIG slice  │   │
│  └─────────────────────────────────────────────────────┘   │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  compute-domain-kubelet-plugin  (supported)         │   │
│  │  - Orchestrates IMEX daemons, domains, channels     │   │
│  │  - Guarantees NVLink-reachability across nodes      │   │
│  └─────────────────────────────────────────────────────┘   │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  controller (Deployment on control-plane)           │   │
│  │  - Publishes ResourceSlice objects per node         │   │
│  │  - Watches GPU inventory changes                    │   │
│  └─────────────────────────────────────────────────────┘   │
└────────────────┬───────────────────────────────────────────┘
                 │  CDI device name
                 ↓
┌────────────────────────────────────────────────────┐
│   containerd (CDI-aware)                           │
│   Reads CDI spec, injects devices/libs/env         │
└────────────────────────────────────────────────────┘

Installing via Helm

The chart image is served from registry.k8s.io/dra-driver-nvidia/dra-driver-nvidia-gpu. GPU allocation is gated behind gpuResourcesEnabledOverride=true because it is still experimental.

helm upgrade -i \
  --create-namespace \
  --namespace dra-driver-nvidia-gpu \
  dra-driver-nvidia-gpu \
  oci://registry.k8s.io/dra-driver-nvidia/dra-driver-nvidia-gpu \
  --set gpuResourcesEnabledOverride=true \
  --wait

# Verify — each GPU node should show a 2-container pod
kubectl -n dra-driver-nvidia-gpu get pods
# NAME                                  READY   STATUS
# dra-driver-nvidia-gpu-node-xxxxx      2/2     Running

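The resource.k8s.io/v1 manifests shown here require Kubernetes 1.34+. On Kubernetes 1.32 and 1.33 DRA is still beta, so the DynamicResourceAllocation feature gate and the corresponding beta API version must be enabled.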

MIG Allocation via DRA

DRA makes MIG allocation first-class. The mig.nvidia.com DeviceClass exposes individual MIG slices as devices in ResourceSlice. CEL selectors on the profile attribute replace the separate nvidia.com/mig-3g.20gb resource names used by the device plugin.

The matchAttribute constraint ensures all requested slices come from the same physical GPU:

# One pod, 4 containers — each getting a different MIG slice from the same A100
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test4
  name: mig-devices
spec:
  spec:
    devices:
      requests:
      - name: mig-1g-5gb-0
        exactly:
          deviceClassName: mig.nvidia.com
          selectors:
          - cel:
              expression: "device.attributes['gpu.nvidia.com'].profile == '1g.5gb'"
      - name: mig-1g-5gb-1
        exactly:
          deviceClassName: mig.nvidia.com
          selectors:
          - cel:
              expression: "device.attributes['gpu.nvidia.com'].profile == '1g.5gb'"
      - name: mig-2g-10gb
        exactly:
          deviceClassName: mig.nvidia.com
          selectors:
          - cel:
              expression: "device.attributes['gpu.nvidia.com'].profile == '2g.10gb'"
      - name: mig-3g-20gb
        exactly:
          deviceClassName: mig.nvidia.com
          selectors:
          - cel:
              expression: "device.attributes['gpu.nvidia.com'].profile == '3g.20gb'"
      constraints:
      - requests: []
        matchAttribute: "gpu.nvidia.com/parentUUID"  # all slices from one GPU

The driver handles MIG instance creation and teardown as part of the claim lifecycle — no manual nvidia-smi mig commands needed.
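
The matchAttribute rule itself is simple to state: every selected device must carry the attribute, and all of them must agree on its value. A small sketch of that check follows; the device type and the function name satisfiesMatchAttribute are hypothetical, and the real scheduler evaluates the constraint while searching for candidates rather than after the fact.

```go
package main

import "fmt"

// device is a hypothetical stand-in for a ResourceSlice device entry.
type device struct {
	Name       string
	Attributes map[string]string
}

// satisfiesMatchAttribute reports whether every device carries attr and all
// devices share the same value for it. An empty selection is trivially valid.
func satisfiesMatchAttribute(devs []device, attr string) bool {
	if len(devs) == 0 {
		return true
	}
	want, ok := devs[0].Attributes[attr]
	if !ok {
		return false
	}
	for _, d := range devs[1:] {
		if v, ok := d.Attributes[attr]; !ok || v != want {
			return false
		}
	}
	return true
}

func main() {
	const attr = "gpu.nvidia.com/parentUUID"
	sameGPU := []device{
		{Name: "mig-0", Attributes: map[string]string{attr: "GPU-A"}},
		{Name: "mig-1", Attributes: map[string]string{attr: "GPU-A"}},
	}
	mixed := []device{
		{Name: "mig-0", Attributes: map[string]string{attr: "GPU-A"}},
		{Name: "mig-7", Attributes: map[string]string{attr: "GPU-B"}},
	}
	fmt.Println(satisfiesMatchAttribute(sameGPU, attr)) // prints: true
	fmt.Println(satisfiesMatchAttribute(mixed, attr))   // prints: false
}
```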

ComputeDomains — Multi-Node NVLink (Officially Supported)

A ComputeDomain is an abstraction for robust, secure Multi-Node NVLink connectivity. It guarantees NVLink-reachability between all pods in the domain and isolates them from external pods. The driver internally orchestrates IMEX (Internode Memory Exchange) daemons, domains, and channels.

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: compute-domain
spec:
  spec:
    devices:
      requests:
      - name: domain
        exactly:
          deviceClassName: computedomain.nvidia.com

Unlike the experimental GPU plugin, ComputeDomain support is officially maintained and production-ready.

DRA Scheduling Flow

User creates ResourceClaim (status: unallocated)
         ↓
kube-scheduler reads ResourceSlice objects from all nodes
         ↓
Evaluates CEL selectors against device attributes
         ↓
Scores and selects the best matching node
         ↓
Scheduler writes result into ResourceClaim.status.allocation:
  {
    "devices": { "results": [
      { "driver": "gpu.nvidia.com", "pool": "node-gpu-01", "device": "gpu-0", "request": "gpu" }
    ]}
  }
         ↓
kubelet on node-gpu-01 sees the bound claim
         ↓
kubelet calls: gpu-kubelet-plugin.NodePrepareResources(claimUID)
         ↓
Driver writes CDI spec for the allocated device
         ↓
kubelet passes CDI device name to containerd
         ↓
containerd applies CDI spec → container starts with GPU access

The key difference from the device plugin flow: the scheduler has full visibility into device attributes and makes the allocation decision, rather than the plugin deciding inside an opaque gRPC call at pod start.
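
To illustrate what "the scheduler makes the allocation decision" means, here is a deliberately simplified first-fit sketch. It assumes hypothetical slice, device and result types, uses a plain Go predicate in place of a CEL selector, and skips node scoring entirely, so it shows the shape of the decision rather than the scheduler's real algorithm.

```go
package main

import "fmt"

// Hypothetical stand-ins for a ResourceSlice and its devices.
type device struct {
	Name       string
	Attributes map[string]string
}

type slice struct {
	Driver, Pool string
	Devices      []device
}

// result mirrors the fields of status.allocation.devices.results shown above.
type result struct {
	Driver, Pool, Device, Request string
}

// allocate picks the first free device accepted by match and marks it in use.
// The match function plays the role of a CEL selector in this sketch.
func allocate(slices []slice, inUse map[string]bool, request string,
	match func(device) bool) (result, bool) {
	for _, s := range slices {
		for _, d := range s.Devices {
			key := s.Driver + "/" + s.Pool + "/" + d.Name
			if inUse[key] || !match(d) {
				continue
			}
			inUse[key] = true
			return result{Driver: s.Driver, Pool: s.Pool, Device: d.Name, Request: request}, true
		}
	}
	return result{}, false
}

func main() {
	slices := []slice{{
		Driver: "gpu.nvidia.com",
		Pool:   "node-gpu-01",
		Devices: []device{
			{Name: "gpu-0", Attributes: map[string]string{"model": "NVIDIA H100 SXM5 80GB"}},
		},
	}}
	inUse := map[string]bool{}
	isH100 := func(d device) bool { return d.Attributes["model"] == "NVIDIA H100 SXM5 80GB" }

	first, ok := allocate(slices, inUse, "gpu", isH100)
	fmt.Printf("%+v %v\n", first, ok)

	// The only matching device is now taken, so a second claim stays unallocated.
	_, ok = allocate(slices, inUse, "gpu", isH100)
	fmt.Println(ok) // prints: false
}
```

Everything the function needs is plain data published in advance, which is what allows the decision to move out of the node-local plugin and into the scheduler.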

DRA vs Device Plugin Comparison

Aspect	Device Plugin	DRA Driver
Resource discovery	gRPC `ListAndWatch()`	`ResourceSlice` Kubernetes objects
Resource request	`resources.limits`	`ResourceClaim` / `ResourceClaimTemplate`
Scheduler visibility	Opaque count only	Full attributes queryable via CEL
Allocation decision	Plugin at pod start	Scheduler at scheduling time
MIG support	Separate resource names per profile	CEL selectors on `profile` attribute
Multi-node NVLink	Not supported	ComputeDomain plugin (officially supported)
Shared GPU between containers	Not supported	Supported via shared `ResourceClaim`
Kubernetes version	Beta since 1.10, GA in 1.26	GA (`v1`) from Kubernetes 1.34

Conclusion

Let's summarize the GPU container enablement flow.

Architecture Components

Kubernetes Scheduler - Selects nodes with GPU resources
NVIDIA Device Plugin - Discovers and advertises GPU devices (traditional path)
NVIDIA DRA Driver - Publishes ResourceSlice objects and prepares devices (modern path)
Kubelet - Manages pod lifecycle
Container Runtime (containerd) - Creates containers
NVIDIA Container Toolkit / CDI - Provides GPU access hooks and declarative device specs
GPU Hardware Layer - Physical NVIDIA GPUs and drivers

Device Plugin Flow

DRA Flow

Key Components

GPU Device Plugin (Traditional Path)

Discovers GPU resources and advertises to Kubernetes via gRPC ListAndWatch
Manages GPU allocation to pods (DaemonSet)

NVIDIA DRA Driver (Modern Path) — `kubernetes-sigs/dra-driver-nvidia-gpu`

Publishes structured ResourceSlice objects describing each GPU’s attributes (gpu.nvidia.com) and MIG slices (mig.nvidia.com)
Implements NodePrepareResources so kubelet can activate allocated devices via CDI
gpu-kubelet-plugin (experimental): CEL-based GPU/MIG selection and lifecycle management
compute-domain-kubelet-plugin (supported): Multi-Node NVLink / ComputeDomain orchestration
Requires Kubernetes 1.34+ for resource.k8s.io/v1 (on 1.32 and 1.33 the DynamicResourceAllocation feature gate and the beta API must be enabled)

Kubelet

Node agent managing pod lifecycle
Communicates with device plugins (traditional) or DRA driver plugins (DRA path) and container runtime

Container Runtime (containerd)

Creates containers and integrates with NVIDIA Container Toolkit or CDI
Mounts GPU devices into containers

NVIDIA Container Toolkit / CDI

Runtime hook for GPU container creation (legacy path)
CDI: declarative JSON/YAML specs for vendor-neutral device injection (modern path used by DRA driver)