A deep dive into how Kubernetes makes GPUs accessible to containers, from bare metal to CUDA applications


Introduction

When you request a GPU in a Kubernetes pod with a simple nvidia.com/gpu: 1 resource limit, an intricate dance of kernel drivers, container runtimes, device plugins, and orchestration layers springs into action. This journey from physical hardware to a running CUDA application involves multiple abstraction layers working in concert.

In this comprehensive guide, we’ll explore every layer of this stack—from PCIe device files to the emerging Container Device Interface (CDI) standard—revealing the elegant complexity that makes GPU-accelerated containerized workloads possible.

Table of Contents

  1. Layer 1: Hardware & Kernel Foundation
  2. Layer 2: Container Runtime GPU Access
  3. Layer 3: CUDA in Containers
  4. Layer 4: Kubernetes GPU Scheduling
  5. Layer 5: GPU Isolation & Visibility
  6. Layer 6: Complete Flow Example
  7. Layer 7: Advanced GPU Sharing
  8. The Container Device Interface (CDI) Revolution
  9. Conclusion

Note: Dynamic Resource Allocation (DRA) is out of scope for this guide; we cover only the Device Plugin approach.


Layer 1: Hardware & Kernel Foundation

Physical GPU Access

At the most fundamental level, a GPU is a PCIe device connected to the host system. The Linux kernel communicates with it through a sophisticated driver stack.

GPU Driver Architecture

The NVIDIA driver (similar concepts apply to AMD and Intel) consists of several kernel modules:

nvidia.ko              # Core driver module
nvidia-uvm.ko          # Unified Memory module
nvidia-modeset.ko      # Display mode setting
nvidia-drm.ko          # Direct Rendering Manager

When loaded, these modules create device files in /dev/:

/dev/nvidia0           # First GPU device
/dev/nvidia1           # Second GPU device
/dev/nvidiactl         # Control device for driver management
/dev/nvidia-uvm        # Unified Virtual Memory device
/dev/nvidia-uvm-tools  # UVM debugging and profiling
/dev/nvidia-modeset    # Mode setting operations

These character devices provide the fundamental interface between userspace applications and GPU hardware.

Device File Permissions

Device files have specific ownership and permissions:

$ ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Oct 23 09:00 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Oct 23 09:00 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Oct 23 09:00 /dev/nvidiactl
crw-rw-rw- 1 root root 509,   0 Oct 23 09:00 /dev/nvidia-uvm

The major number 195 is registered with the Linux kernel for NVIDIA devices (the UVM module's major, 509 here, is assigned dynamically when the module loads). The kernel uses the major number to dispatch operations on these device files to the correct driver, and it is also what the cgroup device controller matches its rules against, as we'll see later.
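
As a quick check, the major and minor numbers can be read programmatically. A minimal Go sketch using the golang.org/x/sys/unix package (the device path is just the first GPU from the listing above):

package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// Stat the first GPU device node and decode its device number.
	var st unix.Stat_t
	if err := unix.Stat("/dev/nvidia0", &st); err != nil {
		log.Fatalf("stat /dev/nvidia0: %v", err)
	}

	// Major 195 routes to nvidia.ko; the minor selects the GPU index.
	dev := uint64(st.Rdev)
	fmt.Printf("major=%d minor=%d\n", unix.Major(dev), unix.Minor(dev))
}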


Layer 2: Container Runtime GPU Access

The Container Isolation Challenge

Containers use Linux namespaces to create isolated environments. By default, a container cannot access the host’s GPU devices because:

  1. Device namespace isolation: Container has its own /dev filesystem
  2. cgroups device controller: Restricts which devices a process can access
  3. Mount namespace: Container filesystem doesn’t include host device files

NVIDIA Container Toolkit: Bridging the Gap

The NVIDIA Container Toolkit (formerly nvidia-docker2) solves this problem by modifying the container creation process.

Component Architecture

┌─────────────────────────────────────────┐
│   Container Runtime (Docker/containerd) │
└──────────────┬──────────────────────────┘
               │
               ↓
┌─────────────────────────────────────────┐
│   nvidia-container-runtime               │
│   (OCI-compliant runtime wrapper)        │
└──────────────┬──────────────────────────┘
               │
               ↓
┌─────────────────────────────────────────┐
│   nvidia-container-runtime-hook          │
│   (Prestart hook)                        │
└──────────────┬──────────────────────────┘
               │
               ↓
┌─────────────────────────────────────────┐
│   nvidia-container-cli                   │
│   (Performs actual GPU provisioning)     │
└──────────────────────────────────────────┘

What Gets Mounted Into the Container

When a container requests GPU access, the NVIDIA Container Toolkit mounts:

Device Files:

/dev/nvidia0              # GPU device
/dev/nvidia1              # Additional GPUs
/dev/nvidiactl            # Control device
/dev/nvidia-uvm           # Unified Memory device
/dev/nvidia-uvm-tools     # UVM tools
/dev/nvidia-modeset       # Mode setting

Driver Libraries (from host):

/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.104.05
/usr/lib/x86_64-linux-gnu/libcuda.so.535.104.05
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.535.104.05
# ... and many more

Utilities:

/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced

cgroups Device Permissions

The toolkit also configures cgroups to allow device access:

# In the container's cgroup
devices.allow: c 195:* rwm    # Allow all NVIDIA devices (major 195)
devices.allow: c 195:255 rwm  # Allow nvidiactl
devices.allow: c 509:* rwm    # Allow nvidia-uvm devices (major 509)

The format c 195:* rwm means:

  • c: Character device
  • 195: Major number (NVIDIA devices)
  • *: All minor numbers (all GPUs)
  • rwm: Read, write, and mknod permissions
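
To make the rule format concrete, here is a minimal Go sketch of writing such a rule under cgroup v1. The cgroup path is hypothetical, and on cgroup v2 there is no devices.allow file at all (the device controller is implemented with an attached eBPF program instead):

package main

import (
	"log"
	"os"
)

func main() {
	// Hypothetical cgroup v1 path for a container's device controller.
	allowFile := "/sys/fs/cgroup/devices/kubepods/pod-example/container-example/devices.allow"

	f, err := os.OpenFile(allowFile, os.O_WRONLY, 0)
	if err != nil {
		log.Fatalf("open devices.allow: %v", err)
	}
	defer f.Close()

	// "c 195:0 rwm" = character device, major 195, minor 0, read/write/mknod.
	if _, err := f.WriteString("c 195:0 rwm"); err != nil {
		log.Fatalf("write device rule: %v", err)
	}
	log.Println("allowed /dev/nvidia0 for this cgroup")
}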

Layer 3: CUDA in Containers

Understanding the CUDA Stack

CUDA applications communicate with GPUs through a layered software stack:

┌──────────────────────────────┐
│   Your CUDA Application      │
│   (compiled with nvcc)       │
└─────────────┬────────────────┘
              │
              ↓
┌──────────────────────────────┐
│   CUDA Runtime API           │
│   (libcudart.so)             │
│   - cudaMalloc()             │
│   - cudaMemcpy()             │
│   - kernel<<<>>>()           │
└─────────────┬────────────────┘
              │
              ↓
┌──────────────────────────────┐
│   CUDA Driver API            │
│   (libcuda.so)               │
│   - cuMemAlloc()             │
│   - cuLaunchKernel()         │
└─────────────┬────────────────┘
              │
              ↓
┌──────────────────────────────┐
│   Kernel Driver              │
│   (nvidia.ko)                │
└─────────────┬────────────────┘
              │
              ↓
┌──────────────────────────────┐
│   Physical GPU Hardware      │
└──────────────────────────────┘

CUDA in a Containerized Environment

When you run a CUDA application inside a container, the call stack looks like:

[Container] Your CUDA Application
                ↓
[Container] libcudart.so (CUDA Runtime)
                ↓
[Mounted from Host] libcuda.so (CUDA Driver Library)
                ↓
[ioctl() system calls]
                ↓
[Mounted Device] /dev/nvidia0
                ↓
[Host Kernel] nvidia.ko driver
                ↓
[Physical Hardware] GPU

The Critical Driver Compatibility Requirement

Key Point: The libcuda.so driver library version must match the host kernel driver version. This is why we mount the driver library from the host rather than packaging it in the container image.

Example compatibility matrix:

Host Driver Version    Compatible CUDA Toolkit Versions
-------------------    --------------------------------
535.104.05            CUDA 11.0 - 12.2
525.85.12             CUDA 11.0 - 12.1
515.65.01             CUDA 11.0 - 11.8

The CUDA toolkit in your container must be compatible with the host’s driver version, but it doesn’t need to match exactly—newer drivers support older CUDA toolkits.
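
One way to check this pairing on a node is to ask the driver itself which CUDA version it supports. A sketch assuming the NVIDIA go-nvml bindings (github.com/NVIDIA/go-nvml):

package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("nvml init: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	// Kernel-mode driver version installed on the host.
	driver, ret := nvml.SystemGetDriverVersion()
	if ret != nvml.SUCCESS {
		log.Fatalf("driver version: %v", nvml.ErrorString(ret))
	}

	// Highest CUDA version this driver supports (e.g. 12020 => 12.2).
	cudaVer, ret := nvml.SystemGetCudaDriverVersion()
	if ret != nvml.SUCCESS {
		log.Fatalf("cuda driver version: %v", nvml.ErrorString(ret))
	}

	fmt.Printf("host driver %s supports up to CUDA %d.%d\n",
		driver, cudaVer/1000, (cudaVer%1000)/10)
}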

A Simple CUDA Example

Here’s what happens when you run a basic CUDA program:

#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    float *d_data;
    size_t size = 1024 * sizeof(float);
    
    // This triggers the entire stack
    cudaError_t err = cudaMalloc(&d_data, size);
    
    if (err == cudaSuccess) {
        printf("Successfully allocated %zu bytes on GPU\n", size);
        cudaFree(d_data);
    }
    
    return 0;
}

Behind the scenes:

  1. cudaMalloc() calls cuMemAlloc() in libcuda.so
  2. libcuda.so opens /dev/nvidia0
  3. Issues ioctl() system call with NVIDIA_IOCTL_ALLOC_MEM
  4. Kernel driver nvidia.ko receives the request
  5. Driver checks cgroups: “Is this process allowed to access device 195:0?”
  6. If allowed, driver allocates GPU memory
  7. Returns device memory pointer to application

Layer 4: Kubernetes GPU Scheduling

The Device Plugin Framework

Kubernetes uses an extensible Device Plugin system to manage specialized hardware like GPUs, FPGAs, and InfiniBand adapters.

Architecture Overview

┌────────────────────────────────────────┐
│   kube-apiserver                       │
│   (Node status: nvidia.com/gpu: 4)    │
└───────────────┬────────────────────────┘
                │
                ↓
┌────────────────────────────────────────┐
│   kube-scheduler                       │
│   (Finds nodes with requested GPUs)   │
└───────────────┬────────────────────────┘
                │
                ↓
┌────────────────────────────────────────┐
│   kubelet (on GPU node)                │
│   - Discovers device plugins           │
│   - Tracks GPU allocation              │
│   - Calls Allocate() for pods          │
└───────────────┬────────────────────────┘
                │
                ↓
┌────────────────────────────────────────┐
│   NVIDIA Device Plugin (DaemonSet)     │
│   - Discovers GPUs (nvidia-smi)        │
│   - Registers with kubelet             │
│   - Allocates GPUs to containers       │
└────────────────────────────────────────┘

Device Plugin Discovery and Registration

The NVIDIA Device Plugin runs as a DaemonSet on every GPU node:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    spec:
      containers:
      - name: nvidia-device-plugin
        image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins

The Registration Process

  1. Device Plugin Starts
    nvidia-device-plugin container starts
               ↓
    Queries GPUs: nvidia-smi --query-gpu=uuid --format=csv
               ↓
    Discovers: GPU-a4f8c2d1, GPU-b3e9d4f2, GPU-c8f1a5b3, GPU-d2c7e9a4
    
  2. Registration with kubelet
    Device plugin connects to: unix:///var/lib/kubelet/device-plugins/kubelet.sock
               ↓
    Sends Register() gRPC call:
    {
      "version": "v1beta1",
      "endpoint": "nvidia.sock",
      "resourceName": "nvidia.com/gpu"
    }
    
  3. Advertising Resources
    kubelet calls ListAndWatch() on device plugin
               ↓
    Device plugin responds:
    {
      "devices": [
        {"id": "GPU-a4f8c2d1", "health": "Healthy"},
        {"id": "GPU-b3e9d4f2", "health": "Healthy"},
        {"id": "GPU-c8f1a5b3", "health": "Healthy"},
        {"id": "GPU-d2c7e9a4", "health": "Healthy"}
      ]
    }
               ↓
    kubelet updates node status:
    status.capacity.nvidia.com/gpu: "4"
    status.allocatable.nvidia.com/gpu: "4"
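
The registration half of this handshake is small enough to sketch directly. A minimal Go example against the v1beta1 device plugin API; error handling is trimmed, and the plugin's own gRPC server (serving ListAndWatch and Allocate on nvidia.sock) is assumed to be running already:

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Dial kubelet's registration socket: /var/lib/kubelet/device-plugins/kubelet.sock
	conn, err := grpc.DialContext(ctx, "unix://"+pluginapi.KubeletSocket,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial kubelet: %v", err)
	}
	defer conn.Close()

	// Tell kubelet which resource we serve and on which socket our own
	// device plugin gRPC server listens.
	client := pluginapi.NewRegistrationClient(conn)
	_, err = client.Register(ctx, &pluginapi.RegisterRequest{
		Version:      pluginapi.Version,
		Endpoint:     "nvidia.sock",
		ResourceName: "nvidia.com/gpu",
	})
	if err != nil {
		log.Fatalf("register: %v", err)
	}
	log.Println("registered nvidia.com/gpu with kubelet")
}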
    

Pod Scheduling Flow

Let’s trace a complete pod scheduling workflow:

Step 1: User Creates Pod

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 2  # Request 2 GPUs

Step 2: Scheduler Filters and Scores

kube-scheduler receives unscheduled pod
         ↓
Filtering Phase:
  - node-1: cpu OK, memory OK, nvidia.com/gpu=0 ✗ (no GPUs)
  - node-2: cpu OK, memory OK, nvidia.com/gpu=2 ✓
  - node-3: cpu OK, memory OK, nvidia.com/gpu=4 ✓
  - node-4: cpu ✗ (insufficient CPU)
         ↓
Scoring Phase:
  - node-2: score 85 (2 GPUs available, high utilization)
  - node-3: score 92 (4 GPUs available, moderate utilization)
         ↓
Selected: node-3
         ↓
Binding: pod assigned to node-3

Step 3: kubelet Allocates GPUs

kubelet on node-3 receives pod assignment
         ↓
For container "cuda-container" requesting 2 GPUs:
         ↓
kubelet calls: DevicePlugin.Allocate(deviceIds=["GPU-a4f8c2d1", "GPU-b3e9d4f2"])
         ↓
Device plugin responds:
{
  "containerResponses": [{
    "envs": {
      "NVIDIA_VISIBLE_DEVICES": "GPU-a4f8c2d1,GPU-b3e9d4f2"
    },
    "mounts": [{
      "hostPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.535.104.05",
      "containerPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.1",
      "readOnly": true
    }],
    "devices": [{
      "hostPath": "/dev/nvidia0",
      "containerPath": "/dev/nvidia0",
      "permissions": "rwm"
    }, {
      "hostPath": "/dev/nvidia1",
      "containerPath": "/dev/nvidia1",
      "permissions": "rwm"
    }]
  }]
}

Step 4: Container Runtime Provisions GPU

kubelet → containerd: CreateContainer with:
  - Environment: NVIDIA_VISIBLE_DEVICES=GPU-a4f8c2d1,GPU-b3e9d4f2
  - Mounts: driver libraries
  - Devices: /dev/nvidia0, /dev/nvidia1
         ↓
containerd calls: nvidia-container-runtime-hook (prestart)
         ↓
Hook configures:
  - Mounts all required device files
  - Mounts NVIDIA libraries
  - Sets up cgroups device controller
  - Configures environment variables
         ↓
Container starts with GPU access
         ↓
nvidia-smi inside container shows 2 GPUs

Layer 5: GPU Isolation & Visibility

The Magic of NVIDIA_VISIBLE_DEVICES

The NVIDIA_VISIBLE_DEVICES environment variable is the key to GPU isolation in containers. It controls which GPUs are visible to CUDA applications.

How It Works

Consider a host with 4 GPUs:

# On the host
$ nvidia-smi --query-gpu=index,uuid --format=csv
index, uuid
0, GPU-a4f8c2d1-e5f6-7a8b-9c0d-1e2f3a4b5c6d
1, GPU-b3e9d4f2-f6a7-8b9c-0d1e-2f3a4b5c6d7e
2, GPU-c8f1a5b3-a7b8-9c0d-1e2f-3a4b5c6d7e8f
3, GPU-d2c7e9a4-b8c9-0d1e-2f3a-4b5c6d7e8f9a

Container 1 configuration:

NVIDIA_VISIBLE_DEVICES=GPU-a4f8c2d1-e5f6-7a8b-9c0d-1e2f3a4b5c6d

# Inside container 1
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
+-------------------------------+----------------------+----------------------+

Container 2 configuration:

NVIDIA_VISIBLE_DEVICES=GPU-b3e9d4f2-f6a7-8b9c-0d1e-2f3a4b5c6d7e,GPU-c8f1a5b3-a7b8-9c0d-1e2f-3a4b5c6d7e8f

# Inside container 2
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1F.0 Off |                    0 |
|   1  Tesla V100-SXM2...  Off  | 00000000:00:20.0 Off |                    0 |
+-------------------------------+----------------------+----------------------+

Notice that:

  • Container 1 sees only 1 GPU (renumbered as GPU 0)
  • Container 2 sees 2 GPUs (renumbered as GPU 0 and 1)
  • Each container has its own isolated GPU namespace
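
You can confirm this renumbering from inside each container by listing the enumerated devices and their UUIDs. A sketch assuming the go-nvml bindings again:

package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("nvml init: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	// Only the GPUs exposed to this container are enumerated,
	// renumbered from index 0.
	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("device count: %v", nvml.ErrorString(ret))
	}
	for i := 0; i < count; i++ {
		dev, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			continue
		}
		uuid, _ := dev.GetUUID()
		fmt.Printf("container GPU %d -> %s\n", i, uuid)
	}
}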

Driver-Level Enforcement

When a CUDA application initializes:

cudaError_t err = cudaSetDevice(0);

The CUDA driver inside the container:

  1. Enumerates only the /dev/nvidia* device nodes that the container toolkit exposed (selected via NVIDIA_VISIBLE_DEVICES at container creation)
  2. Builds a container-local numbering starting at GPU 0, so the application's "GPU 0" maps to whichever physical GPU was assigned
  3. Honors CUDA_VISIBLE_DEVICES for any further application-level filtering
    

cgroups: Kernel-Level Protection

Environment variables provide application-level isolation, but cgroups enforce it at the kernel level.

For each container, cgroups device controller is configured:

Container 1:

# /sys/fs/cgroup/devices/kubepods/pod<uid>/<container-id>/devices.list
c 195:0 rwm      # Allow /dev/nvidia0 only
c 195:255 rwm    # Allow /dev/nvidiactl
c 509:0 rwm      # Allow /dev/nvidia-uvm

# Implicit deny for:
# c 195:1 (would be /dev/nvidia1)
# c 195:2 (would be /dev/nvidia2)
# c 195:3 (would be /dev/nvidia3)

Even if a malicious process inside Container 1 tries to open /dev/nvidia1, the kernel blocks it:

// Malicious code attempt
int fd = open("/dev/nvidia1", O_RDWR);
// Returns: -1 (EPERM - Operation not permitted)
// Kernel: cgroups device controller denied access

This provides defense-in-depth: both application-level (CUDA driver) and kernel-level (cgroups) isolation.
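
A small probe run inside the container makes both layers visible: device nodes that were never mounted simply don't exist, and anything the cgroup rules don't allow is rejected by the kernel. Standard-library Go only; the device paths are the ones discussed above:

package main

import (
	"fmt"
	"os"
)

func main() {
	// Try to open each possible GPU device node; the mount namespace and
	// the cgroup device rules together decide which attempts succeed.
	for i := 0; i < 4; i++ {
		path := fmt.Sprintf("/dev/nvidia%d", i)
		f, err := os.OpenFile(path, os.O_RDWR, 0)
		if err != nil {
			fmt.Printf("%s: denied (%v)\n", path, err)
			continue
		}
		fmt.Printf("%s: accessible\n", path)
		f.Close()
	}
}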


Layer 6: Complete Flow Example

Let’s trace a complete end-to-end flow from pod creation to CUDA memory allocation.

The Scenario

We’ll deploy a pod requesting 2 GPUs and run a simple CUDA program that allocates GPU memory.

Step 1: Deploy the Pod

apiVersion: v1
kind: Pod
metadata:
  name: cuda-mem-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-app
    image: nvidia/cuda:11.8.0-devel-ubuntu22.04
    command: ["./cuda_malloc_test"]
    resources:
      limits:
        nvidia.com/gpu: 2
$ kubectl apply -f cuda-pod.yaml
pod/cuda-mem-test created

Step 2: Scheduler Assignment

kube-scheduler watches for unscheduled pods
         ↓
Finds cuda-mem-test pod (status.phase: Pending)
         ↓
Queries all nodes for available resources:
  node-gpu-01: nvidia.com/gpu available: 0/4 (fully allocated)
  node-gpu-02: nvidia.com/gpu available: 2/4 ✓
  node-gpu-03: nvidia.com/gpu available: 4/4 ✓
         ↓
Applies scoring algorithms:
  node-gpu-02: score 75 (50% GPU utilization)
  node-gpu-03: score 90 (0% GPU utilization, better choice)
         ↓
Selects node-gpu-03
         ↓
Creates binding: pod cuda-mem-test → node-gpu-03
         ↓
Updates pod: status.nodeName: node-gpu-03

Step 3: kubelet Provisions Container

kubelet on node-gpu-03 receives pod assignment
         ↓
Examines resource requests: nvidia.com/gpu: 2
         ↓
Calls device plugin's Allocate() via gRPC:
{
  "containerRequests": [{
    "devicesIDs": ["GPU-uuid-1234", "GPU-uuid-5678"]
  }]
}
         ↓
Device plugin responds:
{
  "containerResponses": [{
    "devices": [
      {"hostPath": "/dev/nvidia0", "containerPath": "/dev/nvidia0"},
      {"hostPath": "/dev/nvidia1", "containerPath": "/dev/nvidia1"},
      {"hostPath": "/dev/nvidiactl", "containerPath": "/dev/nvidiactl"},
      {"hostPath": "/dev/nvidia-uvm", "containerPath": "/dev/nvidia-uvm"}
    ],
    "mounts": [
      {
        "hostPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.535.104.05",
        "containerPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.1"
      },
      {
        "hostPath": "/usr/bin/nvidia-smi",
        "containerPath": "/usr/bin/nvidia-smi"
      }
      // ... more libraries
    ],
    "envs": {
      "NVIDIA_VISIBLE_DEVICES": "GPU-uuid-1234,GPU-uuid-5678",
      "NVIDIA_DRIVER_CAPABILITIES": "compute,utility"
    }
  }]
}

Step 4: Container Runtime Configuration

kubelet → containerd CRI: CreateContainer
         ↓
containerd creates OCI spec:
{
  "linux": {
    "devices": [
      {"path": "/dev/nvidia0", "type": "c", "major": 195, "minor": 0},
      {"path": "/dev/nvidia1", "type": "c", "major": 195, "minor": 1},
      {"path": "/dev/nvidiactl", "type": "c", "major": 195, "minor": 255},
      {"path": "/dev/nvidia-uvm", "type": "c", "major": 509, "minor": 0}
    ],
    "resources": {
      "devices": [
        {"allow": false, "access": "rwm"},  // Deny all by default
        {"allow": true, "type": "c", "major": 195, "minor": 0, "access": "rwm"},
        {"allow": true, "type": "c", "major": 195, "minor": 1, "access": "rwm"},
        {"allow": true, "type": "c", "major": 195, "minor": 255, "access": "rwm"},
        {"allow": true, "type": "c", "major": 509, "minor": 0, "access": "rwm"}
      ]
    }
  },
  "mounts": [...],
  "process": {
    "env": [
      "NVIDIA_VISIBLE_DEVICES=GPU-uuid-1234,GPU-uuid-5678",
      "NVIDIA_DRIVER_CAPABILITIES=compute,utility"
    ]
  }
}
         ↓
containerd calls runc with nvidia-container-runtime-hook
         ↓
Hook performs final configuration and mounts
         ↓
Container starts

Step 5: CUDA Application Runs

Inside the container, our CUDA application executes:

#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    printf("Visible GPUs: %d\n", deviceCount);
    
    for (int i = 0; i < deviceCount; i++) {
        cudaSetDevice(i);
        
        float *d_data;
        size_t size = 1024 * 1024 * 1024;  // 1 GB
        
        cudaError_t err = cudaMalloc(&d_data, size);
        if (err == cudaSuccess) {
            printf("GPU %d: Allocated 1 GB\n", i);
            cudaFree(d_data);
        }
    }
    
    return 0;
}

The execution flow:

Application calls: cudaGetDeviceCount(&deviceCount)
         ↓
CUDA Runtime (libcudart.so): cuDeviceGetCount()
         ↓
CUDA Driver (libcuda.so):
  - Enumerates the device nodes exposed to the container
    (GPU-uuid-1234 and GPU-uuid-5678, selected via NVIDIA_VISIBLE_DEVICES)
  - Returns: deviceCount = 2
         ↓
Application prints: "Visible GPUs: 2"
         ↓
Application calls: cudaMalloc(&d_data, 1GB) for GPU 0
         ↓
CUDA Runtime: cuMemAlloc(1073741824)  // 1 GB in bytes
         ↓
CUDA Driver:
  - Determines physical GPU from NVIDIA_VISIBLE_DEVICES mapping
  - Virtual GPU 0 → Physical GPU-uuid-1234 → /dev/nvidia0
  - Opens file descriptor: fd = open("/dev/nvidia0", O_RDWR)
         ↓
Kernel checks cgroups:
  - Process in cgroup: /kubepods/pod-xyz/container-abc
  - Requested device: major=195, minor=0
  - cgroups device allowlist: c 195:0 rwm ✓ ALLOWED
         ↓
Kernel forwards to nvidia.ko driver
         ↓
nvidia.ko driver:
  - Allocates 1 GB of GPU memory on physical GPU
  - Programs GPU memory controller
  - Returns device memory address: 0x7f8c40000000
         ↓
CUDA Driver returns to application
         ↓
Application prints: "GPU 0: Allocated 1 GB"
         ↓
[Repeat for GPU 1 with /dev/nvidia1]
         ↓
Application prints: "GPU 1: Allocated 1 GB"

System calls involved:

# Traced with strace
openat(AT_FDCWD, "/dev/nvidia0", O_RDWR) = 3
ioctl(3, NVIDIA_IOC_QUERY_DEVICE_CLASS, ...) = 0
ioctl(3, NVIDIA_IOC_CARD_INFO, ...) = 0
ioctl(3, NVIDIA_IOC_ALLOC_MEM, {size=1073741824, ...}) = 0
# ... GPU memory now allocated ...
ioctl(3, NVIDIA_IOC_FREE_MEM, ...) = 0
close(3) = 0

Layer 7: Advanced GPU Sharing

Modern GPU workloads often don’t need an entire GPU. Several technologies enable GPU sharing:

Multi-Instance GPU (MIG)

NVIDIA A100 and H100 GPUs support hardware-level partitioning into Multiple Instances.

MIG Architecture

A single A100 GPU can be divided into up to 7 instances:

Physical A100 (40GB)
├─ MIG Instance 0: 3g.20gb (3 compute slices, 20GB memory)
├─ MIG Instance 1: 2g.10gb (2 compute slices, 10GB memory)
├─ MIG Instance 2: 1g.5gb  (1 compute slice, 5GB memory)
└─ MIG Instance 3: 1g.5gb  (1 compute slice, 5GB memory)

Each MIG instance:

  • Has dedicated compute resources (streaming multiprocessors)
  • Has dedicated memory partition
  • Provides hardware-level isolation
  • Appears as a separate GPU device

MIG Device Files

# Enable MIG mode
$ nvidia-smi -i 0 -mig 1

# Create MIG instances
$ nvidia-smi mig -cgi 3g.20gb -C
$ nvidia-smi mig -cgi 2g.10gb -C
$ nvidia-smi mig -cgi 1g.5gb -C
$ nvidia-smi mig -cgi 1g.5gb -C

# New device files appear

$ ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Oct 23 09:00 /dev/nvidia0          # Parent GPU
crw-rw-rw- 1 root root 195, 255 Oct 23 09:00 /dev/nvidiactl
crw-rw-rw- 1 root root 509,   0 Oct 23 09:00 /dev/nvidia-uvm

# MIG device files
crw-rw-rw- 1 root root 195,   1 Oct 23 09:00 /dev/nvidia0mig0      # 3g.20gb
crw-rw-rw- 1 root root 195,   2 Oct 23 09:00 /dev/nvidia0mig1      # 2g.10gb
crw-rw-rw- 1 root root 195,   3 Oct 23 09:00 /dev/nvidia0mig2      # First 1g.5gb
crw-rw-rw- 1 root root 195,   4 Oct 23 09:00 /dev/nvidia0mig3      # Second 1g.5gb

MIG in Kubernetes

The NVIDIA Device Plugin discovers MIG instances and advertises them as separate resources:

apiVersion: v1
kind: Node
status:
  capacity:
    nvidia.com/mig-3g.20gb: "1"
    nvidia.com/mig-2g.10gb: "1"
    nvidia.com/mig-1g.5gb: "2"
  allocatable:
    nvidia.com/mig-3g.20gb: "1"
    nvidia.com/mig-2g.10gb: "1"
    nvidia.com/mig-1g.5gb: "2"
#Pods can request specific MIG profiles:
apiVersion: v1
kind: Pod
metadata:
  name: mig-pod
spec:
  containers:
  - name: cuda-app
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/mig-3g.20gb: 1  # Request one 3g.20gb instance

MIG benefits

  • True hardware isolation (unlike time-slicing)
  • Guaranteed memory allocation
  • Fault isolation (one instance failure doesn’t affect others)
  • Quality of Service (QoS) guarantees

MIG Trade-offs

  • Partitioning is limited to the profiles the device supports, so you have less control over the exact GPU partitioning layout

GPU Time-Slicing

For workloads that don’t require full GPU utilization, time-slicing allows multiple containers to share a single GPU.

Device Plugin ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        replicas: 4
        renameByDefault: false
        failRequestsGreaterThanOne: true
    resources:
      - name: nvidia.com/gpu
        devices: all

With this configuration:

  • Each physical GPU appears as 4 schedulable resources
  • Kubernetes can schedule 4 pods per GPU
  • All pods access the same physical GPU

How Time-Slicing Works

Pod 1 Container
     ↓
NVIDIA_VISIBLE_DEVICES=GPU-0
     ↓
cudaMalloc() → /dev/nvidia0

Pod 2 Container
     ↓
NVIDIA_VISIBLE_DEVICES=GPU-0  # Same GPU!
     ↓
cudaMalloc() → /dev/nvidia0

Pod 3 Container
     ↓
NVIDIA_VISIBLE_DEVICES=GPU-0  # Same GPU!
     ↓
cudaMalloc() → /dev/nvidia0

All containers:

  1. See the same GPU device
  2. Create separate CUDA contexts
  3. GPU hardware time-multiplexes between contexts
  4. No memory isolation (pods can see each other’s allocations!)

Time-slicing characteristics:

  • Pros:
    • Easy to configure
    • Works with any GPU
    • Higher utilization for bursty workloads
  • Cons:
    • No memory isolation (security risk)
    • No performance guarantees
    • One container can starve others
    • OOM on one container affects all

Best for:

  • Development/testing environments
  • Interactive workloads (Jupyter notebooks)
  • Bursty inference workloads with low duty cycle

vGPU (Virtual GPU)

NVIDIA vGPU technology provides software-defined GPU sharing with:

  • Hypervisor-level virtualization
  • Memory isolation between VMs
  • QoS policies and scheduling
  • Live migration support

Hypervisor (VMware vSphere / KVM)
├─ VM 1: vGPU (4GB, 1/4 GPU compute)
├─ VM 2: vGPU (4GB, 1/4 GPU compute)
├─ VM 3: vGPU (8GB, 1/2 GPU compute)
└─ Physical GPU (16GB total)

Each vGPU appears as a complete GPU to the guest OS, enabling standard CUDA applications to run without modification. Kata Containers can be used to bring vGPU-backed virtual machines into a Kubernetes cluster.

Note: vGPU requires an NVIDIA vGPU software license.

Comparison Matrix

Technology     Isolation   Memory      Performance   Flexibility   Use Case
------------   ---------   ---------   -----------   -----------   -----------------------
Full GPU       Hardware    Dedicated   100%          Low           Training, HPC
MIG            Hardware    Dedicated   Guaranteed    Medium        Inference, Multi-tenant
Time-Slicing   None        Shared      Variable      High          Dev/Test, Jupyter
vGPU           Software    Isolated    Good          High          VDI, Cloud VMs

The Container Device Interface (CDI) Revolution

In 2023-2024, the container ecosystem began transitioning to the Container Device Interface (CDI)—a standardized specification that fundamentally changes how devices are exposed to containers.

The Problem CDI Solves

The Old Way: Vendor-Specific Runtime Hooks

Before CDI, each hardware vendor needed custom integration:

┌─────────────────────────────────────────┐
│   Container Runtime (containerd)        │
└─────────────┬───────────────────────────┘
              │
              ↓
┌─────────────────────────────────────────┐
│   nvidia-container-runtime (wrapper)    │  ← NVIDIA-specific
└─────────────┬───────────────────────────┘
              │
              ↓
┌─────────────────────────────────────────┐
│   nvidia-container-runtime-hook         │  ← Vendor logic
└─────────────┬───────────────────────────┘
              │
              ↓
┌─────────────────────────────────────────┐
│   nvidia-container-cli                  │  ← Device provisioning
└─────────────────────────────────────────┘

Problems:

  • Vendor Lock-in: AMD needed rocm-container-runtime, Intel their own
  • Runtime Coupling: Required wrapping or modifying the container runtime
  • Complex Integration: Each vendor’s device plugin needed runtime-specific knowledge
  • No Standardization: Every vendor solved the problem differently

The New Way: Declarative Device Specifications

CDI provides a vendor-neutral, declarative JSON/YAML specification:

# /etc/cdi/nvidia.yaml
cdiVersion: "0.6.0"
kind: nvidia.com/gpu
devices:
  - name: "0"
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
          type: c
          major: 195
          minor: 0
        - path: /dev/nvidiactl
          type: c
          major: 195
          minor: 255
        - path: /dev/nvidia-uvm
          type: c
          major: 509
          minor: 0
      mounts:
        - hostPath: /usr/lib/x86_64-linux-gnu/libcuda.so.535.104.05
          containerPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
          options: ["ro", "nosuid", "nodev", "bind"]
        - hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.104.05
          containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
          options: ["ro", "nosuid", "nodev", "bind"]
        - hostPath: /usr/bin/nvidia-smi
          containerPath: /usr/bin/nvidia-smi
          options: ["ro", "nosuid", "nodev", "bind"]
      env:
        - "NVIDIA_VISIBLE_DEVICES=0"
        - "NVIDIA_DRIVER_CAPABILITIES=compute,utility"
      hooks:
        - hookName: createContainer
          path: /usr/bin/nvidia-ctk
          args: ["hook", "update-ldcache"]
          
  - name: "1"
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia1
          type: c
          major: 195
          minor: 1
        - path: /dev/nvidiactl
          type: c
          major: 195
          minor: 255
        - path: /dev/nvidia-uvm
          type: c
          major: 509
          minor: 0
      mounts:
        # ... same libraries ...
      env:
        - "NVIDIA_VISIBLE_DEVICES=1"
        - "NVIDIA_DRIVER_CAPABILITIES=compute,utility"

CDI Architecture

┌──────────────────────────────────────────┐
│   Container Orchestrator                 │
│   (Kubernetes, Podman, Docker)           │
└─────────────┬────────────────────────────┘
              │ Request: "nvidia.com/gpu=0"
              ↓
┌──────────────────────────────────────────┐
│   Container Runtime                      │
│   (containerd, CRI-O, Docker)            │
│   + Native CDI Support                   │
└─────────────┬────────────────────────────┘
              │ Reads CDI specs from disk
              ↓
┌──────────────────────────────────────────┐
│   CDI Specification Files                │
│   /etc/cdi/*.yaml                        │
│   /var/run/cdi/*.json                    │
└─────────────┬────────────────────────────┘
              │ Describes device configuration
              ↓
┌──────────────────────────────────────────┐
│   Host System Resources                  │
│   - Device nodes (/dev/nvidia*)          │
│   - Libraries (libcuda.so, etc.)         │
│   - Utilities (nvidia-smi)               │
└──────────────────────────────────────────┘

CDI Specification Structure

A CDI spec file contains three main sections:

  1. Device Definitions
  2. Container Edits
  3. Metadata

cdiVersion: "0.6.0"           # CDI specification version
kind: nvidia.com/gpu          # Fully-qualified device kind
                              # Format: vendor.com/device-type

The kind follows a domain name pattern to prevent collisions:

  • nvidia.com/gpu
  • amd.com/gpu
  • intel.com/gpu
  • xilinx.com/fpga
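
A fully-qualified CDI device name has the form vendor.com/class=name. A hand-rolled Go parse for illustration only; the real CDI Go packages perform stricter validation:

package main

import (
	"fmt"
	"strings"
)

// parseQualifiedName splits "vendor.com/class=name" into its parts.
func parseQualifiedName(device string) (vendor, class, name string, err error) {
	kind, name, ok := strings.Cut(device, "=")
	if !ok || name == "" {
		return "", "", "", fmt.Errorf("missing device name in %q", device)
	}
	vendor, class, ok = strings.Cut(kind, "/")
	if !ok || vendor == "" || class == "" {
		return "", "", "", fmt.Errorf("invalid kind in %q", device)
	}
	return vendor, class, name, nil
}

func main() {
	vendor, class, name, err := parseQualifiedName("nvidia.com/gpu=0")
	if err != nil {
		panic(err)
	}
	fmt.Printf("vendor=%s class=%s device=%s\n", vendor, class, name)
}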

CDI vs Traditional Flow Comparison

Traditional NVIDIA Container Toolkit Flow

1. User runs container:
   docker run --gpus all nvidia/cuda
         ↓
2. Docker daemon calls nvidia-container-runtime
         ↓
3. nvidia-container-runtime wraps runc
         ↓
4. Prestart hook executes: nvidia-container-runtime-hook
         ↓
5. Hook reads --gpus flag and NVIDIA_VISIBLE_DEVICES
         ↓
6. nvidia-container-cli dynamically queries nvidia-smi
         ↓
7. Determines required devices, libraries, mounts
         ↓
8. Modifies OCI spec on-the-fly (adds devices, mounts, env)
         ↓
9. runc creates container with GPU access

Characteristics:

  • Dynamic device discovery at container start
  • Runtime wrapper required
  • Vendor-specific magic in environment variables
  • Black box: hard to inspect what’s being configured

CDI-Based Flow

1. One-time setup (on node):
   nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
         ↓
2. User runs container:
   docker run --device nvidia.com/gpu=0 nvidia/cuda
         ↓
3. containerd (with native CDI support) receives request
         ↓
4. Parses CDI device name: "nvidia.com/gpu=0"
         ↓
5. Looks up device in /etc/cdi/nvidia.yaml
         ↓
6. Reads containerEdits for device "0"
         ↓
7. Applies edits to OCI spec:
   - Adds device nodes
   - Adds mounts
   - Sets environment variables
   - Registers hooks
         ↓
8. runc creates container with GPU access

Characteristics:

  • Static device specification (generated once)
  • No runtime wrapper needed
  • Standard OCI runtime (runc) works unmodified
  • Transparent: inspect CDI specs to see exact configuration
  • Vendor provides only CDI spec generator

CDI in Kubernetes

In Kubernetes, the device plugin is responsible for adopting CDI. Compare its Allocate() implementation before and after:

Pre-CDI Device Plugin

func (m *NvidiaDevicePlugin) Allocate(
    req *pluginapi.AllocateRequest,
) (*pluginapi.AllocateResponse, error) {
    responses := pluginapi.AllocateResponse{}
    
    for _, request := range req.ContainerRequests {
        // Device plugin must know HOW to provision GPU
        response := pluginapi.ContainerAllocateResponse{
            Envs: map[string]string{
                "NVIDIA_VISIBLE_DEVICES": "GPU-uuid-1234",
            },
            Mounts: []*pluginapi.Mount{
                {
                    HostPath: "/usr/lib/x86_64-linux-gnu/libcuda.so",
                    ContainerPath: "/usr/lib/x86_64-linux-gnu/libcuda.so",
                    ReadOnly: true,
                },
                // ... many more mounts ...
            },
            Devices: []*pluginapi.DeviceSpec{
                {
                    HostPath: "/dev/nvidia0",
                    ContainerPath: "/dev/nvidia0",
                    Permissions: "rwm",
                },
                {
                    HostPath: "/dev/nvidiactl",
                    ContainerPath: "/dev/nvidiactl",
                    Permissions: "rwm",
                },
                // ... more devices ...
            },
        }
        responses.ContainerResponses = append(
            responses.ContainerResponses, 
            &response,
        )
    }
    
    return &responses, nil
}

Post-CDI Device Plugin

func (m *NvidiaDevicePlugin) Allocate(
    req *pluginapi.AllocateRequest,
) (*pluginapi.AllocateResponse, error) {
    responses := pluginapi.AllocateResponse{}
    
    for _, request := range req.ContainerRequests {
        // Device plugin just returns CDI device names!
        var cdiDevices []string
        for _, deviceID := range request.DevicesIDs {
            cdiDevices = append(
                cdiDevices,
                fmt.Sprintf("nvidia.com/gpu=%s", deviceID),
            )
        }
        
        response := pluginapi.ContainerAllocateResponse{
            CDIDevices: cdiDevices,  // That's it!
        }
        responses.ContainerResponses = append(
            responses.ContainerResponses,
            &response,
        )
    }
    
    return &responses, nil
}

Key simplification: The device plugin no longer needs vendor-specific knowledge about mounts, device nodes, or environment variables. It simply returns CDI device identifiers.

Container Runtime Integration

When kubelet creates a container with CDI devices:

kubelet receives CDI device names from device plugin:
  ["nvidia.com/gpu=0", "nvidia.com/gpu=1"]
         ↓
kubelet adds CDI annotation to container config:
  annotations: {
    "cdi.k8s.io/devices": "nvidia.com/gpu=0,nvidia.com/gpu=1"
  }
         ↓
kubelet → containerd CRI: CreateContainer
         ↓
containerd reads CDI annotation
         ↓
containerd loads CDI registry from /etc/cdi/*.yaml
         ↓
For each CDI device:
  registry.GetDevice("nvidia.com/gpu=0")
  registry.GetDevice("nvidia.com/gpu=1")
         ↓
Applies container edits to OCI spec:
  - Merges all device nodes
  - Merges all mounts
  - Merges all environment variables
  - Collects all hooks
         ↓
Creates final OCI spec and calls runc
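
If you want to see what a runtime will apply for a given device, you can load a CDI spec yourself. A sketch that unmarshals only the fields used in this article, assuming gopkg.in/yaml.v3 for parsing; real runtimes use the CNCF container-device-interface library and its spec cache instead:

package main

import (
	"fmt"
	"log"
	"os"

	"gopkg.in/yaml.v3"
)

// Minimal mirror of the CDI spec fields shown in this article.
type cdiSpec struct {
	CDIVersion string `yaml:"cdiVersion"`
	Kind       string `yaml:"kind"`
	Devices    []struct {
		Name           string `yaml:"name"`
		ContainerEdits struct {
			DeviceNodes []struct {
				Path string `yaml:"path"`
			} `yaml:"deviceNodes"`
			Env []string `yaml:"env"`
		} `yaml:"containerEdits"`
	} `yaml:"devices"`
}

func main() {
	raw, err := os.ReadFile("/etc/cdi/nvidia.yaml")
	if err != nil {
		log.Fatalf("read CDI spec: %v", err)
	}
	var spec cdiSpec
	if err := yaml.Unmarshal(raw, &spec); err != nil {
		log.Fatalf("parse CDI spec: %v", err)
	}
	for _, dev := range spec.Devices {
		fmt.Printf("%s=%s:\n", spec.Kind, dev.Name)
		for _, node := range dev.ContainerEdits.DeviceNodes {
			fmt.Printf("  device node %s\n", node.Path)
		}
		for _, env := range dev.ContainerEdits.Env {
			fmt.Printf("  env %s\n", env)
		}
	}
}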

Generating CDI Specifications

NVIDIA Container Toolkit

# Basic generation
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# With custom options
nvidia-ctk cdi generate \
  --output=/etc/cdi/nvidia.yaml \
  --format=yaml \
  --device-name-strategy=index \
  --driver-root=/ \
  --nvidia-ctk-path=/usr/bin/nvidia-ctk \
  --ldcache-path=/etc/ld.so.cache

AMD ROCm

rocm-smi --showdriverversion
rocm-cdi-generator --output=/etc/cdi/amd.yaml

Conclusion

Let's summarize the GPU container enablement flow.

Architecture Components

  1. Kubernetes Scheduler - Selects nodes with GPU resources
  2. NVIDIA Device Plugin - Discovers and advertises GPU devices
  3. Kubelet - Manages pod lifecycle
  4. Container Runtime (containerd) - Creates containers
  5. NVIDIA Container Toolkit - Provides GPU access hooks
  6. GPU Hardware Layer - Physical NVIDIA GPUs and drivers

Key Components

GPU Device Plugin

  • Discovers GPU resources and advertises to Kubernetes
  • Manages GPU allocation to pods (DaemonSet)

Kubelet

  • Node agent managing pod lifecycle
  • Communicates with device plugins and container runtime

Container Runtime (containerd)

  • Creates containers and integrates with NVIDIA Container Toolkit
  • Mounts GPU devices into containers

NVIDIA Container Toolkit

  • Runtime hook for GPU container creation
  • Handles device mounting and driver access