NVIDIA AI Releases Dynamo Snapshot: A CRIU-Based Fast Startup System for AI Inference on Kubernetes
NVIDIA's AI research team introduces Dynamo Snapshot, a checkpoint/restore approach to reduce AI inference startup time on Kubernetes from minutes to seconds.

In production AI inference deployments, demand for compute fluctuates over time, so inference replicas must scale elastically. However, cold-starting an inference workload on Kubernetes can take several minutes, during which GPUs are allocated but idle, generating no tokens and serving no requests. A 'cold start' is the full sequence a model server must complete before it can serve any request: pulling the container image, loading model weights into GPU memory, warming up CUDA kernels, compiling or capturing CUDA graphs, and registering with the service discovery layer.
The cold-start latency for a single-GPU vLLM (v0.20.0) workload breaks down into three segments: container/image pull, engine initialization (weight loading, kernel warmup, graph compilation), and distributed runtime startup. To cut this latency, NVIDIA's AI research team has introduced NVIDIA Dynamo Snapshot, a checkpoint/restore approach for AI inference workloads on Kubernetes. Dynamo Snapshot combines two tools: cuda-checkpoint, which dumps GPU device state into CPU memory, and CRIU (Checkpoint/Restore in Userspace), which serializes the process tree's state to disk.
A checkpoint captures two kinds of state: device state (GPU-side) and host state (CPU-side). Device state includes CUDA contexts, streams, device memory, and virtual address mappings; host state includes CPU memory, threads, file descriptors, and namespaces. At checkpoint time the two tools run in a fixed order: cuda-checkpoint first dumps all device state into CPU memory, then CRIU dumps the host-side state of the process tree, which now includes the saved device state, to a directory in storage.
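The ordering described above can be sketched as a small Python helper that only builds the two command lines, without running them. This is an illustration, not the snapshot agent's actual invocation: the article does not list the options Dynamo Snapshot passes, so the flags below are just the basic documented forms (`cuda-checkpoint --toggle --pid` and `criu dump --tree ... --images-dir ...`), and the function name, PID, and directory are hypothetical.

```python
import shlex


def checkpoint_commands(pid: int, images_dir: str) -> list:
    """Return the two checkpoint commands in the order the article describes.

    Assumption: basic CLI forms only; a real container checkpoint needs
    additional CRIU options that the article does not specify.
    """
    return [
        # Step 1: move the process's CUDA device state into host (CPU) memory.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Step 2: dump the host-side process tree state to a directory.
        ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir],
    ]


if __name__ == "__main__":
    # Hypothetical PID and path, printed for inspection rather than executed.
    for cmd in checkpoint_commands(4242, "/var/lib/checkpoints/demo"):
        print(shlex.join(cmd))
```

Building the commands as argument lists rather than shell strings keeps the order explicit and avoids quoting problems if the helper is later passed to `subprocess.run`.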
The snapshot agent, a privileged DaemonSet, handles checkpoint and restore for runc-managed containers without requiring modifications to runc itself. The agent waits for the workload's readiness probe, invokes cuda-checkpoint and CRIU from the host side, and writes the artifact to shared storage. NVIDIA provides a Helm chart to install the snapshot-agent DaemonSet.
The DynamoCheckpoint custom resource defines which model configuration to checkpoint, and the DynamoGraphDeployment CR references the checkpoint for restore. The team also developed quiesce/resume hooks to handle coordination before checkpointing and after restoration. The worker writes a 'ready for checkpoint' signal file after engine initialization but before distributed runtime startup, then enters a polling loop waiting for a 'restore complete' signal file while the snapshot agent checkpoints it externally.
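The signal-file handshake can be illustrated with a minimal, single-process simulation. Everything named here is an assumption made for the sketch: the file names, the `worker` and `agent` functions, and the use of a thread to stand in for the external snapshot agent. In the real system the worker is frozen by the checkpoint and it is the restored copy that observes the 'restore complete' file; the sketch only shows the ordering of the two signals.

```python
import tempfile
import threading
import time
from pathlib import Path

READY_FILE = "ready-for-checkpoint"   # hypothetical file name
RESTORED_FILE = "restore-complete"    # hypothetical file name


def wait_for(path: Path, poll_s: float = 0.01, timeout_s: float = 5.0) -> None:
    """Poll until the file exists, or raise after the timeout."""
    deadline = time.monotonic() + timeout_s
    while not path.exists():
        if time.monotonic() > deadline:
            raise TimeoutError(f"timed out waiting for {path.name}")
        time.sleep(poll_s)


def worker(signal_dir: Path, log: list) -> None:
    log.append("engine initialized")           # weights loaded, kernels warm
    (signal_dir / READY_FILE).touch()           # tell the agent we are ready
    wait_for(signal_dir / RESTORED_FILE)        # polling loop from the article
    log.append("distributed runtime started")   # only runs after restore


def agent(signal_dir: Path, log: list) -> None:
    wait_for(signal_dir / READY_FILE)
    log.append("checkpoint taken")              # stand-in for the real dump
    (signal_dir / RESTORED_FILE).touch()        # written after a restore


def run_demo() -> list:
    log = []
    with tempfile.TemporaryDirectory() as tmp:
        signal_dir = Path(tmp)
        t = threading.Thread(target=agent, args=(signal_dir, log))
        t.start()
        worker(signal_dir, log)
        t.join()
    return log


if __name__ == "__main__":
    print(run_demo())
```

Placing the ready signal after engine initialization but before distributed runtime startup means the expensive work is captured in the checkpoint, while per-replica registration happens fresh on every restore.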
After measuring peak GPU memory usage, inference engines allocate the remaining GPU memory as a large KV cache buffer. Because the checkpoint includes device memory, this buffer would otherwise be written into the artifact as well. To reduce the checkpoint size, the team allocates the KV cache through the CUDA Virtual Memory Management API and frees the underlying physical allocation while keeping the virtual address range intact. NVIDIA reports that for Qwen3-0.6B on a B200, this reduces the total artifact size from ~190 GiB to ~6 GiB.
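To put the reported figures in proportion, the short calculation below works out the savings. The two sizes are the approximate values quoted in the article; the function name is only illustrative.

```python
def checkpoint_savings(before_gib: float, after_gib: float) -> dict:
    """Compute absolute and relative savings between two artifact sizes."""
    saved = before_gib - after_gib
    return {
        "saved_gib": saved,                               # GiB no longer written
        "reduction_pct": 100.0 * saved / before_gib,      # share of original size
        "shrink_factor": before_gib / after_gib,          # how many times smaller
    }


# Approximate figures reported for Qwen3-0.6B on a B200.
stats = checkpoint_savings(before_gib=190, after_gib=6)

if __name__ == "__main__":
    print(f"saved: ~{stats['saved_gib']:.0f} GiB")
    print(f"reduction: ~{stats['reduction_pct']:.1f}%")
    print(f"artifact is ~{stats['shrink_factor']:.1f}x smaller")
```

For a model this small, nearly all of the original artifact is KV cache space rather than weights or runtime state, which is why releasing its physical backing has such a large effect.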
Source: MarkTechPost