Fine-Tuning NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation
NVIDIA's large-scale world model, Cosmos Predict 2.5, can generate physically plausible videos conditioned on text, images, or video clips, and can be fine-tuned with LoRA or DoRA for specific domains like robot manipulation.

NVIDIA Cosmos Predict 2.5 is a large-scale world model capable of generating physically plausible videos conditioned on text, images, or video clips. To adapt it to a specific domain, such as robot manipulation or a particular camera viewpoint, teams still need targeted fine-tuning.

Training robot policies requires demonstration data, but collecting real-robot trajectories is slow and expensive. Generating synthetic trajectories with a fine-tuned video world model offers a scalable alternative.
However, full fine-tuning of a 2B-parameter model is expensive and risks catastrophic forgetting of general knowledge. LoRA and DoRA inject small trainable adapter modules into the frozen base model, reducing memory requirements while keeping the adapter files small and portable. This makes it practical to fine-tune on a single GPU and flexibly swap adapters for different domains at inference.

This guide walks through parameter-efficient fine-tuning of Cosmos Predict 2.5 with LoRA and DoRA, using the diffusers and accelerate libraries with support for both single- and multi-GPU training.
We then show how to use the fine-tuned model to generate synthetic robot trajectories for downstream robot learning tasks.

After installing diffusers, navigate to examples/cosmos to explore the example code. We use the same datasets as the GR00T Dreams post-training recipe. Download and preprocess the training and test datasets using download_and_preprocess_datasets.sh.
The resulting training dataset folder looks like this: ... The eval dataset is a flat directory of paired .txt and .png files, one pair per (prompt, image) sample.

In this section, we walk through the implementation in train_cosmos_predict25_lora.py. VideoDataset loads each sample as a (caption, video) pair from args.train_data_dir (gr1_dataset/train in our example).
For videos longer than args.num_frames, it samples a random contiguous window of args.num_frames frames each epoch, which acts as temporal augmentation. Internally, VideoProcessor from diffusers.video_processor resizes and normalizes the raw frames into a tensor of shape (channels, frames, height, width). Cosmos Predict 2.5 consists of three submodules: ...

During training, all VAE, text encoder, and DiT weights are frozen.
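The random-window sampling that VideoDataset applies to long clips can be sketched in a few lines. This is a minimal illustration, not the code from train_cosmos_predict25_lora.py: the helper name sample_window, the example numbers, and the choice to return the whole clip when it is not longer than num_frames are all assumptions made for the example.

```python
import random


def sample_window(total_frames, num_frames, rng=random):
    """Pick a random contiguous [start, end) window of num_frames frames.

    Hypothetical helper for illustration; the real training script may
    handle short clips differently (for example by padding or skipping).
    """
    if total_frames <= num_frames:
        # Assumption: a clip that is not longer than the window is used whole.
        return 0, total_frames
    # randint is inclusive on both ends, so the last valid start is allowed.
    start = rng.randint(0, total_frames - num_frames)
    return start, start + num_frames


# Example with hypothetical numbers: a 120-frame clip and a 16-frame window.
frames = list(range(120))  # stand-in for decoded video frames
start, end = sample_window(len(frames), 16, random.Random(0))
window = frames[start:end]  # 16 consecutive frames
```

Because a new start offset is drawn each time a sample is requested, the model sees a different slice of the same video across epochs.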
LoRA adapters are injected into the DiT's attention projections (to_q, to_k, to_v, to_out.0) and feedforward layers (ff.net.0.proj, ff.net.2). The trainable LoRA parameters are then upcast to float32 for numerical stability under bf16 mixed precision. Passing use_dora=True switches to DoRA, which decomposes each adapted weight into a magnitude and a direction and applies the low-rank update to the direction.
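To make the difference between the two adapter types concrete, here is a small pure-Python sketch of how the effective weight could be formed in each case. It is an illustration under stated assumptions rather than the library's implementation: the function names are hypothetical, the matrices are tiny toy values, and the norm is taken per output row, which real implementations may compute differently.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_norms(w):
    """L2 norm of each row (assumed here: one row per output feature)."""
    return [math.sqrt(sum(v * v for v in row)) for row in w]


def lora_weight(w0, lora_b, lora_a, scale=1.0):
    """LoRA: frozen weight plus a scaled low-rank update B @ A."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(w0, delta)]


def dora_weight(w0, lora_b, lora_a, magnitude, scale=1.0):
    """DoRA sketch: normalise (W0 + B @ A) per row, then rescale each row
    by a separately learned magnitude."""
    merged = lora_weight(w0, lora_b, lora_a, scale)
    norms = row_norms(merged)
    return [[m * v / n for v in row] for row, m, n in zip(merged, magnitude, norms)]


w0 = [[3.0, 4.0], [0.0, 2.0]]   # frozen base weight, shape (out=2, in=2)
lora_a = [[0.5, -0.5]]          # rank-1 A, shape (r=1, in=2)
lora_b = [[0.0], [0.0]]         # B starts at zero, shape (out=2, r=1)
magnitude = row_norms(w0)       # magnitude initialised from the base weight

# With B = 0 the adapter is a no-op, so the DoRA weight reproduces w0.
at_init = dora_weight(w0, lora_b, lora_a, magnitude)

# After a (made-up) update to B, the direction of row 0 changes while its
# length stays equal to the magnitude entry for that row.
trained_b = [[1.0], [0.0]]
updated = dora_weight(w0, trained_b, lora_a, magnitude)
```

In this sketch, plain LoRA changes both the length and the direction of a row through the same low-rank term, while the DoRA variant lets the low-rank term steer only the direction and leaves the length to the magnitude vector.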
Source: Hugging Face