One Model, Three Modalities: ByteDance Releases Lance for Image and Video Understanding, Generation, and Editing
ByteDance's new AI model, Lance, integrates image and video understanding, generation, and editing capabilities in a single architecture.

Building a single model that can both understand and generate images and videos is hard because the two tasks pull in different directions: understanding benefits from high-level semantic features, while generation needs low-level continuous representations. Most systems handle this tension by splitting the two into distinct architectures and bridging them post hoc.
However, ByteDance's research team took a different approach with Lance, designing a model that natively integrates understanding, generation, and editing across both image and video modalities, trained jointly from the start.

Lance organizes its capabilities into three output families: text (X2T), images (X2I), and videos (X2V). On the understanding side, this covers image and video captioning, visual question answering, OCR, visual grounding, and reasoning. On the generation side, it handles text-to-image, text-to-video, image-to-video, subject-driven generation, image editing, and video editing, including multi-turn consistency editing across both modalities.
This breadth is notable because standard unified architectures typically stop at basic image understanding and text-to-image generation.

The architecture of Lance rests on two principles: unified context modeling and decoupled capability pathways. For unified context, Lance converts all inputs (text, images, and videos) into a single shared interleaved multimodal sequence. The model then runs generalized 3D causal attention over the full context, with text tokens using causal attention and visual tokens using bidirectional attention.
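The attention rule described above can be illustrated with a small mask builder. This is a minimal sketch under stated assumptions, not Lance's implementation: it assumes that visual tokens attend bidirectionally only within their own image or video segment, that everything else follows causal order, and that the 3D (frame, height, width) layout of video tokens can be ignored by flattening each segment. The function name `build_attention_mask` and the `(kind, length)` segment format are hypothetical.

```python
from typing import List, Tuple


def build_attention_mask(segments: List[Tuple[str, int]]) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    segments: interleaved sequence as (kind, length) pairs,
              where kind is "text" or "visual" (hypothetical format).
    """
    seg_ids: List[int] = []
    kinds: List[str] = []
    for seg_id, (kind, length) in enumerate(segments):
        if kind not in ("text", "visual"):
            raise ValueError(f"unknown segment kind: {kind!r}")
        seg_ids.extend([seg_id] * length)
        kinds.extend([kind] * length)

    n = len(seg_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            # Causal rule: every token sees itself and all earlier tokens.
            causal = k <= q
            # Assumed bidirectional rule: a visual token also sees later
            # tokens that belong to the same visual segment.
            same_visual = kinds[q] == "visual" and seg_ids[q] == seg_ids[k]
            mask[q][k] = causal or same_visual
    return mask


# Example: 2 text tokens, a 3-token image, then 2 more text tokens.
mask = build_attention_mask([("text", 2), ("visual", 3), ("text", 2)])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the rows for text tokens form a lower triangle, while the rows for the three visual tokens share a full block over their own segment. A boolean mask of this shape is the kind of input that attention implementations typically accept to restrict which positions each token can see.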
For decoupled pathways, Lance uses a dual-stream mixture-of-experts architecture initialized from Qwen2.5-VL 3B: the understanding expert (LLMUND) handles text and semantic visual tokens, and the generation expert (LLMGEN) handles VAE latent tokens for visual synthesis and editing.

Lance was developed in a four-stage training process: pre-training, continual training, supervised fine-tuning, and reinforcement learning. The maximum training budget was 128 GPUs. Reported results span image generation, video generation, image editing, and video understanding.
For example, on GenEval, Lance scores 0.90 overall, matching TUNA for the top spot among unified models. On VBench, Lance achieves a Total Score of 85.11, the highest among unified models.

Lance is available on GitHub, with instructions for installing and running the model. The repository includes inference scripts, a Gradio interface, benchmark scripts, and model configuration files.
The model weights are available on Hugging Face. ByteDance also provides a step-by-step installation guide; running Lance requires CUDA-capable hardware with significant VRAM.
Source: MarkTechPost