Zyphra Releases Zamba2-VL: Hybrid Mamba2–Transformer Vision-Language Models That Cut Time-to-First-Token by About an Order of Magnitude
Zyphra has released Zamba2-VL, a family of open vision-language models. The release covers three sizes: 1.2B, 2.7B, and 7B parameters. Each model is built on the Zamba2 hybrid SSM–Transformer backbone.
Vision-language models (VLMs) read images and text together. They answer questions about charts, documents, and photos. Most open VLMs use a dense Transformer as the language model. Zamba2-VL replaces that with a hybrid state-space design. The goal is competitive accuracy at lower latency.
Zamba2-VL follows the now-standard LLaVA-style VLM template. A pre-trained vision encoder turns image patches into features. A lightweight MLP adapter projects those features into the language model’s space. The language model then reads an interleaved sequence of vision and text tokens. The models support single and multi-image understanding and grounding.
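The data flow above can be sketched in plain Python. This is an illustrative toy, not Zyphra's code: the function names (`encode_image`, `adapter`, `build_sequence`), the tiny dimensions, the random weights, and the ReLU nonlinearity inside the adapter are all assumptions made for readability.

```python
import random

VISION_DIM = 4  # hypothetical width of the vision encoder's patch features
MODEL_DIM = 6   # hypothetical width of the language model's embeddings


def make_matrix(rows, cols, rng):
    """Random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def encode_image(num_patches, rng):
    """Stand-in for a pre-trained vision encoder: one feature vector per patch."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(VISION_DIM)]
            for _ in range(num_patches)]


def adapter(feature, w1, w2):
    """MLP adapter: project a patch feature into the language model's space."""
    hidden = [max(0.0, h) for h in matvec(w1, feature)]  # ReLU is an assumption
    return matvec(w2, hidden)


def build_sequence(segments, w1, w2, embedding_table, rng):
    """Interleave text and vision embeddings in the order the segments appear.

    Each segment is ("text", [token ids]) or ("image", number of patches).
    """
    sequence = []
    for kind, payload in segments:
        if kind == "text":
            sequence.extend(("text", embedding_table[t]) for t in payload)
        else:
            patches = encode_image(payload, rng)
            sequence.extend(("vision", adapter(p, w1, w2)) for p in patches)
    return sequence


rng = random.Random(0)
w1 = make_matrix(MODEL_DIM, VISION_DIM, rng)
w2 = make_matrix(MODEL_DIM, MODEL_DIM, rng)
embedding_table = make_matrix(10, MODEL_DIM, rng)  # toy vocabulary of 10 tokens

# Two images interleaved with text, as in a multi-image prompt.
segments = [("text", [1, 2]), ("image", 3), ("text", [4]), ("image", 2)]
sequence = build_sequence(segments, w1, w2, embedding_table, rng)
kinds = [kind for kind, _ in sequence]
print(kinds)
```

After projection, vision and text embeddings share one width, so the language model can consume them as a single sequence regardless of which modality each position came from.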
Zyphra pairs each Zamba2 backbone with the Vision Transformer from Qwen2.5-VL. That encoder was chosen for two specific properties. It uses 2D rotary position embeddings and native dynamic-resolution processing. A two-layer MLP adapter connects the encoder to the backbone.
The Zamba2 backbone is where the design diverges from typical VLMs. It is a hybrid of Mamba2 state-space layers and shared Transformer blocks. The Mamba2 layers run in time linear in sequence length and keep a fixed-size state. A small number of shared attention layers are interleaved between them. The shared blocks reuse the same weights wherever they appear, and each occurrence carries its own LoRA adapter.
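A rough sketch of that layout, and of why weight sharing with per-position LoRA is cheap, is shown below. The article does not give layer counts, so every number here is hypothetical: 12 Mamba2 layers, a shared block every 6 layers, a model width of 1024, and a LoRA rank of 8. For simplicity the sketch uses a single shared block and counts only one square projection matrix per block; `build_layer_plan` and `parameter_counts` are made-up helper names.

```python
def build_layer_plan(num_mamba_layers, shared_every):
    """List the layers in order, inserting the shared attention block
    after every `shared_every` Mamba2 layers. Each insertion reuses the
    same base weights but gets its own LoRA adapter index."""
    plan = []
    lora_index = 0
    for i in range(num_mamba_layers):
        plan.append({"type": "mamba2", "index": i})
        if (i + 1) % shared_every == 0:
            plan.append({"type": "shared_attention", "lora": lora_index})
            lora_index += 1
    return plan


def parameter_counts(plan, d_model, lora_rank):
    """Compare attention-side parameters for one d_model x d_model projection:
    shared weights plus per-position LoRA versus fully separate weights."""
    invocations = sum(1 for layer in plan if layer["type"] == "shared_attention")
    shared_base = d_model * d_model                   # stored once
    lora_total = invocations * 2 * d_model * lora_rank  # A and B per position
    unshared = invocations * d_model * d_model        # one full copy per position
    return {
        "invocations": invocations,
        "shared_plus_lora": shared_base + lora_total,
        "unshared": unshared,
    }


plan = build_layer_plan(num_mamba_layers=12, shared_every=6)
counts = parameter_counts(plan, d_model=1024, lora_rank=8)
print([layer["type"] for layer in plan])
print(counts)
```

Under these assumed numbers, the low-rank adapters add only a few percent on top of the shared weights, while still letting each occurrence of the block behave slightly differently.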
The Mamba2 layers carry the bulk of computation cheaply. The shared attention layers preserve in-context retrieval that pure-SSM models give up. The hybrid trades full-attention expressivity against state-space efficiency.
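The scaling argument behind this trade-off can be illustrated with back-of-the-envelope operation counts. The sketch below is not a benchmark and does not reproduce Zyphra's latency figures: it counts only the sequence-mixing work, ignores constants and projections, and uses hypothetical layer counts and dimensions.

```python
def attention_mixing_ops(seq_len, d_model):
    """Every token attends to every token: grows with seq_len squared."""
    return seq_len * seq_len * d_model


def ssm_mixing_ops(seq_len, d_model, state_dim):
    """Each token updates a fixed-size state: grows linearly with seq_len."""
    return seq_len * d_model * state_dim


def dense_ops(seq_len, d_model, num_layers):
    """All layers use attention (a dense Transformer)."""
    return num_layers * attention_mixing_ops(seq_len, d_model)


def hybrid_ops(seq_len, d_model, state_dim, num_ssm_layers, num_attn_layers):
    """Mostly state-space layers with a few attention layers."""
    return (num_ssm_layers * ssm_mixing_ops(seq_len, d_model, state_dim)
            + num_attn_layers * attention_mixing_ops(seq_len, d_model))


# Hypothetical configuration: 24 layers total, 2 of them attention in the hybrid.
D_MODEL, STATE_DIM = 1024, 64
for seq_len in (1_000, 4_000, 16_000):
    dense = dense_ops(seq_len, D_MODEL, num_layers=24)
    hybrid = hybrid_ops(seq_len, D_MODEL, STATE_DIM,
                        num_ssm_layers=22, num_attn_layers=2)
    print(seq_len, round(dense / hybrid, 1))
```

Because image inputs can expand into many vision tokens, the prompt length is where the quadratic term hurts most, which is why replacing most attention layers matters for prompt processing.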
Zamba2-VL uses the Mistral v0.1 tokenizer. It was trained on 100B tokens of vision-text and pure-text data. That data was sourced from open web datasets.
The research team evaluated Zamba2-VL across 14 benchmarks. These span chart, diagram, and document understanding. They also cover general perception, reasoning, and visual counting. All scores come from Zyphra’s evaluation harness, which is based on VLMEvalKit. The report compares against the Molmo2, Qwen3-VL, and InternVL3.5 families.
Of these baselines, InternVL3.5-2B and Qwen3-VL-2B are similar in size to Zamba2-VL-2.7B, while Molmo2-4B and Qwen3-VL-4B are larger.
The results are uneven across categories. Counting is the strongest. Zyphra reports Zamba2-VL-1.2B at 62.5 on PixMoCount, compared with 32.8 for InternVL3.5-1B and 17.7 for PerceptionLM-1B. Document understanding also holds up, with DocVQA at 90.9 for the 2.7B model. The models lag larger baselines on knowledge-heavy reasoning benchmarks such as MMMU and MathVista.
Source: MarkTechPost