VibeThinker-3B: A 3B Dense Reasoning Model Built on Qwen2.5-Coder-3B With the Spectrum-to-Signal Post-Training Pipeline
While recent breakthroughs in AI reasoning have largely been driven by massive scale, with hundreds of billions of parameters poured in to cross complex cognitive thresholds, VibeThinker-3B charts a different path.

Created by researchers at Sina Weibo Inc. (China), this 3-billion-parameter model aims to show that a small model can perform far above its weight class. Released under the open-source MIT license, VibeThinker-3B is reported to match models hundreds of times its size on verifiable tasks in mathematics, coding, and STEM.
VibeThinker-3B is a compact dense model built on the Qwen2.5-Coder-3B base. It is post-trained, not pretrained from scratch. The research team applies supervised fine-tuning, reinforcement learning, and self-distillation on top.
The training framework continues the Spectrum-to-Signal Principle (SSP) from the earlier VibeThinker-1.5B. SFT (Supervised Fine-Tuning) builds a broad space of valid reasoning paths, the ‘Spectrum.’ RL then amplifies the correct paths, the ‘Signal.’
The model targets one job: reasoning where a verifier can confirm the answer. The research team recommends larger general models for open-domain knowledge tasks. VibeThinker-3B is a specialist by design.
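To make "a verifier can confirm the answer" concrete, here is a minimal sketch of an exact-match answer checker. It is illustrative only: the article does not describe the team's verifier, and the \boxed{...} answer convention and the function names extract_final_answer and verify are assumptions introduced for this example.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in a response, or None.

    Assumption: the model marks its final answer with \\boxed{...}.
    Nested braces are not handled in this sketch.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def verify(response: str, reference: str) -> bool:
    """Hypothetical verifier: True only if the extracted answer equals the reference."""
    answer = extract_final_answer(response)
    return answer is not None and answer == reference.strip()


print(verify("Adding the cases gives \\boxed{42}.", "42"))  # True
print(verify("I believe the answer is 41.", "42"))          # False
```

A check like this returns an unambiguous pass or fail, which is what separates verifiable math and code tasks from open-domain questions where no reference answer exists.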
It runs on standard stacks. Loading the model requires transformers>=4.54.0. For faster inference, the research team recommends vLLM==0.10.1 or SGLang>=0.4.9.post6. The BF16 weights are roughly 6 GB, small enough for a single GPU.
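The 6 GB figure follows from simple arithmetic: parameter count times bytes per parameter. The sketch below assumes exactly 3 billion parameters and decimal gigabytes; the real parameter count may differ slightly, and the helper name weight_size_gb is made up for this example.

```python
# Bytes used by one parameter in common numeric formats.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_size_gb(num_params: float, dtype: str) -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumption: exactly 3e9 parameters.
print(f"BF16 weights: ~{weight_size_gb(3e9, 'bf16'):.1f} GB")  # BF16 weights: ~6.0 GB
```

This covers the weights only. Inference also needs memory for the KV cache and activations, so the actual GPU requirement is higher than the weight size.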
On AIME26, VibeThinker-3B scores 94.3. According to the research paper, this is comparable to DeepSeek V3.2 (671B) and Kimi K2.5 (1T).
On LiveCodeBench v6, it reaches 80.2 Pass@1. On OJBench, another code benchmark, it scores 38.6, below the largest models. On HMMT25 it scores 89.3, and on BruMO25 it reaches 93.8. On IMO-AnswerBench, a 400-problem IMO-level set, it scores 76.4.
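For readers unfamiliar with Pass@1, the sketch below shows the unbiased pass@k estimator commonly used in code-generation evaluation. The article does not say how the team computed its scores, so treat this as background and not as their method; the function names and the toy data are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list, k: int) -> float:
    """Average pass@k over (n, c) pairs, reported as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Toy data: three problems, four samples each, with 4, 2 and 0 correct.
toy_results = [(4, 4), (4, 2), (4, 0)]
print(benchmark_pass_at_k(toy_results, k=1))  # 50.0
```

With k=1 the estimator reduces to c/n per problem, so Pass@1 is the average fraction of single attempts that pass all tests.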
The table below compares it against much larger reasoning models. The ‘+CLR’ row uses test-time scaling; CLR stands for Claim-Level Reliability Assessment.
The pattern is consistent. On verifiable math and code, the 3B model sits near the top cluster. On GPQA-Diamond, a knowledge-heavy benchmark, the gap to large models stays visible.
The research team also ran an out-of-distribution coding test using recent LeetCode weekly and biweekly contests held from April 25 to May 31, 2026. The model passed 123 of 128 first-attempt Python submissions, a 96.1% acceptance rate on unseen problems.
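The acceptance rate can be checked directly from the two counts reported above. The variable names are illustrative.

```python
accepted = 123   # first-attempt submissions that passed
submitted = 128  # total first-attempt submissions

rate = 100 * accepted / submitted
rejected = submitted - accepted

print(f"{rate:.1f}% accepted, {rejected} rejected")  # 96.1% accepted, 5 rejected
```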
Source: MarkTechPost