Weibo's VibeThinker-3B sparks debate over AI benchmarks and scaling laws
A 3-billion-parameter language model from Sina Weibo posts benchmark scores comparable to those of far larger models, and researchers are split on whether that reflects a breakthrough or a flaw in the benchmarks.

On Sunday, a team of nine researchers at Sina Weibo, the Chinese social media giant, quietly posted a 14-page technical report to arXiv that sent shockwaves through the AI research community. Their claim: a language model with just 3 billion parameters can match or exceed the reasoning performance of flagship systems from Google DeepMind, OpenAI, Anthropic, and DeepSeek that are hundreds of times larger.
The model, called VibeThinker-3B, scored 94.3 on AIME 2026, the American Invitational Mathematics Examination, one of the most demanding standardized math competitions in the world. That figure places it alongside DeepSeek V3.2, a model with 671 billion parameters, and ahead of Gemini 3 Pro, Google's flagship reasoning system, which scored 91.7. With a test-time scaling technique the team calls Claim-Level Reliability Assessment, the score climbs to 97.1, edging past virtually every publicly reported system.
Within hours of publication, the paper had drawn 62 upvotes on Hugging Face's daily papers feed, the model repository had accumulated 130 likes, and the GitHub repository had reached 685 stars.
But the reaction on social media was not uniformly celebratory. "WHAT THE HELL is happening in AI?" wrote the user @orcus108 on X, in a post that accumulated over 161,000 views. "A 3B parameter model just put up coding benchmark scores in the same league as Claude Opus 4.5… I genuinely don't know if this is a breakthrough or if the benchmarks are broken."
The results in the technical report are, by any conventional standard, extraordinary. On the mathematics side, VibeThinker-3B achieved 91.4 on AIME 2025, 94.3 on AIME 2026, 89.3 on HMMT 2025, 93.8 on BruMO 2025, and 76.4 on IMO-AnswerBench. In coding, it posted an 80.2 Pass@1 on LiveCodeBench v6 and achieved a 96.1 percent acceptance rate on unseen LeetCode weekly and biweekly contests.
The researchers frame these results not as an anomaly but as evidence for a broader theoretical claim. They introduce what they call the "Parametric Compression-Coverage Hypothesis," which argues that different types of AI capability have fundamentally different relationships to model size.
The paper acknowledges this distinction directly. On GPQA-Diamond, a graduate-level science knowledge benchmark, VibeThinker-3B scored just 70.2, well behind the 91.9 achieved by Gemini 3 Pro and the 87.0 scored by Claude Opus 4.5. The authors write that this gap "is consistent with our claim rather than a contradiction to it: the main finding is not that a 3B model has fully replaced leading general-purpose models, but that a small model can reach first-tier performance on many verifiable reasoning tasks."
VibeThinker-3B is not built from scratch. It is post-trained on top of Qwen2.5-Coder-3B, a compact foundation model from Alibaba's Qwen team, through what the Weibo AI researchers call the "Spectrum-to-Signal Principle," a multi-stage pipeline first introduced in the team's earlier VibeThinker-1.5B work. The training unfolds in four major phases.
Source: VentureBeat