Qwen Introduces Qwen3.7-Max: A Reasoning Agent Model With a 1M-Token Context Window
AI News Desk
·
MarkTechPost
··
6 min read
Most AI models today are not designed for sustained, multi-step autonomous execution.
Most AI models today are not designed for sustained, multi-step autonomous execution. Tasks like running hundreds of iterative code modifications, or chaining tool calls across hours without human intervention, require a different kind of model architecture and training focus.
Alibaba’s Qwen team formally announced Qwen3.7-Max at the 2026 Alibaba Cloud Summit on May 20. Although, two preview versions of the Qwen3.7 series quietly appeared on Arena AI’s leaderboard with no press release and no official API announcement.
Alibaba previewed two models simultaneously: Qwen3.7-Max-Preview and Qwen3.7-Plus-Preview. They ranked 13th globally in text capabilities and 16th in vision capabilities, respectively, according to LM Arena.
In Text Arena, Qwen3.7-Max-Preview ranked #13 overall, placing Alibaba as the #6 lab in text. In Vision Arena, Qwen3.7-Plus-Preview ranked #16 overall, placing Alibaba as the #5 lab in vision. The model rank and the lab rank are separate figures.
Qwen3.7-Plus-Preview is described as a high-performance balanced version preview, focusing on reasoning and logical expression, with its toolchain to be gradually opened in the future. It handles vision and multimodal inputs. Qwen3.7-Max is the text-only reasoning flagship. This article covers Qwen3.7-Max, as it is the model Alibaba formally announced with API access.
Alibaba Qwen team described Qwen3.7-Max as its most advanced and comprehensive agent model to date. The model is proprietary and closed-weight. It is capable of handling coding and debugging, office workflow automation, and long-horizon tasks spanning hundreds or even thousands of steps.
Qwen3.7-Max is a reasoning model. The model generates a chain of thought first — an internal sequence of steps where it plans, checks its work, and corrects course before committing to a final answer. On interfaces like Qwen Chat, this shows up as a ‘Thinking’ mode you can switch on to see the model’s reasoning trace.
Reasoning models produce significantly more output tokens than standard completions. When Artificial Analysis ran its Intelligence Index evaluation, Qwen3.7-Max generated about 97 million tokens, compared to an average of 24 million for models on that benchmark. For short or simple tasks, this overhead adds latency without improving output quality. For multi-step planning, code refactoring, or long agent chains, extended-thinking mode is where the model’s strength applies.
Source: MarkTechPost
The model features a 1M token context window, up from 256K on Qwen3.6 Max Preview. It supports text input and output only. Pricing has not yet been announced. Qwen3.6 Max Preview was priced at $1.30/$7.80 per million input/output tokens on Alibaba Cloud.
A million-token context window can hold a full mid-sized code repository or a large stack of documents in a single request. Models often reason less reliably as the context window fills. Independent long-context testing for Qwen3.7-Max is not yet available.
Qwen3.7-Max scored 56.6 on the Artificial Analysis Intelligence Index, placing it fifth overall. That represents a 4.8-point gain over its predecessor Qwen3.6 Max Preview (51.8), and puts it ahead of Google’s Gemini 3.5 Flash (55.3). GPT-5.5 (60.2), Claude Opus 4.7 (57.3), and Gemini 3.1 Pro Preview (57.2) still lead the overall rankings.
The Intelligence Index v4.0 aggregates ten evaluations, including GDPval-AA, Terminal-Bench Hard, SciCode, AA-Omniscience, Humanity’s Last Exam, and GPQA Diamond.
The improvement over Qwen3.6 Max Preview is not uniform. Most of the Index gains are concentrated in scientific reasoning, agentic capability, and coding. CritPt rose 9.7 percentage points (from 3.7% to 13.4%), Humanity’s Last Exam jumped 9.2 points (from 28.9% to 38.1%), and Terminal-Bench Hard climbed 6.9 points (from 43.9% to 50.8%). GDPval-AA added 42 Elo points (from 1504 to 1546). Scores on other benchmarks are largely flat compared to Qwen3.6 Max Preview.
One result on the Index requires careful reading. On AA-Omniscience, Qwen3.7-Max’s raw accuracy actually dropped 7.6 percentage points (from 37.7% to 30.1%), while its hallucination rate fell 21.3 points (from 44.2% to 22.9%). The model is choosing to say “I don’t know” more often rather than recalling more facts. Its attempt rate fell from 67.3% to 48.0%, the lowest among frontier models in the comparison. The AA-Omniscience benchmark rewards correct answers and penalizes hallucinations but has no penalty for refusing to answer. For use cases that depend on broad factual recall, this is a meaningful limitation to test against your workload.
In Text Arena, Qwen3.7-Max-Preview ranked #13 overall with an Elo score of 1,475. Category rankings include #7 in Math, #9 in Expert Prompts, #9 in Software and IT, and #10 in Coding.
All benchmark numbers are preliminary. The model carries a ‘Preview’ mode, indicating Alibaba considers it an early build.
In an internal Alibaba test on a new chip platform, the model autonomously performed more than 1,000 tool calls and iterative code modifications to optimize a key kernel. Alibaba claimed the process improved inference speed by roughly 10x compared with the previous version.
1 million tokens — enough to fit a full mid-sized code repository in a single request.
Uses chain-of-thought (extended-thinking mode) before producing a final answer.
Text in, text out. No image input supported in this model.
Use qwen3.7-max when calling via Alibaba Cloud Model Studio.
Use your hardest real-world prompts when testing. Multi-step math problems, complex refactoring requests, and ambiguous expert questions reveal more about model quality than simple prompts.
Get your API key from Alibaba Cloud Model Studio (DashScope). The base URL for international access is dashscope-intl.aliyuncs.com .
Pricing has not yet been announced for Qwen3.7-Max. For reference, Qwen3.6 Max Preview was priced at $1.30 / $7.80 per million input/output tokens.
Multi-step code refactoring, complex math proofs, long agent task chains, and ambiguous problems requiring step-by-step planning.
Short rewrites, simple classifications, quick lookups, or tasks where latency and token cost need to be minimised.
Qwen3.7-Max generated ~97M tokens on Artificial Analysis benchmarks, vs. an average of 24M for comparable models. Each thinking token adds to latency and cost — use thinking mode selectively.
The 35-hour and 1,000+ tool call figures come from Alibaba’s internal testing only. No independent verification exists for these specific claims.
Qwen3.7-Max is text-only. For multimodal tasks, use Qwen3.7-Plus-Preview instead, which supports vision input.
On the AA-Omniscience benchmark, the model’s attempt rate dropped from 67.3% to 48.0%. It abstains more and hallucinates less — but its raw factual recall also dropped. Test carefully for knowledge-recall tasks.
The model currently carries a — Preview suffix. Benchmark scores, behaviour, and pricing can change before stable release. No open-weight version is available as of May 2026.
A 1M token context window is a ceiling, not a guarantee. Independent long-context testing for Qwen3.7-Max is not yet available. Validate retrieval quality on your specific workload.
For the latest model updates, check the official Qwen blog at qwen.ai/blog and Alibaba Cloud Model Studio docs.
Check out the Technical details. and Docs . Also, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter . Wait! are you on telegram? now you can join us on telegram as well.
Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us
The post Qwen Introduces Qwen3.7-Max: A Reasoning Agent Model With a 1M-Token Context Window appeared first on MarkTechPost .