StepFun Releases Step 3.7 Flash: A 198B MoE Vision-Language Model for Coding Agents and Search Workflows
StepFun today released Step 3.7 Flash, a multimodal Mixture-of-Experts model targeting agentic use cases.

StepFun has released Step 3.7 Flash, a multimodal Mixture-of-Experts (MoE) model aimed at coding agents and search workflows. It builds on Step 3.5 Flash, adding native vision input and improved tool-use reliability. The model has roughly 198 billion parameters in total: a 196-billion-parameter language backbone plus a 1.8-billion-parameter vision encoder (ViT) for native image understanding.
During inference, the model activates approximately 11 billion parameters per token. This sparse MoE design keeps per-token compute comparable to that of an 11-billion-parameter dense model while retaining the capacity of a much larger one. The model offers three selectable reasoning depths (low, medium, and high), letting developers trade latency for reasoning depth. This is useful in applications that must balance response speed against accuracy.
Step 3.7 Flash improves on its predecessor, Step 3.5 Flash, on the benchmarks cited. On SWE-Bench Pro, it scores 56.26%, up from 51.3%, and on Terminal-Bench 2.1 it reaches 59.55%, up from 53.37%.
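The parameter and benchmark figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below uses only numbers from the article; the variable names are illustrative, and the "active fraction" is a rough ratio of the approximate per-token figure to the total, not a number StepFun reports.

```python
# Figures as quoted in the article; all names are illustrative.
LANGUAGE_BACKBONE_B = 196.0  # language backbone, billions of parameters
VISION_ENCODER_B = 1.8       # ViT vision encoder, billions of parameters
ACTIVE_PER_TOKEN_B = 11.0    # approximate parameters activated per token

total_b = LANGUAGE_BACKBONE_B + VISION_ENCODER_B
active_fraction = ACTIVE_PER_TOKEN_B / total_b

# (Step 3.5 Flash score, Step 3.7 Flash score), in percent
benchmarks = {
    "SWE-Bench Pro": (51.3, 56.26),
    "Terminal-Bench 2.1": (53.37, 59.55),
}
gains = {name: round(new - old, 2) for name, (old, new) in benchmarks.items()}

print(f"total parameters: {total_b:.1f}B")
print(f"active per token: {active_fraction:.1%} of total")
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
```

Under these figures, about one parameter in eighteen is active for any given token, which is why per-token compute is compared to a much smaller dense model.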
Step 3.7 Flash also supports Advisor Mode, StepFun's implementation of the advisor strategy described by Anthropic. In this mode the model runs the agentic loop end-to-end and escalates to a larger advisor model only at specific inflection points. On SWE-Bench Verified, Step 3.7 Flash reaches 97% of Claude Opus 4.6's coding performance at roughly one-ninth the per-task cost ($0.19 vs. $1.76 per task).
The model offers two visual tool pathways: the Visual Search Tool and the Python Tool. The Visual Search Tool retrieves and verifies information for recognition tasks, achieving 79.16% on SimpleVQA (with Search).
The Python Tool handles fine-grained visual tasks, such as high-resolution image analysis, scoring 95.29% on V* (a self-tested score, with Python).
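The Advisor Mode escalation pattern described earlier can be illustrated with a toy loop. This is a minimal sketch of the general idea only, not StepFun's or Anthropic's implementation: `run_with_advisor`, the `needs_advice` predicate, and the toy executor and advisor functions are all hypothetical stand-ins, and real systems would call models rather than string-formatting functions. The cost constants are the per-task figures quoted in the article.

```python
from typing import Callable, List

# Per-task costs in USD, as quoted in the article.
COST_STEP_FLASH = 0.19
COST_CLAUDE_OPUS = 1.76
cost_ratio = COST_CLAUDE_OPUS / COST_STEP_FLASH  # roughly 9x


def run_with_advisor(
    steps: List[str],
    executor: Callable[[str, str], str],
    advisor: Callable[[str], str],
    needs_advice: Callable[[str], bool],
) -> List[str]:
    """Run every step with the small executor; consult the advisor only when flagged."""
    transcript = []
    for step in steps:
        # Escalate only at steps the (hypothetical) predicate marks as inflection points.
        guidance = advisor(step) if needs_advice(step) else ""
        transcript.append(executor(step, guidance))
    return transcript


# Toy stand-ins so the sketch runs without any model or network access.
advisor_calls: List[str] = []


def toy_advisor(step: str) -> str:
    advisor_calls.append(step)
    return f"plan for {step}"


def toy_executor(step: str, guidance: str) -> str:
    return f"{step} [{guidance or 'no advice'}]"


def toy_needs_advice(step: str) -> bool:
    return step.startswith("design")


result = run_with_advisor(
    ["read issue", "design fix", "run tests"],
    toy_executor,
    toy_advisor,
    toy_needs_advice,
)
print(result)
print(f"advisor consulted on {len(advisor_calls)} of 3 steps")
print(f"cost ratio: {cost_ratio:.1f}x")
```

In this toy run the advisor is consulted on one step out of three; the point of the pattern is that the more expensive model is paid for only where its input matters most.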
Source: MarkTechPost