How Sakana Trained a 7B Model to Orchestrate GPT, Claude, and Gemini LLMs
Sakana AI's RL Conductor, a small language model trained via reinforcement learning, automatically orchestrates a diverse pool of worker LLMs to achieve state-of-the-art results on difficult reasoning and coding benchmarks.

Every LangChain pipeline your team hardcodes starts breaking the moment the query distribution shifts — and it always shifts. That bottleneck is what Sakana AI set out to eliminate. Researchers at Sakana AI have introduced the "RL Conductor," a small language model trained via reinforcement learning to automatically orchestrate a diverse pool of worker LLMs.
The Conductor dynamically analyzes each input, divides the labor among workers, and coordinates the agents' exchanges. This automated coordination achieves state-of-the-art results on difficult reasoning and coding benchmarks, outperforming individual frontier models such as GPT-5 and Claude Sonnet 4, as well as expensive human-designed multi-agent pipelines. It delivers this performance at a fraction of the cost and with fewer API calls than competing approaches.
RL Conductor is the backbone of Fugu, Sakana AI's commercial multi-agent orchestration service.

The limitations of manual agentic frameworks are well known. Large language models have strong latent capabilities, but fully tapping those capabilities remains a major challenge.
Extracting this level of performance has relied heavily on manually designed agentic workflows, which serve as critical components in commercial AI products. These frameworks fall short, however, because they are inherently rigid and constrained. In comments to VentureBeat, Yujin Tang, co-author of the paper, pinpointed where current systems break down: "While using frameworks with hard-coded pipelines like LangChain and Mixture-of-Agents can work well for specific use cases … In production, an inherent bottleneck arises when targeting domains with large user bases with very heterogeneous demands." Tang noted that achieving "real-world generalization in such heterogeneous applications inherently necessitates going beyond human-hardcoded designs." The RL Conductor is designed to overcome exactly these limitations.
As the name implies, it conducts an orchestra of agents: dividing challenging problems, delegating targeted subtasks, and designing communication topologies for a set of worker LLMs. Instead of relying on fixed code or static routing, the Conductor orchestrates these models by generating a customized workflow for each input. To test RL Conductor in action, the researchers fine-tuned the 7-billion-parameter Qwen2.5-7B as the Conductor using the framework.
During training, the Conductor was tasked with designing agentic workflows of up to five steps. It was given access to a worker pool containing seven different models: three closed-source giants (Gemini 2.5 Pro, Claude-Sonnet-4, and GPT-5) and four open-source models (including DeepSeek-R1-Distill-Qwen-32B, Gemma3-27B, and Qwen3-32B). The team evaluated the Conductor across a variety of highly challenging benchmarks, comparing it against individual frontier models acting alone, self-reflection agents prompted iteratively to improve their own answers, and state-of-the-art multi-agent routing frameworks like MASRouter, Mixture-of-Agents (MoA), RouterDC, and Smoothie.
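The setup described above can be pictured with a minimal sketch: a conductor produces a short workflow (here capped at five steps) in which each step delegates a subtask to one worker from the pool and may consume earlier steps' outputs, and an executor then runs the plan. All names and the hard-coded plan below are illustrative assumptions, not Sakana's actual code or workflow format; a stub stands in for both the trained 7B Conductor and the worker API calls.

```python
# Hypothetical sketch of conductor-style orchestration (not Sakana AI's code).
from dataclasses import dataclass, field

# Worker pool loosely mirroring the one described in the article.
WORKER_POOL = [
    "Gemini-2.5-Pro", "Claude-Sonnet-4", "GPT-5",               # closed-source
    "DeepSeek-R1-Distill-Qwen-32B", "Gemma3-27B", "Qwen3-32B",  # open-source
]

@dataclass
class Step:
    worker: str                                  # worker LLM handling this subtask
    subtask: str                                 # instruction written by the conductor
    inputs: list = field(default_factory=list)   # indices of earlier steps feeding in

def plan_workflow(query: str, max_steps: int = 5) -> list:
    """Stand-in for the trained conductor: emits a decompose/solve/reconcile DAG."""
    plan = [
        Step("Qwen3-32B", f"Decompose the problem: {query}"),
        Step("GPT-5", "Solve using the decomposition.", inputs=[0]),
        Step("Claude-Sonnet-4", "Independently solve using the decomposition.", inputs=[0]),
        Step("Gemini-2.5-Pro", "Reconcile the two solutions; give a final answer.", inputs=[1, 2]),
    ]
    return plan[:max_steps]

def call_worker(worker: str, subtask: str, context: list) -> str:
    # Placeholder for a real API call to the chosen worker model.
    return f"[{worker}] {subtask} (context: {len(context)} upstream outputs)"

def execute(workflow: list) -> str:
    """Run steps in order, wiring each step's declared inputs to prior outputs."""
    outputs = []
    for step in workflow:
        context = [outputs[i] for i in step.inputs]
        outputs.append(call_worker(step.worker, step.subtask, context))
    return outputs[-1]

plan = plan_workflow("Prove the statement in problem 3.")
assert len(plan) <= 5 and all(s.worker in WORKER_POOL for s in plan)
print(execute(plan))
```

The key design point the article emphasizes is that nothing in this plan is hand-coded in production: the conductor model itself generates the steps, the worker assignments, and the wiring between them, per query, rather than following a fixed pipeline.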
Source: VentureBeat