Open-source AI search agent Harness-1 outperforms GPT-5.4 on recalling relevant information
Researchers develop Harness-1, a 20-billion parameter open-source search agent that surpasses GPT-5.4 in recalling relevant information.

Researchers at the University of Illinois at Urbana-Champaign (UIUC), UC Berkeley, and Chroma, the open-source AI-native vector database platform, have unveiled Harness-1, a 20-billion-parameter open-source search agent built atop OpenAI's gpt-oss-20B open-source model that redesigns how AI executes complex retrieval tasks. Harness-1 scores an average of 73% on its ability to correctly recall relevant information from a curated dataset, edging out GPT-5.4 (70.9%) and beating the next most accurate open-source search agent, Tongyi DeepResearch 30B, by 11.4 percentage points. Crucially for developers, the model and its environment are available immediately under the permissive Apache 2.0 license, with model code and weights hosted on Hugging Face.
Harness-1 also serves as a proof of efficacy for another effort: Tinker, the distributed, web-based AI model training and fine-tuning API developed by Thinking Machines. To put these models to the test, the researchers evaluated Harness-1 and its competitors across eight highly complex search benchmarks. Rather than asking simple trivia questions, these tests required the AI to act like a real researcher sifting through diverse, dense data sources.
When the results came in, Harness-1 dominated the open-source competition in its ability to successfully find and curate the right facts. Even more impressively, this relatively small 20-billion parameter model went toe-to-toe with massive, expensive proprietary AI systems. Harness-1 achieves its performance gains by offloading the exhaustive "bookkeeping" of a search session out of the model's working memory and into a structured software environment.
As enterprise use cases grow more sophisticated, demanding that models autonomously sift through thousands of corporate documents or financial filings, these systems frequently succumb to "search amnesia"—forgetting their original queries, looping over rejected documents, or losing track of the specific claims they are trying to verify. Until now, the prevailing solution to this amnesia has been brute force. Engineers typically force models to constantly reread an ever-expanding, append-only transcript of their own actions, piling every search, read, and thought back into a massive context window.
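To make the contrast concrete, here is a minimal Python sketch of the two approaches: an append-only transcript that grows with every step, and a small structured state object that the environment maintains on the model's behalf. This is an illustration only, not Harness-1's actual code; the `SearchState` class, its fields, the `simulate` helper, and the sample query are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SearchState:
    """Hypothetical session state held by the environment, not the model."""
    query: str                                              # original question, never dropped
    open_claims: list[str] = field(default_factory=list)    # claims still to verify
    evidence: dict[str, str] = field(default_factory=dict)  # doc_id -> short note
    rejected: set[str] = field(default_factory=set)         # doc_ids already ruled out

    def should_read(self, doc_id: str) -> bool:
        """Skip documents that were already rejected or already used."""
        return doc_id not in self.rejected and doc_id not in self.evidence

    def reject(self, doc_id: str) -> None:
        self.rejected.add(doc_id)

    def accept(self, doc_id: str, note: str, claim: Optional[str] = None) -> None:
        """Record evidence and close the claim it supports, if any."""
        self.evidence[doc_id] = note
        if claim in self.open_claims:
            self.open_claims.remove(claim)

    def render(self) -> str:
        """Compact summary handed to the model at each step."""
        lines = [f"QUERY: {self.query}"]
        lines += [f"OPEN CLAIM: {c}" for c in self.open_claims]
        lines += [f"EVIDENCE {d}: {n}" for d, n in self.evidence.items()]
        lines.append(f"REJECTED SO FAR: {len(self.rejected)} documents")
        return "\n".join(lines)


def simulate(steps: int) -> tuple[int, int]:
    """Return (transcript_chars, state_chars) after `steps` rejected reads."""
    transcript: list[str] = []  # append-only baseline: every action is kept
    state = SearchState(query="Which filing reports the 2023 revenue?",
                        open_claims=["2023 revenue figure"])
    for i in range(steps):
        doc_id = f"doc-{i}"
        if not state.should_read(doc_id):
            continue
        transcript.append(f"search -> read {doc_id} -> thought: not relevant")
        state.reject(doc_id)
    return len("\n".join(transcript)), len(state.render())


if __name__ == "__main__":
    for n in (10, 1000):
        print(n, simulate(n))
```

In this toy setup the transcript grows roughly linearly with the number of steps, while the rendered state stays almost constant in size because rejected documents are tracked by the environment and only summarized as a count. The original query and the open claims are always present in the rendered view, which is the property the "search amnesia" description above is concerned with.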
Harness-1 introduces a paradigm shift away from rereading an ever-growing transcript, suggesting that the bottleneck for true artificial autonomy isn't necessarily the size of the model, but how efficiently its working environment manages state. It highlights once more, as Anthropic's Claude Code has also done, that the raw model is arguably less important than the harness (the set of conditions through which it runs). The training pipeline for Harness-1 represents a fundamental shift in how the AI industry approaches agentic learning.
Source: VentureBeat