Researchers Train Foundation Model from Scratch for $1,500
Sapient's HRM-Text model achieves competitive performance with much larger models at a fraction of the cost and training data.

Model from Scratch for $1,500">
Training a foundation large language model (LLM) from scratch typically costs millions of dollars and requires internet-scale data, which is why most enterprises don't bother. Sapient thinks it has found a cheaper path with HRM-Text, a model that replaces standard Transformers with a highly sample-efficient Hierarchical Recurrent Model (HRM). This architecture decouples computation into slow-evolving strategic and fast-evolving execution layers, allowing the model to train exclusively on instruction-response pairs.
The researchers were able to train a 1B-parameter HRM-Text model from scratch at a fraction of the cost and with far fewer tokens than normal LLMs. The model achieved performance competitive with much larger open models on key industry benchmarks. For real-world AI applications, this means foundational pretraining is no longer restricted to highly resourced institutions.
Organizations can now affordably pretrain their own highly capable reasoning models from scratch and pair them with external knowledge stores. The current approach to training LLMs is brute force: scrape the internet, run next-token prediction trillions of times, and assume the model has developed a working internal model of the world. However, this approach wastes millions of dollars of computing power forcing models to memorize everything collected from the internet, just so they can indirectly learn how to think.
Guan Wang, CEO of Sapient Intelligence, framed this as an issue of the "economics of iteration." "Enterprises today face three compounding problems: training is expensive, infrastructure is heavy, and experimentation cycles are too slow," Wang said. "The industry's scaling addiction says: 'When the model fails, make it bigger. Add more data.
Add more GPUs.' That has worked, but it is reaching a point of diminishing returns. More scale often means more memorization, more latency, more infrastructure, and more vendor dependency. It does not necessarily give an enterprise a better reasoning engine." The researchers built a highly compact 1-billion-parameter HRM-Text model, training it from scratch on a tightly curated dataset of just 40 billion tokens.
The training data consisted entirely of instruction-response pairs across general instructions, math, symbolic logic, textbook exercises, and rewritten knowledge. The model was trained using a task-completion objective and achieved competitive scores using 100 to 900 times fewer training tokens and 96 to 432 times less estimated compute than models like Qwen, Gemma, and Llama. The results show a significant shift in the compute-to-performance frontier.
Source: VentureBeat