Granite 4.1 LLMs: A Technical Walkthrough of Data Engineering and Training
A detailed look at the data engineering, pre-training, supervised fine-tuning, and reinforcement learning behind the Granite 4.1 LLMs.
The development of high-quality language models requires more than just scaling up compute resources. It demands rigorous data curation throughout the training process. The Granite 4.1 family of dense, decoder-only LLMs (3B, 8B, and 30B) was trained on approximately 15 trillion tokens using a multi-stage pre-training pipeline that includes long-context extension of up to 512K tokens. The models are further refined with supervised fine-tuning on around 4.1 million high-quality curated samples and reinforcement learning via on-policy GRPO with DAPO loss.

The Granite 4.1 models use a decoder-only dense transformer architecture with key design choices including Grouped Query Attention (GQA), Rotary Position Embeddings (RoPE), SwiGLU activations, RMSNorm, and shared input/output embeddings. All three model sizes share the same training pipeline and data strategy, differing only in architecture dimensions. The training process follows a five-phase strategy spanning foundational pre-training, mid-training with progressively higher-quality data annealing, and long-context training that extends the context window to 512K tokens.
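To make these design choices concrete, the sketch below shows how they typically surface in a decoder-only model configuration. The dimensions are placeholders for illustration only, not the published Granite 4.1 hyperparameters; only the listed components (GQA, RoPE, SwiGLU, RMSNorm, tied embeddings, 512K context) come from the description above.

```python
from dataclasses import dataclass

@dataclass
class DenseDecoderConfig:
    """Illustrative config for a dense, decoder-only transformer.
    All dimensions are placeholders, NOT the published Granite 4.1 values."""
    hidden_size: int = 4096                  # model width (hypothetical)
    num_layers: int = 36                     # decoder blocks (hypothetical)
    num_attention_heads: int = 32            # query heads
    num_key_value_heads: int = 8             # fewer KV heads -> Grouped Query Attention
    intermediate_size: int = 12288           # SwiGLU feed-forward width (hypothetical)
    rope_theta: float = 10_000.0             # Rotary Position Embedding base (hypothetical)
    rms_norm_eps: float = 1e-6               # RMSNorm epsilon
    tie_word_embeddings: bool = True         # shared input/output embeddings
    max_position_embeddings: int = 524_288   # 512K-token context after long-context training

config = DenseDecoderConfig()
# GQA: several query heads share each KV head, shrinking the KV cache.
print(config.num_attention_heads // config.num_key_value_heads, "query heads per KV head")
```

The query-to-KV-head ratio is what makes the attention "grouped": sharing KV heads across query heads reduces the KV cache, which dominates memory once the context window grows toward 512K tokens.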
The pre-training pipeline is followed by supervised fine-tuning (SFT), which turns the base model into a reliable instruction-following assistant. This stage emphasizes data quality, utilizing an LLM-as-Judge framework alongside rule-based filtering to curate high-quality samples. The framework assesses each sample against structural, semantic, and behavioral criteria, ensuring that only the highest-quality data is used for fine-tuning. After SFT, a multi-stage reinforcement learning pipeline is applied to further improve the model's capabilities across specific domains.
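The exact curation code is not published, but the general recipe (cheap rule-based filters followed by an LLM judge scoring structural, semantic, and behavioral criteria) can be sketched as below. The thresholds, field names, and the `judge_score` callable are assumptions for illustration, not the Granite team's implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    prompt: str
    response: str

def passes_rules(sample: Sample) -> bool:
    """Cheap rule-based filters applied before the (expensive) judge model."""
    if not sample.prompt.strip() or not sample.response.strip():
        return False                  # drop empty turns
    if len(sample.response) > 20_000:
        return False                  # drop runaway generations (threshold is an assumption)
    return True

def curate(samples: list[Sample],
           judge_score: Callable[[Sample], dict[str, float]],
           threshold: float = 0.8) -> list[Sample]:
    """Keep samples that pass the rules and score well on structural,
    semantic, and behavioral criteria from an LLM-as-Judge model."""
    kept = []
    for s in samples:
        if not passes_rules(s):
            continue
        scores = judge_score(s)  # e.g. {"structural": 0.9, "semantic": 0.85, "behavioral": 0.95}
        if min(scores.values()) >= threshold:  # require every criterion to pass (assumption)
            kept.append(s)
    return kept
```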
The reinforcement learning pipeline consists of four sequential stages: Multi-domain RL, RLHF, Identity and Knowledge-calibration RL, and Math RL. This approach lets the model learn from a diverse set of tasks and domains, preventing catastrophic forgetting while maximizing performance. The Granite 4.1 models deliver competitive instruction-following and tool-calling capabilities without relying on long chains of thought, making them a production-ready, open-source choice for enterprise workloads.
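The post does not spell out the GRPO/DAPO implementation details, so the sketch below only illustrates the general shape of such an objective: group-relative advantages computed over a group of rollouts per prompt, combined with a token-level clipped surrogate using a DAPO-style asymmetric ("clip-higher") range. All numbers are toy values, not Granite 4.1 hyperparameters.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO advantage: normalize each reward against its own group of rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def dapo_style_loss(logp_new: np.ndarray,    # per-token log-probs under current policy
                    logp_old: np.ndarray,    # per-token log-probs under rollout policy
                    advantages: np.ndarray,  # one advantage per sequence
                    lengths: list[int],      # token count of each sequence
                    eps_low: float = 0.2,
                    eps_high: float = 0.28) -> float:
    """Token-level clipped surrogate with an asymmetric clip range,
    averaged over all tokens in the group rather than per sequence."""
    per_token_adv = np.concatenate(
        [np.full(n, a) for a, n in zip(advantages, lengths)])
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    surrogate = np.minimum(ratio * per_token_adv, clipped * per_token_adv)
    return -surrogate.mean()  # minimize the negative objective

# Toy group of 4 rollouts for one prompt (rewards and token counts are made up).
rewards = np.array([1.0, 0.0, 0.5, 1.0])
adv = group_relative_advantages(rewards)
lengths = [5, 3, 4, 6]
total = sum(lengths)
logp_old = np.random.randn(total) * 0.1 - 2.0
logp_new = logp_old + np.random.randn(total) * 0.05
print("advantages:", adv.round(3), "loss:", round(dapo_style_loss(logp_new, logp_old, adv, lengths), 4))
```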
Notably, the 8B instruct model matches or surpasses the previous Granite 4.0-H-Small (32B-A9B MoE) despite using a simpler dense architecture with fewer parameters. The Granite 4.1 family of models is available under the Apache 2.0 license, and fp8 quantized variants have been released, optimized for inference with vLLM. The models were trained on an NVIDIA GB200 NVL72 cluster hosted on CoreWeave, providing the scalable, high-bandwidth interconnect needed for efficient distributed training.
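Since the fp8 variants target vLLM, serving one locally takes only a few lines. The model identifier below is a placeholder; substitute the actual fp8 checkpoint name from the Granite 4.1 collection on Hugging Face, and treat the sampling settings as illustrative.

```python
from vllm import LLM, SamplingParams

# Placeholder model ID: replace with the published Granite 4.1 fp8 checkpoint name.
llm = LLM(model="ibm-granite/granite-4.1-8b-instruct-FP8")

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(
    ["Summarize the multi-stage training pipeline behind Granite 4.1 in two sentences."],
    params,
)
print(outputs[0].outputs[0].text)
```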
Source: Hugging Face