AI Models

Cohere Releases Command A+: A 218B Sparse MoE Model for Agentic Workflows That Runs on as Few as Two H100 GPUs

AI News Desk

MarkTechPost

May 21, 2026

4 min read

Cohere just released Command A+, as an open-source model targeting enterprise agentic workflows.

Cohere Releases Command A+: A 218B Sparse MoE Model for Agentic Workflows That Runs on as Few as Two H100 GPUs

Cohere just released Command A+, as an open-source model targeting enterprise agentic workflows. Available under an Apache 2.0 license, Command A+ is a mixture-of-experts (MoE) model built for high-performance agentic tasks with minimal compute overhead. The model is optimized for reasoning, agentic workflows, RAG, multilingual, and multimodal document processing. It unifies capabilities from four prior models — Command A, Command A Reasoning, Command A Vision, and Command A Translate — into a single scalable model.

Command A+ is a decoder-only Sparse Mixture-of-Experts Transformer with 218B total parameters and 25B active parameters. It has 128 experts, of which 8 are active per token, and a single shared expert is applied to all tokens. In a MoE model, each token is routed through only a subset of expert sub-networks rather than the full parameter set, keeping active compute at 25B-parameter scale at inference time.

The attention layers interleave sliding-window attention layers with Rotational Positional Embeddings and global attention layers without positional embeddings in a 3:1 ratio. The sparse MoE layer is trained in a fully dropless manner and uses a token-choice router, with a normalized sigmoid over the top-k expert logits per token.

Input modalities are text, image, and tool use. Output modalities are text, reasoning, and tool use. The model supports a 128K input context length and a 64K max generation length.

Three quantization variants are available with minimum GPU requirements: BF16 (16-bit) requires 4× B200 or 8× H100 GPUs; FP8 (8-bit) requires 2× B200 or 4× H100 GPUs; W4A4 (4-bit) runs on a single B200 or 2× H100 GPUs. All three quantizations show negligible differences in benchmark quality. Cohere recommends W4A4 for most deployments.

Cohere applies NVFP4 W4A4 quantization, 4-bit weights and activations with two-level scaling, to the MoE experts only. The attention path, including Q/K/V/O projections, the KV cache, and attention compute, is kept at full precision.

To close residual quality gaps, Cohere uses Quantization-Aware Distillation (QAD) in the post-training phase: the quantized student model is trained to match the full-precision teacher’s output distribution, using fake quantization operators in the forward pass and straight-through estimators on the backward pass.

On τ²-Bench Telecom, scores improved from 37% to 85% over Command A Reasoning, and Terminal-Bench Hard agentic coding performance reached 25% from 3%.

On internal North platform evaluations, all scored using LLM-as-a-judge techniques, Agentic Question Answering accuracy improved by 20% over Command A Reasoning. Agentic QA measures how well the model answers enterprise questions using MCP-connected cloud file systems. Spreadsheet analysis quality improved by 32%, and Memory Usage Quality — measuring how well an agent leverages information from a previous session to answer questions in a subsequent session — scored 54% with Command A+ compared to 39% with Command A Reasoning.

Source: MarkTechPost