New Research Uncovers Neuron-Level Insights into Language Model Refusal Mechanisms
Researchers at Nous Research have developed Contrastive Neuron Attribution (CNA), a method that identifies specific MLP neurons responsible for refusing harmful requests in language models.

Instruction-tuned language models are designed to refuse harmful requests, but the mechanisms behind this behavior are not well understood. A new study from Nous Research examines the question at the level of individual neurons. The researchers developed Contrastive Neuron Attribution (CNA), a method that identifies the specific MLP neurons whose activations most distinguish harmful from benign prompts.

CNA records down-projection activations at the last token position for a set of harmful prompts and a set of benign prompts. It then computes the per-neuron difference in mean activation between the two sets and selects the top-k neurons by absolute difference. The researchers found that ablating just 0.1% of MLP activations reduced refusal rates by more than 50% in most instruct models tested, across Llama and Qwen architectures from 1B to 72B parameters. Output quality stayed above 0.97 at all steering strengths.

The study also compared CNA with other methods, including Contrastive Activation Addition (CAA) and sparse autoencoders (SAEs). CAA was effective at steering model behavior, but it is coarse: it modifies the entire layer-wide signal, which degraded output quality at high steering strengths. SAEs require expensive external training and are sensitive to activation noise. CNA, by contrast, requires only forward passes, with no auxiliary training or iterative search.

The researchers verified that the discovered circuit is causal by multiplying each circuit neuron's activation by a scalar multiplier at inference time. Ablating the circuit in this way produced the reduction in refusal rates of more than 50% in most instruct models tested. The study also found that the late-layer structure that discriminates harmful from benign prompts exists in base models before any fine-tuning, and that alignment fine-tuning transforms the function of neurons within that existing structure into a sparse, targetable refusal gate.

The findings matter for the development of more transparent and controllable language models. As the researchers note, 'CNA identifies the top 0.1% of MLP neurons whose activations most distinguish one behavior from another, for example, harmful prompts from benign prompts.' By identifying and ablating sparse MLP circuits, it is possible to steer LLM behavior without SAE training or weight modification. The work provides a new tool for understanding and controlling language model behavior, and it has the potential to improve the safety and reliability of these models.
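
The selection and scaling steps described above can be illustrated with a minimal NumPy sketch. This is not the researchers' code: the function names `select_circuit` and `scale_circuit`, the synthetic activations, the flattening of all neurons onto a single axis, and the use of a multiplier of 0.0 to stand in for ablation are all assumptions made for illustration. In a real model the activations would be captured from the MLP layers during forward passes.

```python
import numpy as np

def select_circuit(harmful_acts, benign_acts, fraction=0.001):
    """Pick the neurons with the largest absolute mean activation difference.

    Both inputs have shape (n_prompts, n_neurons): one row per prompt,
    holding last-token activations (hypothetical layout for this sketch).
    """
    diff = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
    k = max(1, int(round(fraction * diff.size)))  # e.g. top 0.1%
    top = np.argsort(-np.abs(diff))[:k]           # largest |difference| first
    return np.sort(top), diff

def scale_circuit(acts, circuit, multiplier):
    """Multiply the selected neurons' activations by a scalar.

    multiplier=0.0 is used here as a stand-in for ablation (an assumption).
    """
    out = acts.copy()
    out[..., circuit] *= multiplier
    return out

# Synthetic stand-in data: 64 prompts per set, 4000 neurons.
rng = np.random.default_rng(0)
n_neurons = 4000
harmful = rng.normal(size=(64, n_neurons))
benign = rng.normal(size=(64, n_neurons))

# Plant four neurons that respond strongly to the "harmful" set only.
planted = np.array([7, 1234, 2500, 3999])
harmful[:, planted] += 5.0

circuit, diff = select_circuit(harmful, benign, fraction=0.001)
ablated = scale_circuit(harmful, circuit, multiplier=0.0)

print("selected neurons:", circuit.tolist())
print("max |activation| in circuit after scaling:", np.abs(ablated[:, circuit]).max())
```

Only the selected columns change, which is the contrast the article draws with layer-wide methods such as CAA. Intermediate multipliers would correspond to partial steering strengths.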
Source: MarkTechPost