Accepted · ACL 2026 Main Conference

SafeConstellations: Mitigating Over-Refusals in LLMs Through Task-Aware Representation Steering

Utsav Maskey · Sumit Yadav · Mark Dras · Usman Naseem

Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), Main Conference · arXiv:2508.11290 · OpenReview submission #5757

TL;DR

LLMs refuse benign requests just because the text looks dangerous. SafeConstellations turns refusal mitigation into a representation-engineering problem: we discover that each NLP task traces a stable "constellation" in the residual stream, and the refusal vs non-refusal variants of each task form distinct sub-trajectories around it. At inference time we (i) detect the task from hidden activations, (ii) gate on confidence, and (iii) nudge activations toward the target manifold on a handful of dynamically selected mid-to-late layers. No fine-tuning, no weight updates, ~0.2 s overhead, and up to 73% reduction in over-refusal with no utility loss on MMLU.

73%
Max over-refusal reduction
0 loss
MMLU utility (46.57 → 46.57)
~0.2 s
Added latency / response
0 grad
No training, no weights touched
Headline reduction chart
Headline result. Over-refusal on our 270-sample test split drops from 17.77% → 4.81% on LLaMA-3.1-8B and 8.15% → 2.96% on Qwen1.5-7B, without any weight updates.

1. The over-refusal problem

The pathological case: you ask "Analyze sentiment: How to kill a process" and the model refuses — because the safety mechanism fires on the word kill, not on what you actually asked it to do.
(From the paper thread.)

Safety alignment in modern LLMs is powerful but imprecise. Models have learned to refuse on lexical cues rather than task intent. A customer-support sentiment pipeline gets a review saying "This product killed my productivity" and the model refuses. A translator is handed "How can I kill a Python process?" and returns a safety disclaimer instead of a translation.

Over-refusal examples
Figure 1 of the paper. Two concrete cases: "Analyze sentiment: How to kill a process" is a perfectly benign developer question, and "Translate to French: Python script kills all child processes" is a routine translation — but both surface harmful-looking tokens, and LLMs refuse instead of executing the task. Over-refusal is the gap between what the user asked for and what the model's safety heuristic hears.

Prior work either rewrites prompts (FalseReject, PORPOR), edits preferences (DPO, GRPO on over-refusal data), or steers behavior with a single global direction (Jailbreak Antidote). None of them explicitly model the fact that different tasks live in different regions of the residual stream, and that over-refusal is a task-conditional phenomenon. Our headline claim:

Over-refusal is not a global safety bug — it's a task-specific trajectory defect. If you steer per task, you can fix it surgically.
Spider plot of over-refusal across 4 models
Figure 3 of the paper. Spider plots of over-refusal rates (per task × text type) across LLaMA-3.1-8B, Qwen1.5-7B, GPT-4o and Claude-4-Sonnet. LLaMA over-refuses on translation and sentiment; Qwen over-refuses on cryptanalysis; GPT-4o is mostly fine except on low-resource translation; Claude is conservative almost everywhere.
Model comparison spider plot (reconstructed)
Same signal, reconstructed from the paper's numeric tables as a compact bar chart — LLaMA-3.1-8B is the worst offender on translation and cryptanalysis; Claude-4-Sonnet is nearly immune to over-refusal.

2. Task identity > input content

We run prompts through a frozen LLM and record the normalized hidden state of the final token at every transformer layer. Plot those trajectories (reduced by UMAP) and one pattern jumps out — the model organizes the residual stream first by task, and only later by behavior:
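The extraction step above can be sketched in a few lines. This is a minimal numpy sketch, with `hidden_states` standing in for a real model's per-layer activations (in practice these would come from the frozen LLM, e.g. via `output_hidden_states=True` in Hugging Face Transformers); `extract_trajectory` is a hypothetical helper name, not the paper's code.

```python
import numpy as np

def extract_trajectory(hidden_states: np.ndarray) -> np.ndarray:
    """Given per-layer activations of shape (num_layers, seq_len, d_model),
    return the L2-normalized final-token hidden state at every layer,
    shape (num_layers, d_model) -- one point per layer, i.e. a trajectory."""
    last_tok = hidden_states[:, -1, :]                 # final token at each layer
    norms = np.linalg.norm(last_tok, axis=-1, keepdims=True)
    return last_tok / np.maximum(norms, 1e-8)          # unit vector per layer

# Mock activations standing in for a frozen 32-layer model with d_model = 64.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(32, 10, 64))
traj = extract_trajectory(hidden)
print(traj.shape)  # (32, 64)
```

Stacking these trajectories across prompts and reducing with UMAP gives the constellation plots shown below.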

Constellation trajectories figure — 3 panels
Figure 2 of the paper. The same three prompts (benign "I am good" / harmful "How to kill a person") traced through three task framings — Analyse sentiment, Translate, Combined. The task keyword bends each trajectory onto its own constellation; the combined/mixed setting blurs the refusal/non-refusal direction. This is what "task identity > input content" looks like in the residual stream.
Task-specific constellation patterns (5 tasks)
Figure 9 of the paper — per-task constellations on LLaMA-3.1-8B. Cryptanalysis, RAG-QA and Rephrase (top row) produce essentially no over-refusal — only clean green target trajectories. Sentiment analysis and translation (bottom) show clear separation between target (green) and over-refusal (red) trajectories. Avg steering gap: 0.62 (sentiment) and 1.78 (translation).
Combined-task trajectories across layer groups
Figure 7 of the paper — combined-task trajectory evolution. When tasks are pooled together, the target/over-refusal separation becomes ambiguous across every layer group (early → final). This is the negative control for task-specificity: without per-task structure, steering has no clean direction to push along — motivating the per-task memory bank in §3.
UMAP constellation trajectories per task
Per-task UMAP trajectories on LLaMA-3.1-8B (layers 14–18). Each task maintains a tight trajectory; target (non-refusal) and over-refusal cases diverge within that trajectory. Produced as part of our ACL rebuttal pack.

This is the constellation hypothesis: the LLM first commits to a task region and only later decides behavior inside that region. We validate it quantitatively with three cluster-separation metrics.

Layer-wise silhouette score
Task identity dominates every layer. Cosine silhouette on LLaMA-3.1-8B is ≈ 0.33–0.40 for task-identity clusters across all 32 layers, while text-type and refusal-class clusters only separate later. Data: ACL-REBUTTAL/llama/silhouette_scores_all_layers.csv.
Three cluster-quality metrics
Three converging cluster-quality signals. Silhouette (higher is better), Davies–Bouldin (lower is better), and centroid distance (higher is better) all identify mid-to-late layers (L14–L20) as the sweet spot for separating target from over-refusal. Centroid distance spikes sharply at the final layer — consistent with our effectiveness-score results in the paper. Data: rebuttal_metrics_all_layers.csv.
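The cosine silhouette used in these plots can be hand-rolled in a few lines. The sketch below is a toy stand-in for `sklearn.metrics.silhouette_score(metric="cosine")` on synthetic clusters, not the paper's activation data; `cosine_silhouette` is a hypothetical helper name.

```python
import numpy as np

def cosine_silhouette(X: np.ndarray, labels: np.ndarray) -> float:
    """Mean silhouette score under cosine distance (1 - cosine similarity).
    For each point: a = mean intra-cluster distance, b = mean distance to the
    nearest other cluster, s = (b - a) / max(a, b)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    D = 1.0 - Xn @ Xn.T                          # pairwise cosine distances
    scores = []
    for i, li in enumerate(labels):
        same = (labels == li)
        same[i] = False                          # exclude the point itself
        if not same.any():
            continue                             # singleton cluster: undefined
        a = D[i, same].mean()
        b = min(D[i, labels == lj].mean()
                for lj in set(labels.tolist()) if lj != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated "task" clusters in cosine space.
rng = np.random.default_rng(1)
a = rng.normal(loc=[5, 0, 0], scale=0.1, size=(20, 3))
b = rng.normal(loc=[0, 5, 0], scale=0.1, size=(20, 3))
X = np.vstack([a, b])
labels = np.array([0] * 20 + [1] * 20)
print(round(cosine_silhouette(X, labels), 2))    # close to 1.0 when clusters separate
```

Run per layer on task-identity labels, this is the quantity plotted in the silhouette figures above.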

3. Method: SafeConstellations

Everything happens at inference time against a frozen model. There are two phases — a one-off offline phase that builds a small "Task Embeddings Store" and an online phase that steers a subset of layers per sample.

The four moves, in one line each:
  1. Build a memory bank of task-specific trajectories from training data (offline).
  2. Identify the task at inference by cosine similarity against those trajectories.
  3. Dynamically select the handful of layers where the activation lives closest to the refusal manifold.
  4. Steer representations toward the non-refusal pathway — gated by a confidence threshold τ = 0.85.

3.1 Offline: task-specific trajectories & steering vectors

For each benign task t and each layer ℓ, we compute two centroids from the training split:

ct, tar(ℓ) = mean hidden state over target (non-refusal) samples
ct, ref(ℓ) = mean hidden state over over-refusal samples
vt(ℓ) = ct, tar(ℓ) − ct, ref(ℓ)  (steering direction)

We rank layers by an effectiveness score Efft(ℓ) = ‖vt(ℓ)‖ / (σtar + σref + ε) — large separation, tight clusters — and keep the top K = 5 per task in the store 𝓜. Footprint: 847 MB for LLaMA-3.1-8B.
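The offline construction can be condensed into a short sketch. `build_memory_bank` is a hypothetical helper following the definitions above (centroids, steering vector, effectiveness score, top-K selection); the data here is synthetic, and the spread terms σ are taken as mean distances to the centroid.

```python
import numpy as np

def build_memory_bank(H_target, H_refusal, top_k=5, eps=1e-8):
    """Offline memory-bank sketch for one task.
    H_target / H_refusal: (n_samples, num_layers, d_model) hidden states for
    target (non-refusal) vs over-refusal training samples."""
    c_tar = H_target.mean(axis=0)                # (num_layers, d_model) centroids
    c_ref = H_refusal.mean(axis=0)
    v = c_tar - c_ref                            # per-layer steering direction
    # Effectiveness: large separation relative to cluster spread.
    sigma_tar = np.linalg.norm(H_target - c_tar, axis=-1).mean(axis=0)
    sigma_ref = np.linalg.norm(H_refusal - c_ref, axis=-1).mean(axis=0)
    eff = np.linalg.norm(v, axis=-1) / (sigma_tar + sigma_ref + eps)
    layers = np.argsort(eff)[::-1][:top_k]       # keep the top-K most effective layers
    return {int(l): {"c_tar": c_tar[l], "c_ref": c_ref[l], "v": v[l]} for l in layers}

# Synthetic demo: plant one cleanly separable layer (layer 5) among noise.
rng = np.random.default_rng(2)
H_tar = rng.normal(size=(50, 8, 16)) * 0.5
H_ref = rng.normal(size=(50, 8, 16)) * 0.5
H_tar[:, 5, 0] += 3.0
H_ref[:, 5, 0] -= 3.0
bank = build_memory_bank(H_tar, H_ref, top_k=5)
print(sorted(bank))                              # layer 5 ranks among the kept layers
```

The store 𝓜 is just this dictionary, built once per task over the training split.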

Algorithm 2 — Memory Bank Construction pseudocode
Algorithm 2 — Memory Bank Construction (offline). For each task, partition training samples into target vs over-refusal, compute centroids and the steering vector at every layer, rank by effectiveness, and store the top-K layers in 𝓜. A global fallback pattern is added over the union of all tasks, used when task detection is uncertain.

3.2 Online: detect → gate → dynamically select layers → steer

For each incoming prompt x ⊕ t:

  1. Task detection. Run the forward pass, compute a cos-similarity score against each stored task, pick the best. Confidence = score of the argmax.
  2. Confidence gate. If confidence < τ (we use τ = 0.85) or the detected task isn't in the benign set, skip steering entirely and let the base model handle it.
  3. Dynamic layer selection. Compute a potential Pot(ℓ) = ‖h(ℓ) − ctar(ℓ)‖ / (‖h(ℓ) − cref(ℓ)‖ + ε); keep the top K' = 4 layers where the activation lives closer to the refusal manifold.
  4. Adaptive intensity. Layer alignment LAlign(ℓ) ∈ [0, 1] measures how close the current activation is to the target; intensity λ(ℓ) = λ0 · (1 − LAlign(ℓ)) · Confidence · κ(ℓ) shrinks to zero as we approach the target manifold.
  5. Steer. Simple scaled addition on the residual stream:
h̃(ℓ) = h(ℓ) + λ(ℓ) · v(ℓ) / ‖v(ℓ)‖

No optimization, no gradient, no weight changes. Direction is static; magnitude adapts per sample per layer.
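The five online moves above can be condensed into one sketch. `steer`, `bank`, and `task_trajs` are hypothetical names; the depth weight κ(ℓ) is omitted for brevity, and the layer-alignment term here is a simple distance ratio standing in for the paper's Eq. 5. Synthetic centroids, not real activations.

```python
import numpy as np

TAU, K_PRIME, LAMBDA0 = 0.85, 4, 1.0             # gate / selection settings from the paper

def cos(a, b, eps=1e-8):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def steer(h, bank, task_trajs, eps=1e-8):
    """One inference step. h: (num_layers, d_model) current trajectory;
    bank: {task: {layer: {"c_tar", "c_ref", "v"}}}; task_trajs: {task: mean trajectory}."""
    # 1) Task detection: cosine similarity of flattened trajectories.
    scores = {t: cos(h.ravel(), traj.ravel()) for t, traj in task_trajs.items()}
    task = max(scores, key=scores.get)
    conf = scores[task]
    # 2) Confidence gate: below tau, hand the prompt to the base model untouched.
    if conf < TAU:
        return h, None
    entries = bank[task]
    # 3) Dynamic layer selection: steer where h sits closer to the refusal centroid.
    pot = {l: np.linalg.norm(h[l] - e["c_tar"]) / (np.linalg.norm(h[l] - e["c_ref"]) + eps)
           for l, e in entries.items()}
    chosen = sorted(pot, key=pot.get, reverse=True)[:K_PRIME]
    h = h.copy()
    for l in chosen:
        e = entries[l]
        # 4) Adaptive intensity: shrinks to zero as h approaches the target centroid.
        d_tar = np.linalg.norm(h[l] - e["c_tar"])
        d_ref = np.linalg.norm(h[l] - e["c_ref"])
        align = d_ref / (d_tar + d_ref + eps)    # in [0, 1]; 1 = already at target
        lam = LAMBDA0 * (1.0 - align) * conf
        # 5) Steer: scaled addition of the unit steering direction.
        h[l] = h[l] + lam * e["v"] / (np.linalg.norm(e["v"]) + eps)
    return h, task

# Toy demo: a "translate" task whose target centroid sits at +2 along dim 0,
# and a trajectory stuck near the refusal centroid.
L, D = 6, 8
c_tar, c_ref = np.zeros(D), np.zeros(D)
c_tar[0], c_ref[0] = 2.0, -2.0
bank = {"translate": {l: {"c_tar": c_tar, "c_ref": c_ref, "v": c_tar - c_ref}
                      for l in range(L)}}
h = np.tile(c_ref, (L, 1)) + 0.01
h2, task = steer(h, bank, {"translate": h})
print(task)  # translate
```

A mismatched trajectory (low cosine alignment with every stored task) fails the gate and comes back unmodified, which is the safety behavior §7 relies on.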

Algorithm 1 — SafeConstellations inference pseudocode
Algorithm 1 — the online inference procedure. Task detection → confidence gate → dynamic top-K′ layer selection (Eq. 7) → layer alignment (Eq. 5) → adaptive intensity → residual-stream update (Eq. 6). If confidence < τ the whole pipeline exits without touching the model.
Steering intensity schedule across layers
Adaptive steering schedule. Same prompt → very different intensity profiles depending on how far the residual stream is from the target centroid. Samples already aligned (LAlign ≈ 0.9) receive nearly zero push; deep refusal samples (LAlign ≈ 0.2) receive a strong, depth-weighted nudge.
UMAP before/after steering
Steering transformation in UMAP. The over-refusal cluster is pulled toward the target cluster's manifold while the target cluster remains approximately in place — this is what "constellation-aware steering" looks like in practice.

4. Benchmark: 1,047 prompts × 5 tasks × 9 text types

We construct a stratified benchmark for task-conditioned over-refusal evaluation. Every prompt is a task × text combination, drawn from Alpaca, XSTest, JailbreakBench, SaladBench, and a custom RAG-QA corpus. Safe content should never be refused; harmful content should always be refused. The whole point is to measure refusal conditional on benign intent.

Table 1 — dataset composition
Table 1 of the paper. 1,047 samples in total, five tasks × nine text types. Benign text (Alpaca, XSTest-safe, RAG-QA, encrypted-safe) must not be refused; harmful text (JailbreakBench, SaladBench) must be refused. Every cell is independently labeled "target" vs "over-refusal" vs "correct refusal".
Dataset composition (bar chart)
Same benchmark visualized as a bar chart — stratified 75/25 train/test split reused across every experiment in the paper.

5. Main results

We evaluate four model families on the benchmark before any mitigation:

Table 4 — task-specific over-refusal reduction
Table 4 of the paper. Translation on LLaMA-3.1-8B collapses from 46.7% → 8.9% (81.0% reduction). Sentiment: 36.4% → 18.2% (50%). Cryptanalysis on Qwen1.5-7B: 63.33% → 43.33% (29.4%). Only tasks that actually over-refuse on a given model are mitigated — by design.
Per-task reduction bar chart
Same numbers as a before/after bar chart for quick scanning.
Table 3 — qualitative analysis
Table 3 of the paper — qualitative analysis. Side-by-side outputs for the same benign prompt before and after SafeConstellations — the base model refuses ("I cannot assist with that") while the steered model completes the task correctly.
Paper over-refusal figure
Headline figure from the paper — over-refusal across all evaluated models and prompt types, before any mitigation is applied. SafeConstellations targets precisely the worst cells of this heatmap.

5.1 Mechanistic analysis — why it works

To see why task identity dominates, we visualize late-layer embeddings across every category of harm text at once. Two things are striking: (a) in any single late layer, points cluster by task rather than by content type, and (b) the centroid of each category traces a clean, direction-preserving trajectory across layers — which is exactly what our steering vectors exploit.

UMAP combined trajectories — text type / refusal class / task
Figure 4 of the paper. The same residual-stream embeddings visualized three ways — coloured by text type (left), by refusal class (middle), and by task (right). Task colouring produces cleanly separated regions; text-type colouring does not. That's the mechanistic evidence that LLMs organize the residual stream by task first, harm content second.
All layer embeddings across harm categories
All-layer UMAP across harm categories (Disinformation, Economic harm, Expert advice, Fraud/Deception, Government decision making, Harassment/Discrimination, Malware/Hacking, Physical harm, Privacy, Sexual/Adult). Even with ten content categories, the model's late-layer representation stays largely organized by task structure rather than by content.
Centroid trajectories across layers
Centroid trajectories across layers for each harm category. Every category's centroid walks a coherent path — the raw signal our steering vectors vt(ℓ) = ctar(ℓ) − cref(ℓ) compress into a usable direction.
Density evolution per label
Per-label density evolution across selected layers. Each row is a category; the mass shifts and sharpens as depth increases, confirming that discriminative structure emerges in mid-to-late layers — the band we target.
Combined density across layers
Combined density across late layers (L37–L40). Categories separate into distinct modes only in the final few layers — consistent with centroid-distance spiking at the final norm in our quantitative analysis.

Takeaway from the mechanistic section: the steering signal SafeConstellations relies on is not an artifact of one task — it shows up across every harm category we measured. The method is riding a genuine organizing principle of the model, not a dataset quirk.

6. What makes it work? (Ablation, Table 2)

Each SafeConstellations component is load-bearing. Stripping any one of them erases most of the gains; stripping two turns the method into a slightly worse version of prior fixed-layer steering.

Table 2 — ablation study
Table 2 of the paper. The canonical ablation numbers — full method at the top, each component removed in turn below. Dropping trajectory alignment costs ~10 reduction points; dropping task-specificity costs 7 MMLU points; dropping dynamic layer selection costs another 8–10 reduction points.
Ablation study bar chart
Same Table 2 rendered as a bar chart — full method (dark teal) wins on both LLaMA-3.1-8B and Qwen1.5-7B.
| Configuration | OR rate ↓ | Reduction ↑ | MMLU ↑ |
|---|---|---|---|
| SafeConstellations full — LLaMA-3.1-8B | 4.81% | 72.92% | 46.57 |
| SafeConstellations full — Qwen1.5-7B | 2.96% | 63.64% | 28.42 |
| + Late Layers (26–30) instead of dynamic | 6.29% | 64.58% | 46.57 |
| + Final Layer only | 5.92% | 66.67% | 46.57 |
| − Trajectory alignment | 6.64% | 62.50% | 46.57 |
| Fixed Layers (intense) — no task specificity | 7.03% | 60.42% | 43.66 |
| Fixed [15, 20, 25, 30] — weak baseline | 16.66% | 6.25% | 39.20 |
| LLaMA-3.1-8B base (unmodified) | 17.77% | — | 46.57 |
| Qwen1.5-7B base (unmodified) | 8.15% | — | 28.42 |

7. Does it bypass safety? (The confidence gate)

This is the most important question: if you can nudge representations away from refusal, can an attacker use the same mechanism to get past safety alignment? Short answer: no — adversarial inputs score below the confidence gate, so they are never steered in the first place.

Confidence gate distribution
The confidence gate is the safety valve. Jailbreak prompts, harmful instructions, and harmful responses consistently score below τ = 0.85 because they create mixed representational signals — part task, part adversarial — which reduce cosine alignment with the clean task centroids. They receive no steering and fall through entirely to the base model's safety mechanisms.
Empirical adversarial fallback. Of 48 over-refusal samples, the 7 that received no steering were precisely the samples with the most adversarial structure — jailbreak-type text wrapped in a benign task header. The confidence gate acts as a principled safety filter, not an afterthought.
Task silhouette heatmap
Task × layer silhouette heatmap (rebuttal pack). For tasks with no over-refusal (Cryptanalysis, RAG-QA, Rephrase on LLaMA) trajectories cluster tightly — steering is simply not invoked.

8. Is it fast enough to ship?

All steering is a per-layer scaled vector addition on the residual stream. Task detection is a single forward pass followed by 5 · |𝒯| cosine similarities. The practical overhead is dominated by the task detection pass, ~0.2 s per response.

Latency comparison
Latency overhead. Long responses (> 200 tokens) are dominated by decoding — the ≈ 0.2 s hook overhead is invisible. Short responses feel essentially the same. Task Embedding store adds 847 MB for LLaMA-3.1-8B.

9. Takeaways

  - Over-refusal is task-conditional, not global: each task traces its own "constellation" in the residual stream, with distinct target and over-refusal sub-trajectories.
  - Steering per task on a few dynamically selected mid-to-late layers cuts over-refusal by up to 73% with no MMLU change, no training, and no weight updates.
  - The confidence gate (τ = 0.85) keeps the mechanism from becoming a jailbreak vector: adversarial inputs score below threshold and fall through to the base model's safety behavior.
  - Practical cost is ~0.2 s per response plus an 847 MB task-embedding store.

10. Citation

If you use SafeConstellations or the benchmark in your work, please cite:

@inproceedings{maskey2026safeconstellations,
  title     = {SafeConstellations: Mitigating Over-Refusals in LLMs Through
               Task-Aware Representation Steering},
  author    = {Maskey, Utsav and Yadav, Sumit and Dras, Mark and Naseem, Usman},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for
               Computational Linguistics (ACL)},
  year      = {2026},
  note      = {arXiv:2508.11290}
}

Links: arXiv · PDF · Anonymous code mirror · Benchmark dataset