TL;DR
LLMs refuse benign requests just because the text looks dangerous. SafeConstellations turns refusal mitigation into a representation-engineering problem: we discover that each NLP task traces a stable "constellation" in the residual stream, and the refusal vs non-refusal variants of each task form distinct sub-trajectories around it. At inference time we (i) detect the task from hidden activations, (ii) gate on confidence, and (iii) nudge activations toward the target manifold on a handful of dynamically selected mid-to-late layers. No fine-tuning, no weight updates, ~0.2 s overhead, and up to 73% reduction in over-refusal with no utility loss on MMLU.
1. The over-refusal problem
The pathological case: you ask "Analyze sentiment: How to kill a process" and the model refuses — because the safety mechanism fires on the word kill, not on what you actually asked it to do.
(From the paper thread.)
Safety alignment in modern LLMs is powerful but imprecise. Models have learned to refuse on lexical cues rather than task intent. A customer-support sentiment pipeline gets a review saying "This product killed my productivity" and the model refuses. A translator is handed "How can I kill a Python process?" and returns a safety disclaimer instead of a translation.
Prior work either rewrites prompts (FalseReject, PORPOR), edits preferences (DPO, GRPO on over-refusal data), or steers behavior with a single global direction (Jailbreak Antidote). None of them explicitly model the fact that different tasks live in different regions of the residual stream, and that over-refusal is a task-conditional phenomenon. Our headline claim:
Over-refusal is not a global safety bug — it's a task-specific trajectory defect. If you steer per task, you can fix it surgically.
2. Task identity > input content
We run prompts through a frozen LLM and record the normalized hidden state of the final token at every transformer layer. Plot those trajectories (reduced by UMAP) and one pattern jumps out — the model organizes the residual stream first by task, and only later by behavior:
- Prompts with the same intended task — regardless of whether they were refused or answered — follow similar, consistent paths across layers.
- Within each task, refusal vs non-refusal paths split into two visible sub-trajectories.
- When tasks are mixed together, the refusal/non-refusal distinction becomes noisy. Identity first, behavior second.
This is the constellation hypothesis: the LLM first commits to a task region and only later decides behavior inside that region. We validate it quantitatively with three cluster-separation metrics.
ACL-REBUTTAL/llama/silhouette_scores_all_layers.csv.
rebuttal_metrics_all_layers.csv.3. Method: SafeConstellations
Everything happens at inference time against a frozen model. There are two phases — a one-off offline phase that builds a small "Task Embeddings Store" and an online phase that steers a subset of layers per sample.
- Build a memory bank of task-specific trajectories from training data (offline).
- Identify the task at inference by cosine similarity against those trajectories.
- Dynamically select the handful of layers where the activation lives closest to the refusal manifold.
- Steer representations toward the non-refusal pathway — gated by a confidence threshold τ = 0.85.
3.1 Offline: task-specific trajectories & steering vectors
For each benign task t and each layer ℓ, we compute two centroids from the training split:
ct, ref(ℓ) = mean hidden state over over-refusal samples
vt(ℓ) = ct, tar(ℓ) − ct, ref(ℓ) (steering direction)
We rank layers by an effectiveness score Efft(ℓ) = ‖vt(ℓ)‖ / (σtar + σref + ε) — large separation, tight clusters — and keep the top K = 5 per task in the store 𝓜. Footprint: 847 MB for LLaMA-3.1-8B.
3.2 Online: detect → gate → dynamically select layers → steer
For each incoming prompt x ⊕ t:
- Task detection. Run the forward pass, compute a cos-similarity score against each stored task, pick the best. Confidence = score of the argmax.
- Confidence gate. If confidence < τ (we use τ = 0.85) or the detected task isn't in the benign set, skip steering entirely and let the base model handle it.
- Dynamic layer selection. Compute a potential Pot(ℓ) = ‖h(ℓ) − ctar(ℓ)‖ / (‖h(ℓ) − cref(ℓ)‖ + ε); keep the top K' = 4 layers where the activation lives closer to the refusal manifold.
- Adaptive intensity. Layer alignment LAlign(ℓ) ∈ [0, 1] measures how close the current activation is to the target; intensity λ(ℓ) = λ0 · (1 − LAlign(ℓ)) · Confidence · κ(ℓ) shrinks to zero as we approach the target manifold.
- Steer. Simple scaled addition on the residual stream:
No optimization, no gradient, no weight changes. Direction is static; magnitude adapts per sample per layer.
4. Benchmark: 1,047 prompts × 5 tasks × 9 text types
We stratify a new benchmark for task-conditioned over-refusal evaluation. Every prompt is a task × text combination, drawn from Alpaca, XSTest, JailbreakBench, SaladBench, and a custom RAG-QA corpus. Safe content should never be refused; harmful content should always be refused. The whole point is to measure refusal conditional on benign intent.
5. Main results
We evaluate four model families on the benchmark before any mitigation:
- LLaMA-3.1-8B — the worst offender. Translation: 46.7% over-refusal, Sentiment: 36.4%.
- GPT-4o — mostly fine, but significant over-refusal on low-resource-language translation (~36.7%).
- Qwen1.5-7B — moderate (Cryptanalysis: 63.3%).
- Claude-4-Sonnet — strongly cautious but almost never over-refuses benign tasks.
5.1 Mechanistic analysis — why it works
To see why task identity dominates, we visualize late-layer embeddings across every category of harm text at once. Two things are striking: (a) in any single late layer, points cluster by task rather than by content type, and (b) the centroid of each category traces a clean, direction-preserving trajectory across layers — which is exactly what our steering vectors exploit.
Takeaway from the mechanistic section: the steering signal SafeConstellations relies on is not an artifact of one task — it shows up across every harm category we measured. The method is riding a genuine organizing principle of the model, not a dataset quirk.
6. What makes it work? (Ablation, Table 2)
Each SafeConstellations component is load-bearing. Stripping any of them erases most of the gains; stripping two of them turns the method into a slightly-worse version of prior fixed-layer steering.
| Configuration | OR rate ↓ | Reduction ↑ | MMLU ↑ |
|---|---|---|---|
| SafeConstellations full — LLaMA-3.1-8B | 4.81% | 72.92% | 46.57 |
| SafeConstellations full — Qwen1.5-7B | 2.96% | 63.64% | 28.42 |
| + Late Layers (26–30) instead of dynamic | 6.29% | 64.58% | 46.57 |
| + Final Layer only | 5.92% | 66.67% | 46.57 |
| − Trajectory alignment | 6.64% | 62.50% | 46.57 |
| Fixed Layers (intense) — no task specificity | 7.03% | 60.42% | 43.66 |
| Fixed [15, 20, 25, 30] — weak baseline | 16.66% | 6.25% | 39.20 |
| LLaMA-3.1-8B base (unmodified) | 17.77% | — | 46.57 |
| Qwen1.5-7B base (unmodified) | 8.15% | — | 28.42 |
7. Does it bypass safety? (The confidence gate)
This is the most important question: if you can nudge representations away from refusal, can an attacker use the same mechanism to get past the safety alignment? Short answer: no, because the confidence gate fires before any steering happens.
8. Is it fast enough to ship?
All steering is a per-layer scaled vector addition on the residual stream. Task detection is a single forward pass followed by 5 · |𝒯| cosine similarities. The practical overhead is dominated by the task detection pass, ~0.2 s per response.
9. Takeaways
- Tasks live in different regions of the residual stream. That is the organizing principle of modern LLMs, not safety.
- Refusal is a late-layer sub-trajectory within each task region, not a global mode. Steer locally, gate globally.
- Inference-time representation engineering with a safety gate is a cheap, principled, training-free way to fix over-refusal without hurting utility.
- Dynamic layer selection beats fixed layer schedules in every ablation — by 8–10 points of OR reduction.
10. Citation
If you use SafeConstellations or the benchmark in your work, please cite:
@inproceedings{maskey2026safeconstellations,
title = {SafeConstellations: Mitigating Over-Refusals in LLMs Through
Task-Aware Representation Steering},
author = {Maskey, Utsav and Yadav, Sumit and Dras, Mark and Naseem, Usman},
booktitle = {Proceedings of the 64th Annual Meeting of the Association for
Computational Linguistics (ACL)},
year = {2026},
note = {arXiv:2508.11290}
}Links: arXiv · PDF · Anonymous code mirror · Benchmark dataset