SafeConstellations: Mitigating Over-Refusals in LLMs Through Task-Aware Representation Steering (ACL 2026)
A deep-dive companion blog for SafeConstellations, accepted at ACL 2026 Main. SafeConstellations is an inference-time method that reduces LLM over-refusal by up to 73% by steering task-specific trajectories in the residual stream — no fine-tuning, ~0.2 s of overhead, and zero utility loss on MMLU.
Read the full interactive blog →
TL;DR
LLMs refuse benign requests just because the text looks dangerous ("Analyze sentiment: How to kill a process" → refusal). SafeConstellations turns refusal mitigation into a representation-engineering problem:
- Each NLP task traces a stable constellation in the residual stream.
- Refusal vs non-refusal variants form distinct sub-trajectories within each task.
- At inference we detect the task, gate on confidence (τ = 0.85), and nudge activations toward the target manifold on a handful of dynamically selected mid-to-late layers.
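The three steps above can be sketched in a few lines. This is an illustrative numpy stand-in, not the paper's code: `detect_task`, `steer`, `alpha`, and the toy centroids are all assumptions; only the τ = 0.85 gate comes from the post.

```python
import numpy as np

TAU = 0.85  # confidence gate from the post; everything else below is assumed

def detect_task(h, task_centroids):
    """Cosine-similarity task detection against a memory bank of task centroids."""
    h_norm = h / np.linalg.norm(h)
    sims = {t: float(h_norm @ (c / np.linalg.norm(c)))
            for t, c in task_centroids.items()}
    task = max(sims, key=sims.get)
    return task, sims[task]

def steer(h, task_centroids, directions, alpha=0.5):
    """Nudge a hidden state toward the task's non-refusal sub-trajectory,
    but only when task detection clears the confidence gate."""
    task, conf = detect_task(h, task_centroids)
    if conf < TAU:
        return h  # ambiguous input: leave the activation untouched
    return h + alpha * directions[task]  # move toward the target manifold
```

In the actual method this logic would run inside forward hooks on the dynamically selected mid-to-late layers; the sketch only shows the gate-then-nudge step on a single vector.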
Headline Numbers
| Metric | Value |
|---|---|
| Max over-refusal reduction | 73% |
| MMLU utility | 46.57 → 46.57 (no loss) |
| Added latency per response | ~0.2 s |
| Training / weight updates | 0 |
Paper
- Title: SafeConstellations: Mitigating Over-Refusals in LLMs Through Task-Aware Representation Steering
- Authors: Utsav Maskey, Sumit Yadav, Mark Dras, Usman Naseem
- Venue: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), Main Conference
- arXiv: 2508.11290
- Thread: @Rocker_Ritesh on X
What's Inside the Blog
The full interactive write-up at /study/stellar-steering/ walks through:
- Motivation — why LLMs over-refuse on task keywords like "kill", "exploit", "crack"
- Constellation hypothesis — task identity dominates the residual stream; behavior is a sub-structure
- Method — offline task-embedding memory bank + online task detection + dynamic layer selection + confidence-gated steering
- Results — 270-sample test split, LLaMA-3.1-8B and Qwen1.5-7B, per-task reductions
- Ablations — which components matter (confidence gate, layer selection, per-task vs global direction)
- Safety — no weakening of harmful-request refusal; utility preserved
- Latency — ~0.2s per response, 847 MB memory bank footprint
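The offline half of the pipeline above — building the task-embedding memory bank and picking steering layers — can be sketched as follows. All names are illustrative assumptions; in particular, restricting the search to the later half of the stack is our reading of "mid-to-late layers", not the paper's exact rule.

```python
import numpy as np

def build_memory_bank(hidden_states_by_task):
    """Average hidden states per task into one centroid each.
    hidden_states_by_task: {task: array of shape (n_examples, d_model)}."""
    return {task: hs.mean(axis=0) for task, hs in hidden_states_by_task.items()}

def select_layers(refusal_acts, benign_acts, k=3):
    """Rank layers by how far apart the refusal and non-refusal variants sit,
    and keep the top-k among the later half of the stack (an assumption).
    refusal_acts, benign_acts: arrays of shape (n_layers, d_model)."""
    gaps = np.linalg.norm(refusal_acts - benign_acts, axis=1)
    mid = refusal_acts.shape[0] // 2
    later = np.arange(mid, refusal_acts.shape[0])
    top = later[np.argsort(gaps[later])[::-1][:k]]
    return sorted(top.tolist())
```

At inference, the online side only needs the centroids and the selected layer indices, which is what keeps the per-response overhead small.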
Why This Matters
Over-refusal is not a global safety bug — it's a task-specific trajectory defect. Steer per task, and you fix it surgically. No DPO, no GRPO, no prompt rewriting, no fine-tuning. Just activation geometry.