Taming Long Context in LLMs: 3 Problems, 1 Cohesive Set of Strategies

In practical LLM systems (QA, analytics, assistants, agentic RAG), three phenomena routinely degrade quality: (1) Lost in the Middle: accuracy drops when the key evidence sits in the middle of a long prompt; (2) Distracting prompts: a few "tempting" but irrelevant sentences derail the reasoning; (3) Very large contexts: despite advertised 32k+ windows, results and stability degrade as the prompt grows. Below: why this happens, what works right now, what to implement in the model and pipeline, and how to measure it rigorously.


TL;DR for the impatient

  • Instead of stuffing everything into the prompt: retrieval → cross-encoder reranking → compression → extreme ordering (the most important passages at the beginning and the end); see the reranking sketch after this list.
  • Limit distraction with a simple instruction plus a fixed answer format, few-shot examples that include "noise," self-consistency, and gating/abstention (NO-RESPONSE) at the passage level; a voting sketch follows below.
  • Stabilise long context via position scaling (LongRoPE/YaRN), a training regime for long sequences (ProLong), test-time adaptation (LIFT), and streaming attention with sink tokens and/or external memory; see the cache-eviction sketch below.
  • Measure smartly: not only "needle in a haystack." Use RULER/ONERULER (also multilingual), multi-needle tests, and real tasks with source citation; a multi-needle probe closes this section.
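
A minimal sketch of the first bullet, assuming the sentence-transformers package and its public `CrossEncoder.predict` API; the checkpoint name is one common choice, and `retrieved_passages` is a placeholder for your retriever's output.

```python
from sentence_transformers import CrossEncoder

def rerank_and_reorder(query: str, passages: list[str], top_k: int = 8) -> list[str]:
    """Score passages with a cross-encoder, keep the top_k, and place the
    strongest evidence at the edges of the context (extreme ordering)."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, p) for p in passages])
    ranked = [p for _, p in sorted(zip(scores, passages), key=lambda x: -x[0])[:top_k]]

    # Alternate the best passages between the front and the back of the list,
    # leaving the weakest evidence in the middle, where recall is worst.
    front, back = [], []
    for i, p in enumerate(ranked):
        (front if i % 2 == 0 else back).append(p)
    return front + back[::-1]

# context = "\n\n".join(rerank_and_reorder(question, retrieved_passages))
```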
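
For the second bullet, a sketch of passage-level gating plus self-consistency; `generate` is a hypothetical stand-in for whatever LLM client you use, and the prompt wording is illustrative.

```python
from collections import Counter

PROMPT = """Answer using ONLY the passage below. If the passage does not
contain the answer, reply exactly NO-RESPONSE.

Passage: {passage}
Question: {question}
Answer:"""

def generate(prompt: str, temperature: float = 0.7) -> str:
    raise NotImplementedError  # hypothetical hook: plug in your LLM client here

def self_consistent_answer(question: str, passages: list[str], samples: int = 5) -> str:
    """Gate each passage with NO-RESPONSE, then majority-vote the survivors."""
    votes = Counter()
    for passage in passages:
        for _ in range(samples):  # sample at temperature > 0 for self-consistency
            answer = generate(PROMPT.format(passage=passage, question=question)).strip()
            if answer != "NO-RESPONSE":  # gate: drop abstentions before voting
                votes[answer] += 1
    return votes.most_common(1)[0][0] if votes else "NO-RESPONSE"
```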
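
For the streaming-attention point in the third bullet, the sink-token idea reduces to a KV-cache eviction rule: always keep the first few tokens (the "sinks") plus a recency window. The sizes below are illustrative, not any library's defaults.

```python
def kept_positions(seq_len: int, n_sink: int = 4, window: int = 1020) -> list[int]:
    """Positions to retain in the KV cache under a sink + sliding-window policy.
    The first n_sink tokens act as attention sinks; the rest is a recency window."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

# At step 5000 with a 1024-slot budget, we keep tokens 0-3 and 3980-4999.
print(kept_positions(5000)[:6])  # [0, 1, 2, 3, 3980, 3981]
```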
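
Finally, for the last bullet, a toy multi-needle probe in the spirit of RULER-style tests; the filler text, needle format, and `ask_model` hook are assumptions for illustration, not a real benchmark.

```python
import random

def build_haystack(needles: dict[str, str], n_fillers: int = 200, seed: int = 0) -> str:
    """Scatter several key/value 'needles' at random depths in filler text,
    so recall can be checked at positions other than the very start or end."""
    rng = random.Random(seed)
    lines = [f"Filler sentence number {i} with no useful content." for i in range(n_fillers)]
    for key, value in needles.items():
        lines.insert(rng.randrange(len(lines)), f"The secret code for {key} is {value}.")
    return "\n".join(lines)

def score(answer: str, needles: dict[str, str]) -> float:
    """Fraction of needle values the model reproduced (exact substring match)."""
    return sum(v in answer for v in needles.values()) / len(needles)

needles = {"alpha": "4711", "beta": "0815", "gamma": "1337"}
haystack = build_haystack(needles)
# answer = ask_model(haystack + "\n\nList every secret code.")  # hypothetical hook
# print(score(answer, needles))
```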
