Taming Long Context in LLMs: 3 Problems, 1 Cohesive Set of Strategies

In practical LLM systems (QA, analytics, assistants, agentic RAG), three phenomena routinely degrade quality: (1) Lost in the Middle: accuracy drops when the key evidence sits in the middle of a long prompt; (2) Distracting prompts: a few "tempting" but irrelevant sentences derail the reasoning; (3) Very large contexts: despite advertised 32k+ windows, results and stability degrade as the prompt grows. Below: why this happens, what works right now, what to implement in the model and pipeline, and how to measure it rigorously.


TL;DR for the impatient

  • Instead of stuffing everything into the prompt: retrieval → cross-encoder reranking → compression → extreme ordering (the most important passages at the beginning and the end); see the reranking sketch after this list.
  • Limit distraction with a simple instruction plus a fixed answer format, few-shot examples that include "noise," self-consistency, and gating/abstention (NO-RESPONSE) at the passage level; a voting sketch follows below.
  • Stabilise long context via position scaling (LongRoPE/YaRN), a training regime for long sequences (ProLong), test-time adaptation (LIFT), and streaming attention with sink tokens and/or external memory; see the cache-eviction sketch below.
  • Measure smartly: not only "needle in a haystack." Use RULER/ONERULER (also multilingual), multi-needle tests, and real tasks with source citation; a multi-needle probe closes this section.
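
A minimal sketch of the first bullet, assuming the sentence-transformers package and its public `CrossEncoder.predict` API; the checkpoint name is one common choice, and `retrieved_passages` is a placeholder for your retriever's output.

```python
from sentence_transformers import CrossEncoder

def rerank_and_reorder(query: str, passages: list[str], top_k: int = 8) -> list[str]:
    """Score passages with a cross-encoder, keep the top_k, and place the
    strongest evidence at the edges of the context (extreme ordering)."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, p) for p in passages])
    ranked = [p for _, p in sorted(zip(scores, passages), key=lambda x: -x[0])[:top_k]]

    # Alternate the best passages between the front and the back of the list,
    # leaving the weakest evidence in the middle, where recall is worst.
    front, back = [], []
    for i, p in enumerate(ranked):
        (front if i % 2 == 0 else back).append(p)
    return front + back[::-1]

# context = "\n\n".join(rerank_and_reorder(question, retrieved_passages))
```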
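
For the second bullet, a sketch of passage-level gating plus self-consistency; `generate` is a hypothetical stand-in for whatever LLM client you use, and the prompt wording is illustrative.

```python
from collections import Counter

PROMPT = """Answer using ONLY the passage below. If the passage does not
contain the answer, reply exactly NO-RESPONSE.

Passage: {passage}
Question: {question}
Answer:"""

def generate(prompt: str, temperature: float = 0.7) -> str:
    raise NotImplementedError  # hypothetical hook: plug in your LLM client here

def self_consistent_answer(question: str, passages: list[str], samples: int = 5) -> str:
    """Gate each passage with NO-RESPONSE, then majority-vote the survivors."""
    votes = Counter()
    for passage in passages:
        for _ in range(samples):  # sample at temperature > 0 for self-consistency
            answer = generate(PROMPT.format(passage=passage, question=question)).strip()
            if answer != "NO-RESPONSE":  # gate: drop abstentions before voting
                votes[answer] += 1
    return votes.most_common(1)[0][0] if votes else "NO-RESPONSE"
```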
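
For the streaming-attention point in the third bullet, the sink-token idea reduces to a KV-cache eviction rule: always keep the first few tokens (the "sinks") plus a recency window. The sizes below are illustrative, not any library's defaults.

```python
def kept_positions(seq_len: int, n_sink: int = 4, window: int = 1020) -> list[int]:
    """Positions to retain in the KV cache under a sink + sliding-window policy.
    The first n_sink tokens act as attention sinks; the rest is a recency window."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

# At step 5000 with a 1024-slot budget, we keep tokens 0-3 and 3980-4999.
print(kept_positions(5000)[:6])  # [0, 1, 2, 3, 3980, 3981]
```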
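
Finally, for the last bullet, a toy multi-needle probe in the spirit of RULER-style tests; the filler text, needle format, and `ask_model` hook are assumptions for illustration, not a real benchmark.

```python
import random

def build_haystack(needles: dict[str, str], n_fillers: int = 200, seed: int = 0) -> str:
    """Scatter several key/value 'needles' at random depths in filler text,
    so recall can be checked at positions other than the very start or end."""
    rng = random.Random(seed)
    lines = [f"Filler sentence number {i} with no useful content." for i in range(n_fillers)]
    for key, value in needles.items():
        lines.insert(rng.randrange(len(lines)), f"The secret code for {key} is {value}.")
    return "\n".join(lines)

def score(answer: str, needles: dict[str, str]) -> float:
    """Fraction of needle values the model reproduced (exact substring match)."""
    return sum(v in answer for v in needles.values()) / len(needles)

needles = {"alpha": "4711", "beta": "0815", "gamma": "1337"}
haystack = build_haystack(needles)
# answer = ask_model(haystack + "\n\nList every secret code.")  # hypothetical hook
# print(score(answer, needles))
```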
