In practical LLM systems (QA, analytics, assistants, agentic RAG), three phenomena routinely degrade quality: (1) Lost in the Middle: accuracy drops when the key evidence sits in the middle of a long prompt; (2) Distracting prompts: a few "tempting" sentences derail the reasoning; (3) Very large contexts: despite advertised 32k+ windows, results and stability degrade. Below: why this happens, what works right now, what to implement in the model and pipeline, and how to measure it rigorously.
TL;DR for the impatient
- Instead of stuffing everything into the prompt: retrieval → cross-encoder reranking → compression → ordering at the extremes, with the most important passages placed at the beginning and end of the context (first sketch after this list).
- Limit distraction with a clear instruction plus a fixed answer format, few-shot examples that include noise, self-consistency, and passage-level gating/abstention (NO-RESPONSE); see the gating sketch below.
- Stabilise long context via position scaling (LongRoPE/YaRN), a training regime for long sequences (ProLong), test-time adaptation (LIFT), and streaming attention with sink tokens and/or external memory; the toy position-scaling example below shows the core idea.
- Measure smartly: not only "needle-in-a-haystack." Use RULER/ONERULER (also multilingual), multi-needle tests, and real tasks with source citation; a minimal multi-needle generator is sketched below.
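
A minimal sketch of the last step in that chain, the edge ordering: passages already scored by a cross-encoder are interleaved so the strongest land at the beginning and end of the context and the weakest in the middle. The function name and the (passage, score) tuple format are illustrative, not tied to any particular library.

```python
from typing import List, Tuple

def order_for_edges(passages_scored: List[Tuple[str, float]]) -> List[str]:
    """Place the highest-scoring passages at the edges of the context.

    Ranks 1, 3, 5, ... go to the front; ranks 2, 4, 6, ... go to the back
    (in reverse), so the weakest passages end up in the middle, where
    "lost in the middle" hurts the least.
    """
    ranked = [p for p, _ in sorted(passages_scored, key=lambda x: x[1], reverse=True)]
    front, back = [], []
    for i, passage in enumerate(ranked):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

# Example: rank order A > B > C > D becomes A, C, D, B
print(order_for_edges([("A", 0.9), ("B", 0.7), ("C", 0.5), ("D", 0.3)]))
```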
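
A sketch of passage-level gating combined with self-consistency, assuming a generic generate(prompt) callable for the model call and NO-RESPONSE as the abstention marker; the prompt wording is illustrative.

```python
from collections import Counter
from typing import Callable, List

ANSWER_PROMPT = """Answer using only the passage below.
If the passage does not contain the answer, reply exactly NO-RESPONSE.

Passage:
{passage}

Question: {question}
Answer:"""

def answer_with_gating(
    question: str,
    passages: List[str],
    generate: Callable[[str], str],  # any LLM call: prompt -> completion (sampled, temperature > 0)
    n_samples: int = 5,
) -> str:
    """Query each passage separately, discard NO-RESPONSE abstentions,
    and aggregate the remaining answers by majority vote (self-consistency)."""
    votes: Counter = Counter()
    for passage in passages:
        prompt = ANSWER_PROMPT.format(passage=passage, question=question)
        for _ in range(n_samples):
            answer = generate(prompt).strip()
            if answer and answer != "NO-RESPONSE":
                votes[answer] += 1
    return votes.most_common(1)[0][0] if votes else "NO-RESPONSE"
```

Querying passages one at a time keeps distractors out of each other's context; the vote then rewards the answer that survives across passages and samples.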
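
A toy illustration of the idea behind position scaling: rotary angles are the product of position and frequency, so compressing positions beyond the training window back into the trained range keeps the angles in-distribution. This shows plain linear interpolation only; LongRoPE and YaRN rescale per frequency band (and YaRN also adjusts attention temperature), so treat it as intuition, not their implementation.

```python
import numpy as np

def rope_inv_freq(dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies for a head dimension `dim`."""
    return 1.0 / (base ** (np.arange(0, dim, 2) / dim))

def scaled_positions(seq_len: int, train_len: int = 4096) -> np.ndarray:
    """Linear position interpolation: compress positions so the largest
    rotary angle never exceeds what the model saw during training."""
    factor = max(1.0, seq_len / train_len)
    return np.arange(seq_len) / factor

# The angle for position p and inverse frequency f is p * f; with scaling it
# becomes (p / factor) * f, so a 16k context reuses the angle range of a 4k one.
angles = np.outer(scaled_positions(16_384), rope_inv_freq(128))
print(angles.shape)  # (16384, 64)
```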
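
Finally, a sketch of a multi-needle test generator in the spirit of RULER-style evaluations, not the official harness: the filler text, needle template, and verbatim-match scoring are simplifications, and "tokens" are approximated by word count.

```python
import random
from typing import Dict, List

FILLER = "The sky was grey and the meeting ran long. "  # neutral padding sentence

def build_multi_needle_case(
    needles: Dict[str, str],  # e.g. {"code-alpha": "7319", "code-beta": "4102"}
    target_tokens: int = 8000,
    seed: int = 0,
) -> dict:
    """Scatter several facts ("needles") at random depths in filler text;
    the task is to retrieve all of them, not just one."""
    rng = random.Random(seed)
    n_fillers = max(1, target_tokens // len(FILLER.split()))
    chunks: List[str] = [FILLER] * n_fillers
    for key, value in needles.items():
        chunks.insert(rng.randrange(len(chunks)), f"The secret value for {key} is {value}. ")
    question = "List the secret value for each of: " + ", ".join(needles) + "."
    return {"context": "".join(chunks), "question": question, "expected": needles}

def score(response: str, expected: Dict[str, str]) -> float:
    """Fraction of needles whose value appears verbatim in the model's response."""
    return sum(v in response for v in expected.values()) / len(expected)

case = build_multi_needle_case({"code-alpha": "7319", "code-beta": "4102"})
print(len(case["context"].split()), case["question"])
```

Run cases across several context lengths and needle depths and report per-depth accuracy rather than a single average: the interesting failures hide in the middle positions and at the longest lengths.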