LLMs can now handle up to 1 million tokens of context. But can they actually do something useful with all that text? We put this to the test with one of the most demanding real-world tasks: legal case summarization.
A single legal case can span dozens of documents—100K to 500K tokens in total. Summarizing these cases requires understanding complex relationships, tracking multiple parties, and capturing both common details (like filing dates) and rare but critical information (like settlement terms).
We introduce GAVEL-REF, an evaluation framework that goes beyond simple aggregate scores. It uses a 26-item checklist to evaluate summaries on specific factual elements, plus separate assessments for residual facts and writing style. We evaluated 12 frontier LLMs on 100 legal cases from 2025, ranging from 32K to 512K tokens.
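To make the checklist scoring concrete, here is a minimal sketch of how a checklist-based score could be computed, assuming a per-item judge and simple weighting. The item names, the string-matching judge, and the weighting scheme are hypothetical stand-ins, not GAVEL-REF's actual prompts or scoring rules.

```python
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    name: str            # e.g. "filing_date", "settlement_terms" (hypothetical names)
    reference: str       # value drawn from the human-written reference summary
    weight: float = 1.0  # rare-but-critical items could carry a higher weight

def judge_item(item: ChecklistItem, summary: str) -> float:
    """Return 1.0 if the candidate summary covers the reference value, else 0.0.
    Simple substring matching stands in for what would in practice be an LLM judge."""
    return 1.0 if item.reference.lower() in summary.lower() else 0.0

def checklist_score(items: list[ChecklistItem], summary: str) -> float:
    """Weighted percentage of checklist items covered by the candidate summary."""
    total = sum(i.weight for i in items)
    covered = sum(i.weight * judge_item(i, summary) for i in items)
    return 100.0 * covered / total if total else 0.0

items = [
    ChecklistItem("filing_date", "March 3, 2021"),
    ChecklistItem("settlement_terms", "$1.2M consent decree", weight=2.0),
]
print(checklist_score(items, "The suit was filed on March 3, 2021."))  # 33.3
```

Item-level scores like these are what allow the analysis below to separate common fields from rare ones, rather than collapsing everything into a single aggregate number.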
The results? Even the best model, Gemini 2.5 Pro, achieves only around 50 on S_Gavel-Ref. Models handle simple items well but struggle with multi-value fields and rare case elements.
Looking ahead to a point where LLM outputs may surpass human-written references, we also developed GAVEL-AGENT, an autonomous agent that navigates case documents with six specialized tools to extract checklist items directly. With Qwen3, it reduces token usage by 36% with only a 7% drop in S_checklist compared to GPT-4.1.
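To illustrate the agent's control flow, here is a minimal sketch of a tool-dispatch loop under assumed tool names and a placeholder planner; GAVEL-AGENT's actual six tools, prompts, and stopping criteria are not reproduced here.

```python
from typing import Callable

# Toy stand-ins for document-navigation tools (hypothetical, not GAVEL-AGENT's real tool set).
TOOLS: dict[str, Callable[..., str]] = {
    "list_documents": lambda case: "complaint.pdf, docket.txt, settlement.pdf",
    "read_document": lambda case, name: f"<contents of {name} in case {case}>",
    "search_case": lambda case, query: f"<passages in case {case} matching '{query}'>",
}

def plan_next_action(transcript: list[str], checklist: list[str], extracted: dict[str, str]):
    """Placeholder planner: a real agent would prompt an LLM with the transcript,
    the remaining checklist items, and the available tools, then parse its reply."""
    return "finish", ()

def run_agent(case_id: str, checklist: list[str], max_steps: int = 20) -> dict[str, str]:
    """Repeatedly pick a tool, execute it, and feed the observation back into the
    transcript until the planner decides the checklist items are extracted."""
    extracted: dict[str, str] = {}
    transcript: list[str] = []
    for _ in range(max_steps):
        action, args = plan_next_action(transcript, checklist, extracted)
        if action == "finish":
            break
        observation = TOOLS[action](case_id, *args)
        transcript.append(f"{action}{args} -> {observation}")
    return extracted
```

Because the agent reads only the passages its tools return instead of the full case record, its token usage scales with what it chooses to retrieve, which is where the savings over full-context summarization come from.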