Gavel

Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization

Yao Dou, Wei Xu
Georgia Institute of Technology
Paper · Data and Code

GAVEL-REF Framework
Our GAVEL-REF framework for evaluating long-context summarization, featuring checklist evaluation (supporting both string-wise and list-wise comparisons), residual fact evaluation, and writing-style evaluation. An interesting finding: many modern LLMs tend to omit specific names of people or organizations. Light green indicates matched values.

About

LLMs can now handle up to 1 million tokens of context. But can they actually do something useful with all that text? We put this to the test with one of the most demanding real-world tasks: legal case summarization.

A single legal case can span dozens of documents—100K to 500K tokens in total. Summarizing these cases requires understanding complex relationships, tracking multiple parties, and capturing both common details (like filing dates) and rare but critical information (like settlement terms).

We introduce GAVEL-REF, an evaluation framework that goes beyond simple aggregate scores. It uses a 26-item checklist to evaluate summaries on specific factual elements, plus separate assessments for residual facts and writing style. We evaluated 12 frontier LLMs on 100 legal cases from 2025, ranging from 32K to 512K tokens.
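To make the checklist idea concrete, here is a minimal sketch of how per-item scores could be aggregated, assuming exact string matching for single-value fields and list-wise recall for multi-value fields. The field names, matching rules, and equal weighting are our own simplifications for illustration, not the exact scoring used in the paper.

```python
# Illustrative sketch of checklist-style scoring (not the paper's implementation).
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    name: str                  # e.g. "filing_date", "plaintiffs" (hypothetical field names)
    reference: list[str]       # gold value(s) drawn from the reference summary
    list_valued: bool = False  # True for multi-value fields compared list-wise


def item_score(item: ChecklistItem, predicted: list[str]) -> float:
    """Score one item: exact string match for single-value fields,
    recall over the reference values for list-valued fields."""
    ref = [v.strip().lower() for v in item.reference]
    pred = {v.strip().lower() for v in predicted}
    if not ref:                # item not applicable to this case
        return 1.0
    if item.list_valued:
        return sum(v in pred for v in ref) / len(ref)
    return float(ref[0] in pred)


def checklist_score(items: list[ChecklistItem], predictions: dict[str, list[str]]) -> float:
    """Average per-item scores into a single 0-100 checklist score."""
    scores = [item_score(it, predictions.get(it.name, [])) for it in items]
    return 100.0 * sum(scores) / len(scores)
```

Under this toy scheme, a summary that gets the filing date right but recovers only one of two plaintiffs would score 75, which is why multi-value fields are a common source of lost points.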

The results? Even the best model, Gemini 2.5 Pro, achieves only around 50 on S_Gavel-Ref. Models handle simple items well but struggle with multi-value fields and rare case elements.

Looking ahead, as LLMs improve and potentially surpass human-written references, we developed GAVEL-AGENT, an autonomous agent that navigates case documents using six specialized tools to extract checklist items directly. Running on Qwen3, it cuts token usage by 36% at the cost of only a 7% drop in S_checklist compared to GPT-4.1.
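As a rough illustration of the agent idea, here is a minimal sketch of a tool-calling loop, assuming a generic LLM client that returns either a tool call or a final answer. The tool names (list_documents, read_document, search_case), their signatures, and the stopping rule are hypothetical stand-ins, not the six tools from the paper.

```python
# Hypothetical sketch of an agent loop in the spirit of GAVEL-AGENT.
import json
from typing import Callable

# Stub document-navigation tools (stand-ins for the real system's six tools).
def list_documents(case_id: str) -> str:
    """Return the case docket: document ids, titles, and lengths."""
    return "[stub] docket listing for " + case_id

def read_document(doc_id: str, start: int = 0) -> str:
    """Return one chunk of a document, so the agent never loads a full case at once."""
    return f"[stub] text of {doc_id} from offset {start}"

def search_case(case_id: str, query: str) -> str:
    """Keyword search across all case documents."""
    return f"[stub] passages in {case_id} matching '{query}'"

TOOLS: dict[str, Callable[..., str]] = {
    "list_documents": list_documents,
    "read_document": read_document,
    "search_case": search_case,
}

def run_agent(llm: Callable, case_id: str, checklist: list[str], max_steps: int = 40) -> dict:
    """Ask the LLM for the next tool call, execute it, feed the result back,
    and stop when the model returns a final checklist or the step budget runs out."""
    messages = [{"role": "user",
                 "content": f"Extract these checklist items for case {case_id}: {checklist}"}]
    for _ in range(max_steps):
        reply = llm(messages, tools=list(TOOLS))          # assumed client interface
        if reply.get("final_answer") is not None:
            return json.loads(reply["final_answer"])      # item name -> extracted value(s)
        result = TOOLS[reply["tool_name"]](**reply["arguments"])
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "content": result})
    return {}
```

The design point is that the model only ever reads the chunks it asks for rather than the full 100K-500K-token case, which is where the token savings over full-context prompting come from.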

Visualization of Model Summaries

Evaluation of Model Summaries

Gavel-Agent

Visualization of Gavel-Agent

Evaluation of Gavel-Agent