Gavel

Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization

Yao Dou, Wei Xu
Georgia Institute of Technology
Paper Data and Code

GAVEL-REF Framework
Our GAVEL-REF framework for evaluating long-context summarization, featuring checklist evaluation (supporting both string-wise and list-wise comparisons), residual fact evaluation, and writing-style evaluation. An interesting finding: many modern LLMs tend to omit specific names of people or organizations. Light green indicates matched values.
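To make the string-wise vs. list-wise distinction concrete: single-valued items (e.g. a filing date) are compared one string against one string, while multi-valued items (e.g. parties) are compared as sets of values. The framework itself relies on LLM-based judgments; the purely lexical `string_match`/`list_match` below, including the similarity threshold, are invented here as a minimal illustration of the two comparison modes.

```python
from difflib import SequenceMatcher

def string_match(pred: str, ref: str, threshold: float = 0.8) -> bool:
    """String-wise comparison: fuzzy-match two single-valued fields."""
    return SequenceMatcher(None, pred.lower(), ref.lower()).ratio() >= threshold

def list_match(pred: list[str], ref: list[str], threshold: float = 0.8) -> float:
    """List-wise comparison: fraction of reference values recovered,
    greedily pairing each reference value with its best unmatched prediction."""
    remaining = list(pred)
    matched = 0
    for r in ref:
        best = max(
            remaining,
            key=lambda p: SequenceMatcher(None, p.lower(), r.lower()).ratio(),
            default=None,
        )
        if best is not None and string_match(best, r, threshold):
            remaining.remove(best)
            matched += 1
    return matched / len(ref) if ref else 1.0
```

For example, a summary listing only two of three reference organizations would score 2/3 on that list-wise item.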

About

LLMs can now handle up to 1 million tokens of context. But are they really that good? We put this to the test with one of the most demanding real-world tasks: legal case summarization.

A single legal case can span dozens of documents—100K to 500K tokens in total. Summarizing these cases requires understanding complex relationships, tracking multiple parties, and capturing both common details (like filing dates) and rare but critical information (like settlement terms).

We introduce GAVEL-REF, an evaluation framework that goes beyond simple aggregate scores. It uses a 26-item checklist to evaluate summaries on specific factual elements, plus evaluations for residual facts and writing style. We evaluated 12 frontier LLMs on 100 legal cases from 2025, ranging from 32K to 512K tokens.
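The actual 26 items and any per-item weighting are defined in the paper; the sketch below uses invented item names and a plain uniform average, only to illustrate the shape of a per-case GAVEL-REF record with its three components (checklist, residual facts, style).

```python
from dataclasses import dataclass

# Hypothetical checklist schema; the real framework defines 26 items.
CHECKLIST_ITEMS = [
    "filing_date",       # common, single-valued
    "court",
    "parties",           # multi-valued
    "causes_of_action",
    "settlement_terms",  # rare but critical
]

@dataclass
class CaseEvaluation:
    checklist_scores: dict[str, float]  # per-item scores in [0, 1]
    residual_fact_score: float          # facts outside the checklist
    style_score: float                  # writing-style rating

    def s_checklist(self) -> float:
        """Uniform average over items, scaled to 0-100 (illustrative only)."""
        return 100 * sum(self.checklist_scores.values()) / len(self.checklist_scores)
```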

The results? Even the best model, Gemini 2.5 Pro, achieves an S_GAVEL-REF of only around 50. Models handle simple items well but struggle with multi-value fields and rare case elements.

Looking ahead, as LLMs improve and potentially surpass human-written references, we developed GAVEL-AGENT—an autonomous agent that navigates case documents using six specialized tools to extract checklist items directly. With Qwen3, it reduces token usage by 36% with only a 7% drop in S_checklist compared to end-to-end processing with GPT-4.1.

Visualization of Model Summaries

Compare model-generated summaries against human reference summaries, with detailed checklist evaluation showing how well each model captures specific checklist items.


Evaluation of Model Summaries

We evaluated 12 frontier LLMs using GAVEL-REF across 100 legal cases, primarily from 2025, spanning 32K to 512K tokens.

GAVEL-REF evaluation heatmap for 12 LLMs
GAVEL-REF evaluation results across different case length bins. Higher scores indicate better performance.

Gemini 2.5 Pro achieves the best overall performance with an S_GAVEL-REF of 51.0, followed by Claude Sonnet 4 and Gemini 2.5 Flash. Proprietary models consistently outperform open-source alternatives by a clear margin—the best open-source model, GPT-oss 20B, reaches only 45.9.

All models degrade as case length increases. Even models supporting 1M-token context windows show noticeable drops on longer cases. Gemini 2.5 Pro scores 4.7 points lower on 512K cases compared to 32K cases, while GPT-4.1 drops by 7.6 points.

Surprisingly, GPT-5 has the lowest writing-style rating (S_style = 59.1), while Claude and Gemini models score highest at 71.0. GPT-5 tends to produce verbose, checklist-style summaries instead of the requested narrative form—sometimes generating nearly 1,000 words when human summaries average 700. On longer cases, all models produce summaries significantly shorter than human references.

Extract Checklist Directly from Case Documents

Reference-based evaluation requires hours of expert time per case and cannot serve as a long-term gold standard once LLMs surpass humans. Directly extracting checklists from case documents enables scalable evaluation and inference-time suggestions. We experiment with three methods.

End-to-End

Concatenate all case documents chronologically and feed them to long-context LLMs. Each of the 26 checklist items is queried individually for better accuracy.
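A minimal sketch of this method, assuming a generic `llm(prompt) -> str` callable and an invented prompt template; note that the full concatenated context is re-sent once per item, which is why this approach is the most token-hungry.

```python
def end_to_end_extract(documents: list[str], items: list[str], llm) -> dict[str, str]:
    """End-to-end extraction sketch: one query per checklist item
    over the full chronological concatenation of case documents."""
    context = "\n\n".join(documents)  # may be 100K-500K tokens
    checklist = {}
    for item in items:
        prompt = (
            f"Case documents:\n{context}\n\n"
            f"Extract the value of the checklist item '{item}'. "
            f"Answer 'N/A' if the documents do not mention it."
        )
        checklist[item] = llm(prompt)
    return checklist
```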

Chunk-by-Chunk

Split documents into 16K-token chunks that fit modern context windows. Process iteratively—at each step, the model receives the chunk and current checklist state, then outputs updates.
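The iterative loop can be sketched as follows, assuming a rough 4-characters-per-token heuristic and an `llm(chunk=..., state=...) -> dict` update interface, both invented here. The sketch also makes the failure mode visible: a wrong value written into the state early on persists unless a later chunk overwrites it.

```python
def chunk_by_chunk_extract(documents: list[str], items: list[str], llm,
                           chunk_tokens: int = 16_000) -> dict[str, str]:
    """Chunk-by-chunk extraction sketch: thread the current checklist
    state through each 16K-token chunk and apply the model's updates."""
    text = "\n\n".join(documents)
    chunk_chars = chunk_tokens * 4  # rough 4-chars-per-token heuristic
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    state = {item: "N/A" for item in items}
    for chunk in chunks:
        updates = llm(chunk=chunk, state=state)  # model proposes item -> value updates
        state.update({k: v for k, v in updates.items() if k in state})
    return state
```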

Visualization of Document Extracted Checklist

Compare model-extracted checklists from case documents against human-extracted checklists from summaries. This visualization shows how well each extraction method captures the 26 checklist items directly from source documents.


Evaluation of Document Extracted Checklist

We evaluated extraction quality on 40 long cases, comparing each method's extracted checklist against the human-created checklist from the summary.

End-to-end extraction with GPT-4.1 achieves the highest S_checklist of 46.9, but consumes 4.4M tokens—the most expensive approach.

GAVEL-AGENT with 26 individual agents using Qwen3 30B-A3B achieves the second-best score of 43.5 while using only 2.8M tokens—36% fewer tokens than end-to-end. Within GAVEL-AGENT configurations, multi-agent decomposition proves better suited for long-horizon extraction than a single agent handling many items at once.
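The paper's six tools and prompts are not reproduced here; the schematic below shows only the shape of one per-item worker in the multi-agent decomposition, with an invented action format (the model either calls a named tool or returns a final answer).

```python
def run_item_agent(item: str, tools: dict, llm, max_steps: int = 10) -> str:
    """Sketch of one GAVEL-AGENT worker: one agent per checklist item,
    looping over tool calls until it commits to an answer or runs out of steps."""
    history = [f"Goal: extract '{item}' from the case docket."]
    for _ in range(max_steps):
        action = llm(history)  # -> {"tool": ..., "args": ...} or {"answer": ...}
        if "answer" in action:
            return action["answer"]
        result = tools[action["tool"]](**action["args"])
        history.append(f"{action['tool']}{action['args']} -> {result}")
    return "N/A"
```

Running 26 such workers, one per checklist item, keeps each trajectory short, which is consistent with the finding that multi-agent decomposition suits long-horizon extraction better than a single agent juggling all items.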

Chunk-by-chunk performs worst at 38.8, largely due to error accumulation in iterative updates where incorrect values persist and lead to over-extraction.

Notably, all document extraction methods fall well below the 68.2 achieved by GPT-5 extracting from human summaries, showing significant headroom for improving both long-context models and long-horizon agents.

Evaluation of document extraction methods
S_checklist vs. total token usage for each extraction method across 40 cases.

Citation

@article{dou2026gavel,
  title={Gavel: Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization},
  author={Dou, Yao and Xu, Wei},
  journal={arXiv preprint arXiv:2601.04424},
  year={2026}
}