Gavel

Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization

Yao Dou, Wei Xu
Georgia Institute of Technology
Paper Data and Code

GAVEL-REF Framework
Our GAVEL-REF framework for evaluating long-context summarization, featuring checklist evaluation (supporting both string-wise and list-wise comparisons), residual fact evaluation, and writing-style evaluation. An interesting finding: many modern LLMs tend to omit specific names of people or organizations. Light green indicates matched values.
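To make the string-wise vs. list-wise distinction concrete: single-valued items (e.g. a filing date) are compared one string against one string, while multi-valued items (e.g. parties) are compared as sets of values. The framework itself relies on LLM-based judgments; the purely lexical `string_match`/`list_match` below, including the similarity threshold, are invented here as a minimal illustration of the two comparison modes.

```python
from difflib import SequenceMatcher

def string_match(pred: str, ref: str, threshold: float = 0.8) -> bool:
    """String-wise comparison: fuzzy-match two single-valued fields."""
    return SequenceMatcher(None, pred.lower(), ref.lower()).ratio() >= threshold

def list_match(pred: list[str], ref: list[str], threshold: float = 0.8) -> float:
    """List-wise comparison: fraction of reference values recovered,
    greedily pairing each reference value with its best unmatched prediction."""
    remaining = list(pred)
    matched = 0
    for r in ref:
        best = max(
            remaining,
            key=lambda p: SequenceMatcher(None, p.lower(), r.lower()).ratio(),
            default=None,
        )
        if best is not None and string_match(best, r, threshold):
            remaining.remove(best)
            matched += 1
    return matched / len(ref) if ref else 1.0
```

For example, a summary listing only two of three reference organizations would score 2/3 on that list-wise item.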

About

LLMs can now handle up to 1 million tokens of context. But are they really that good? We put this to the test with one of the most demanding real-world tasks: legal case summarization.

A single legal case can span dozens of documents—100K to 500K tokens in total. Summarizing these cases requires understanding complex relationships, tracking multiple parties, and capturing both common details (like filing dates) and rare but critical information (like settlement terms).

We introduce GAVEL-REF, an evaluation framework that goes beyond simple aggregate scores. It uses a 26-item checklist to evaluate summaries on specific factual elements, plus evaluations for residual facts and writing style. We evaluated 12 frontier LLMs on 100 legal cases from 2025, ranging from 32K to 512K tokens.
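The actual 26 items and any per-item weighting are defined in the paper; the sketch below uses invented item names and a plain uniform average, only to illustrate the shape of a per-case GAVEL-REF record with its three components (checklist, residual facts, style).

```python
from dataclasses import dataclass

# Hypothetical checklist schema; the real framework defines 26 items.
CHECKLIST_ITEMS = [
    "filing_date",       # common, single-valued
    "court",
    "parties",           # multi-valued
    "causes_of_action",
    "settlement_terms",  # rare but critical
]

@dataclass
class CaseEvaluation:
    checklist_scores: dict[str, float]  # per-item scores in [0, 1]
    residual_fact_score: float          # facts outside the checklist
    style_score: float                  # writing-style rating

    def s_checklist(self) -> float:
        """Uniform average over items, scaled to 0-100 (illustrative only)."""
        return 100 * sum(self.checklist_scores.values()) / len(self.checklist_scores)
```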

The results? Even the best model, Gemini 2.5 Pro, achieves an S_GAVEL-REF of only around 50. Models handle simple items well but struggle with multi-value fields and rare case elements.

Looking ahead, as LLMs improve and potentially surpass human-written references, we developed GAVEL-AGENT—an autonomous agent that navigates case documents using six specialized tools to extract checklist items directly. With Qwen3, it reduces token usage by 36% with only a 7% drop in S_checklist compared to end-to-end processing with GPT-4.1.

Visualization of Model Summaries

Compare model-generated summaries against human reference summaries, with detailed checklist evaluation showing how well each model captures specific checklist items.


Evaluation of Model Summaries

We evaluated 12 frontier LLMs using GAVEL-REF across 100 legal cases, primarily from 2025, spanning 32K to 512K tokens.

GAVEL-REF evaluation heatmap for 12 LLMs
GAVEL-REF evaluation results across different case length bins. Higher scores indicate better performance.

Gemini 2.5 Pro achieves the best overall performance with an S_GAVEL-REF of 51.0, followed by Claude Sonnet 4 and Gemini 2.5 Flash. Proprietary models consistently outperform open-source alternatives by a clear margin—the best open-source model, GPT-oss 20B, reaches only 45.9.

All models degrade as case length increases. Even models supporting 1M-token context windows show noticeable drops on longer cases. Gemini 2.5 Pro scores 4.7 points lower on 512K cases compared to 32K cases, while GPT-4.1 drops by 7.6 points.

Surprisingly, GPT-5 has the lowest writing-style rating (S_style = 59.1), while Claude and Gemini models score highest at 71.0. GPT-5 tends to produce verbose, checklist-style summaries instead of the requested narrative form—sometimes generating nearly 1,000 words when human summaries average 700. On longer cases, all models produce summaries significantly shorter than human references.

Extract Checklist Directly from Case Documents

Reference-based evaluation requires hours of expert time per case and cannot serve as a long-term gold standard once LLMs surpass humans. Directly extracting checklists from case documents enables scalable evaluation and inference-time suggestions. We experiment with three methods.

End-to-End

Concatenate all case documents chronologically and feed them to long-context LLMs. Each of the 26 checklist items is queried individually for better accuracy.
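A minimal sketch of this method, assuming a generic `llm(prompt) -> str` callable and an invented prompt template; note that the full concatenated context is re-sent once per item, which is why this approach is the most token-hungry.

```python
def end_to_end_extract(documents: list[str], items: list[str], llm) -> dict[str, str]:
    """End-to-end extraction sketch: one query per checklist item
    over the full chronological concatenation of case documents."""
    context = "\n\n".join(documents)  # may be 100K-500K tokens
    checklist = {}
    for item in items:
        prompt = (
            f"Case documents:\n{context}\n\n"
            f"Extract the value of the checklist item '{item}'. "
            f"Answer 'N/A' if the documents do not mention it."
        )
        checklist[item] = llm(prompt)
    return checklist
```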

Chunk-by-Chunk

Split documents into 16K-token chunks that fit modern context windows. Process iteratively—at each step, the model receives the chunk and current checklist state, then outputs updates.
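The iterative loop can be sketched as follows, assuming a rough 4-characters-per-token heuristic and an `llm(chunk=..., state=...) -> dict` update interface, both invented here. The sketch also makes the failure mode visible: a wrong value written into the state early on persists unless a later chunk overwrites it.

```python
def chunk_by_chunk_extract(documents: list[str], items: list[str], llm,
                           chunk_tokens: int = 16_000) -> dict[str, str]:
    """Chunk-by-chunk extraction sketch: thread the current checklist
    state through each 16K-token chunk and apply the model's updates."""
    text = "\n\n".join(documents)
    chunk_chars = chunk_tokens * 4  # rough 4-chars-per-token heuristic
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    state = {item: "N/A" for item in items}
    for chunk in chunks:
        updates = llm(chunk=chunk, state=state)  # model proposes item -> value updates
        state.update({k: v for k, v in updates.items() if k in state})
    return state
```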

Visualization of Document Extracted Checklist

Compare model-extracted checklists from case documents against human-extracted checklists from summaries. This visualization shows how well each extraction method captures the 26 checklist items directly from source documents.


Evaluation of Document Extracted Checklist

We evaluated extraction quality on 40 long cases, comparing each method's extracted checklist against the human-created checklist from the summary.

End-to-end extraction with GPT-4.1 achieves the highest S_checklist of 46.9, but consumes 4.4M tokens—the most expensive approach.

GAVEL-AGENT with 26 individual agents using Qwen3 30B-A3B achieves the second-best score of 43.5 while using only 2.8M tokens—36% fewer tokens than end-to-end. Within GAVEL-AGENT configurations, multi-agent decomposition proves better suited for long-horizon extraction than a single agent handling many items at once.
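The paper's six tools and prompts are not reproduced here; the schematic below shows only the shape of one per-item worker in the multi-agent decomposition, with an invented action format (the model either calls a named tool or returns a final answer).

```python
def run_item_agent(item: str, tools: dict, llm, max_steps: int = 10) -> str:
    """Sketch of one GAVEL-AGENT worker: one agent per checklist item,
    looping over tool calls until it commits to an answer or runs out of steps."""
    history = [f"Goal: extract '{item}' from the case docket."]
    for _ in range(max_steps):
        action = llm(history)  # -> {"tool": ..., "args": ...} or {"answer": ...}
        if "answer" in action:
            return action["answer"]
        result = tools[action["tool"]](**action["args"])
        history.append(f"{action['tool']}{action['args']} -> {result}")
    return "N/A"
```

Running 26 such workers, one per checklist item, keeps each trajectory short, which is consistent with the finding that multi-agent decomposition suits long-horizon extraction better than a single agent juggling all items.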

Chunk-by-chunk performs worst at 38.8, largely due to error accumulation in iterative updates where incorrect values persist and lead to over-extraction.

Notably, all document extraction methods fall well below the 68.2 achieved by GPT-5 extracting from human summaries, showing significant headroom for improving both long-context models and long-horizon agents.

Evaluation of document extraction methods
S_checklist vs. total token usage for each extraction method across 40 cases.

Citation

@article{dou2026gavel,
  title={Gavel: Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization},
  author={Dou, Yao and Xu, Wei},
  journal={arXiv preprint arXiv:2601.04424},
  year={2026}
}