A framework for scrutinizing machine text

Notice: In the dataset, the special tokens _SEP_ and _QUOTE_ in the error explanations represent , and " respectively.
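
A minimal sketch of reversing this substitution when reading the error explanations, assuming they are available as plain Python strings. The helper name `decode_explanation` and the sample sentence are hypothetical and are not part of the dataset's own tooling.

```python
# Placeholder tokens used in the error explanations, and the characters they stand for.
TOKEN_MAP = {"_SEP_": ",", "_QUOTE_": '"'}


def decode_explanation(text: str) -> str:
    """Replace the placeholder tokens in an explanation with the original characters."""
    for token, char in TOKEN_MAP.items():
        text = text.replace(token, char)
    return text


# Made-up explanation string, for illustration only.
example = "The word _QUOTE_clutter_QUOTE_ is repeated_SEP_ so this is redundant."
print(decode_explanation(example))
# prints: The word "clutter" is repeated, so this is redundant.
```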


Modern neural text generation systems can produce remarkably fluent and grammatical texts. While earlier language models suffered from repetition and syntactic errors, the errors made by contemporary models are often semantic, narrative, or discourse failures.

To facilitate research on these complex error types, we introduce a new structured, crowd-sourced error annotation schema called Scarecrow. The error categories used in Scarecrow (such as redundancy, commonsense errors, and incoherence) were identified by combining expert analysis with several pilot rounds of ontology-free crowd annotation, arriving at a schema that covers the error phenomena found in real machine-generated text.

We use Scarecrow to collect 13k annotations of 1.3k human- and machine-generated paragraphs of English-language news text, amounting to over 41k spans, each labeled with its error category, severity, a natural language explanation, and antecedent span (where relevant). We collect annotations for text generated by state-of-the-art systems with varying known performance levels, from GPT-2 Small through the largest GPT-3. We isolate several factors for detailed analysis, including parameter count, training data, and decoding technique. Our results show both expected and surprising differences across these settings. These findings demonstrate the value of Scarecrow annotations in the assessment of current and future text generation systems.
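
As a rough illustration of what one labeled span carries (error category, severity, explanation, and an optional antecedent span), here is a hedged sketch of a record type. The class name, field names, character-offset representation, and integer severity are all assumptions made for this example; the released data may be structured differently.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SpanAnnotation:
    """One annotated span (hypothetical layout, not the dataset's actual schema)."""
    start: int            # assumed: character offset where the span begins
    end: int              # assumed: character offset just past the span's end
    error_type: str       # one of the error categories, e.g. "Redundant"
    severity: int         # assumed integer scale; the real scale may differ
    explanation: str      # annotator's natural language explanation
    antecedent: Optional[Tuple[int, int]] = None  # earlier span, where relevant

    def text(self, paragraph: str) -> str:
        """Return the annotated substring of the paragraph."""
        return paragraph[self.start:self.end]


# Illustrative paragraph in which the second sentence repeats the first.
first = "Police said the search resumed."
paragraph = first + " " + first
ann = SpanAnnotation(
    start=len(first) + 1,
    end=len(paragraph),
    error_type="Redundant",
    severity=2,
    explanation="Repeats the previous sentence.",
    antecedent=(0, len(first)),
)
print(ann.text(paragraph))
```

Keeping the antecedent optional mirrors the "where relevant" qualifier: only some error types point back to an earlier span.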

Error Types

This table summarizes the 10 error types that annotators choose from to identify problems in text.

| Error Type | Definition | Example |
| --- | --- | --- |
| Grammar and Usage | This category of errors includes missing words, extra words, and incorrect or out-of-order words. | A PhD student from the University of Kent in the UK, claims to have discovered a clever way to explain the positive emoticons in cats. |
| Redundant | Redundant text repeats itself. Sometimes, you will see the exact word or phrase repeated. Other times, the same idea is repeated using different words. | They then made decisions based on Kondo’s instructions, to the extent that they created de-cluttered spaces and got rid of clutter and clutter-filled spaces. |
| Off-prompt | The prompt is a piece of text written by a human that the AI is supposed to continue. Sometimes, however, the AI will write a phrase or sentence that is completely unrelated to the prompt. Other times, the text might be related, but it contradicts the prompt. | Prompt: China sets new record for Economic Growth. Text: The Chinese economy fell 10% this month, the third such loss this year. |
| Self-Contradiction | This occurs when the AI writes something that contradicts another piece of text that it had previously written. | McDonald's is considering a design which will replace the cardboard packaging. Mr Gore-Cotter said: "We recognise the concern around waste. We are now looking at a new design that minimises the plastic bag." |
| Incoherent | Text that doesn’t fit into the above categories but still just doesn’t make any sense at all. | Cats naturally show anxiety and fear by at times breaking apart different parts of the brain in an attempt to keep the others from escaping. |
| Technical Jargon | Jargon or specific words from a field you’re not familiar with. | In Chile, an 800-megawatt photovoltaic plant was built for a record low cost of $129 per megawatt-hour last year. |
| Needs Google | A fact or figure that you suspect might be true, but you would need to Google it to be sure. | It was promoted by Dr. Michael Fanning, the Executive Director of the Foundation for Mental Health Awareness, Inc. |
| Bad Math | Bad math includes problems with basic math (+ - ✖️ ÷), problems converting fixed units, and currency conversions that are wildly impossible (e.g., 1$ = 10£). | One account, @Iain_Rowling1, had over 500,000 followers at one point, but in just four days they fell by around half – some 4,000. |
| Commonsense | Text that violates our everyday basic understanding of the world. | The picture is from high above the South Pole, where close to 100,000 Astronauts live and work. |
| Encyclopedic | Text that is just plain factually wrong, where the correct information is written down in a fact table somewhere, like a textbook, a Wikipedia sidebar, or an encyclopedia. | Japanese Prime Minister Justin Trudeau said he will be halting all imports and exports until the current situation can be contained. |
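
For analysis code, the ten categories in the table above can be captured as an enumeration so that misspelled or unknown labels are caught early. This is only a sketch: the identifiers `ErrorType` and `tally` are hypothetical, and the label strings follow the table's wording, which may not match the strings stored in the dataset files.

```python
from collections import Counter
from enum import Enum
from typing import Iterable


class ErrorType(Enum):
    # Values follow the table's wording; the dataset's stored labels may differ.
    GRAMMAR_AND_USAGE = "Grammar and Usage"
    REDUNDANT = "Redundant"
    OFF_PROMPT = "Off-prompt"
    SELF_CONTRADICTION = "Self-Contradiction"
    INCOHERENT = "Incoherent"
    TECHNICAL_JARGON = "Technical Jargon"
    NEEDS_GOOGLE = "Needs Google"
    BAD_MATH = "Bad Math"
    COMMONSENSE = "Commonsense"
    ENCYCLOPEDIC = "Encyclopedic"


def tally(labels: Iterable[str]) -> Counter:
    """Count spans per error type; an unknown label raises ValueError."""
    return Counter(ErrorType(label) for label in labels)


# Made-up labels for illustration.
counts = tally(["Redundant", "Redundant", "Bad Math"])
print(counts[ErrorType.REDUNDANT], counts[ErrorType.BAD_MATH])
# prints: 2 1
```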

Browse Dataset

We annotate over 1,300 paragraphs with 10 annotators each. To preview an annotation, choose a model and then a prompt. The continuation will appear on the left, and the annotations on the right. You can scroll the annotations window. Select an annotator, then hover over the errors they marked to highlight where they are in the continuation.

Annotation Tool

Here you can demo the annotation tool used by crowd workers to annotate the dataset. Click and drag on any words in the continuation to trigger the annotation popup. As you make annotations, they will appear below the continuation, where you can interact with them further. (Works on desktop computers only.)

System Prompt (Human Written):

Police, the canine unit and members of the community began searching for Luke.

Continuation (Human or AI Model Written):

The search was called off at around 11 p.m. and resumed at 7 a.m. on Thursday. Police said the search was suspended again at around 9 p.m. on Thursday. Police said the search resumed at 7 a.m. on Friday. Police said the search was suspended again at around 9 p.m. on Friday. Police said the search resumed at 7 a.m. on Saturday. Police said the search was suspended again at around 9 p.m. on Saturday. Police said the search resumed at 7 a.m. on Sunday.


If our work inspires you, please consider citing our paper.

    title={Scarecrow: A Framework for Scrutinizing Machine Text},
    author={Yao Dou and Maxwell Forbes and Rik Koncel-Kedziorski and Noah A. Smith and Yejin Choi},