ABOUT
MultiPIT is the largest Twitter-based paraphrase corpus to date. It contains four parts: MultiPITcrowd, MultiPITexpert, MultiPITAuto, and MultiPITNMR. MultiPITcrowd is a collection of crowdsourced annotations made under a loose definition of paraphrase. MultiPITexpert is a collection of expert annotations made under a strict definition of paraphrase. MultiPITAuto is a collection of paraphrase pairs automatically identified from recent Twitter data. MultiPITNMR is the first multi-reference test set for paraphrase generation.
PAPER
Improving Large-scale Paraphrase Acquisition and Generation (EMNLP 2022)
Authors from Georgia Institute of Technology
DATA (available now)
100K+ crowdsourcing annotations
5K+ expert annotations
500K+ automatic annotations
200 × 8 expert annotations
CODE (coming soon...)
Acknowledgement: This material is based in part on research sponsored by IARPA via the BETTER program (contract 19051600004).
| Rank | Metric | Referenceless | Fluency Correlation | Semantic Similarity Correlation | Diversity Correlation | Overall Correlation |