CUAD Performance Results & GitHub Code are now available
March 11, 2021
The Contract Understanding Atticus Dataset (CUAD) v1 is a corpus of 13,000+ labels in 510 commercial legal contracts with rich expert annotations curated for AI training purposes. The dataset has been manually labeled to identify 41 types of legal clauses that are considered important in contract review in connection with corporate transactions, including mergers & acquisitions, corporate finance, investments, and IPOs. CUAD and the Atticus Labels are licensed under CC BY 4.0.
In collaboration with AI researchers at UC Berkeley, we evaluated ten sophisticated pretrained AI models on CUAD v1 and published the performance results here. Code for replicating the results, together with the model trained on CUAD, is published on GitHub here.
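For readers who want to explore the data directly, a minimal sketch of iterating over the annotations is shown below. It assumes CUAD v1 ships as a SQuAD-style JSON file (the file name and exact schema are assumptions; consult the GitHub repository for the authoritative layout), where each question corresponds to one of the 41 Atticus Label categories:

```python
import json

def iter_clause_annotations(cuad):
    """Yield (contract_title, label_question, answer_text) triples from a
    SQuAD-style CUAD dict (schema assumed; check the official repo)."""
    for contract in cuad["data"]:
        for paragraph in contract["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield contract["title"], qa["question"], answer["text"]

# Usage (file name is an assumption):
# with open("CUAD_v1.json") as f:
#     for title, label, text in iter_clause_annotations(json.load(f)):
#         print(title, label, text[:60])
```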
Data is a major bottleneck in performance
Increasing the amount of training data improves performance far more than increasing model size alone. See the graph below.
This highlights the value of CUAD and the need for other high-quality labeled datasets for legal NLP.
The figures above show that the number of training examples in CUAD affects performance decisively, while model size improves performance more gradually. The figure on the right shows that roughly 1,000 training annotations are not enough, whereas performance begins to improve sharply at approximately 10,000 training annotations.
Precision at Recall is a better metric for evaluating contract review AI
Contract review is a task of "finding needles in a haystack." Clauses responsive to any Atticus Label make up only about 10% of each contract on average, and about 0.25% of each contract per individual label.
Precision at Recall is a better metric for "finding needles in a haystack".
We use the following example to explain this metric.
500 contracts contain 5 million clauses, 100 of which are Liquidated Damages clauses.
An AI model that achieves 20% precision @ 80% recall correctly identifies 80 of the 100 Liquidated Damages clauses out of the 5 million (80 = 80% recall × 100).
However, in addition to the 80 correct ones, the model also flags 320 false Liquidated Damages clauses, hence requiring a human to review all 400 flagged clauses to find the 80 correct ones (400 = 80 / 20% precision).
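The arithmetic in the example above can be checked programmatically. This small sketch (the function name is ours, purely illustrative) computes the reviewer workload implied by a precision/recall operating point:

```python
def review_workload(n_positives, recall, precision):
    """Clauses a human must review at a given precision/recall point."""
    true_positives = round(recall * n_positives)   # clauses correctly found
    flagged = round(true_positives / precision)    # everything the model flags
    false_positives = flagged - true_positives     # wrong flags to wade through
    return true_positives, false_positives, flagged

print(review_workload(100, 0.80, 0.20))  # (80, 320, 400)
```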
In summary, recall prevents misses, and precision prevents false positives. There is a tradeoff between recall and precision. A useful model for contract review needs to have high recall because "finding needles in a haystack" is the time-consuming and error-prone task that we need to automate.
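For intuition about how the metric itself is computed, precision at a target recall can be obtained from model confidence scores by lowering the decision threshold just far enough to reach that recall. This is a minimal sketch, not the official CUAD evaluation script:

```python
import numpy as np

def precision_at_recall(y_true, scores, target_recall=0.80):
    """Precision at the highest threshold that achieves target_recall."""
    order = np.argsort(-np.asarray(scores))    # highest-confidence first
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                          # true positives so far
    recall = tp / y.sum()
    k = np.searchsorted(recall, target_recall) # first cutoff reaching target
    return tp[k] / (k + 1)                     # precision at that cutoff
```

A perfect model that ranks all true clauses first achieves precision 1.0 at any recall; a model that interleaves false positives pays for each one with extra human review.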
This figure shows precision at 80% recall for the various types of clauses. Precision varies significantly based on the type of clause.
Contract Review AI is advancing quickly
AI models for contract review are nascent and promising.
Performance on CUAD with a 2021 model is 4x that of its 2018 counterpart. See the figure below.
Atticus remains committed to supercharging Legal AI by open-sourcing high-quality labeled datasets of legal contracts.
This graph shows the average performance of different AI models across model release years.
The average performance of BERT models (released in 2018) is less than 10%, while the average performance of DeBERTa-v2 models (released in 2021) is over 40%. That is a 4x increase in just three years.
How can you help?
Be a part of the change by becoming an Atticus Open Data Fellow.