Contract Understanding Atticus Dataset (CUAD) v1 is available for download.
March 1, 2021
Today we release the Contract Understanding Atticus Dataset (CUAD) v1. CUAD v1 is a corpus of 13,000+ labels in 510 commercial legal contracts with rich expert annotations curated for AI training purposes. The dataset has been manually labeled under the supervision of experienced attorneys to identify 41 types of legal clauses in commercial contracts that are considered important in contract review in connection with a corporate transaction, including mergers & acquisitions, corporate finance, investments & IPOs. CUAD and Atticus Labels are licensed under CC BY 4.0.
The beta version of CUAD was released in October 2020 under the name AOK v.1 with 200 contracts. CUAD v1 has 510 contracts and includes both PDF and TXT versions of the full contracts, one master CSV file and 28 Excel files, each corresponding to one or more types of legal clauses, to ensure ease of use by both developers and attorneys.
Data is the bottleneck for contract review AI
In collaboration with UC Berkeley AI researchers, we tested CUAD v1 against nine sophisticated pretrained language models. The results validated our assumption that the lack of high-quality labeled data is the bottleneck for contract review AI. A publication with detailed results for each Atticus Label will be available in the coming weeks.
The figures above show that the number of examples in CUAD affects performance decisively, while the size of the model improves performance more steadily. The figure on the right demonstrates that having only 1,000 training annotations is not enough data, but having approximately 10,000 training annotations starts to greatly improve performance.
Atticus's mission to solve the data problem continues
Although the models' overall performance is relatively high, it varies substantially by clause type. More details will follow in the coming weeks.
Atticus's mission to remove the data bottleneck continues. Our next release will:
double the size of CUAD v1; and
focus on data for clauses with lower performance scores.
To scale our efforts, we are launching the Atticus Open Data Fellow Programs in collaboration with the Berkeley Center for Law and Business and LexLab at UC Hastings Law.
The future is in the hands of lawyers: illustrative use cases
The availability of accurate contract review AI tools depends on lawyers to:
label more contracts; and
inform the tech community of legal use cases.
Use Case #1: Disclosure Schedules
Disclosure schedules in an M&A transaction contain a list of contracts that are responsive to, or exceptions to, the seller's representations and warranties. Disclosure schedules can be hundreds of pages long and are still predominantly typed by hand.
An AI tool that can accurately identify Document Name, Agreement Date and Parties, coupled with simple code, can save hours of attorney time and enable speedy delivery of high-quality work product.
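As a rough illustration of the "simple code" mentioned above, the sketch below turns already-extracted fields into numbered schedule lines. It is a minimal sketch under stated assumptions: the ExtractedContract type, the function names and the line format are hypothetical and are not part of CUAD or any Atticus tool, and the extraction step itself is assumed to have happened elsewhere.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExtractedContract:
    """Hypothetical container for fields an extraction model is assumed to return."""
    document_name: str
    agreement_date: str  # kept as the string found in the contract
    parties: List[str]


def join_parties(parties: List[str]) -> str:
    """Join party names as 'A and B' or 'A, B, and C'."""
    if len(parties) <= 2:
        return " and ".join(parties)
    return ", ".join(parties[:-1]) + ", and " + parties[-1]


def format_schedule_entry(index: int, contract: ExtractedContract) -> str:
    """Render one numbered schedule line (the format is an assumption)."""
    return (
        f"{index}. {contract.document_name}, dated {contract.agreement_date}, "
        f"between {join_parties(contract.parties)}."
    )


def build_schedule(contracts: List[ExtractedContract]) -> List[str]:
    """Number the contracts in the order given and format each as one line."""
    return [format_schedule_entry(i, c) for i, c in enumerate(contracts, start=1)]
```

The output would still need attorney review, since any extraction error carries straight into the schedule.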
Use Case #2: Divested Contracts
In a divestiture, the parent company needs to transfer contracts for the divested business to the buyer. Determining which contracts are for the divested business can be automated by an AI tool that can accurately identify Parties, names of the signing entities and divested products.
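The filtering step described above could be sketched as follows, assuming the parties and products have already been extracted from each contract. The function names and the matching rule (exact match after lowercasing and collapsing whitespace) are assumptions made for illustration only; real entity names vary enough that fuzzier matching and attorney review would likely be needed.

```python
from typing import Iterable, Set


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def is_divested_contract(
    parties: Iterable[str],
    products: Iterable[str],
    divested_entities: Set[str],
    divested_products: Set[str],
) -> bool:
    """Flag a contract if any extracted party or product belongs to the divested business.

    divested_entities and divested_products are expected to be normalized already.
    """
    party_hit = any(normalize(p) in divested_entities for p in parties)
    product_hit = any(normalize(p) in divested_products for p in products)
    return party_hit or product_hit
```

Contracts flagged this way form a candidate list for the transfer, not a final determination.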
Use Case #3: "Uncommon" Clauses
Some clauses are so rare in legal contracts that even the most experienced attorneys may encounter only a few in their careers. CUAD v1 contains a large number of these rare clauses (e.g. ~35 MFN clauses and ~180 Exclusivity clauses) that can be used to supplement proprietary training datasets. In addition to one master CSV file, CUAD v1 contains 28 Excel files, each corresponding to one or more Atticus Labels (MFN, Restrictive Covenants, etc.), to ensure ease of use by developers and attorneys.