
The Atticus White Paper

March 2021

(as of March 2021, and may be updated from time to time)


The Atticus Project is a grassroots movement convinced that trustworthy AI systems will shape the future of the legal industry. Artificial intelligence has the potential to disrupt every aspect of human life as we know it. The OECD asserts that AI is “(…) reshaping economies, promising to generate productivity gains, improve efficiency and lower costs. It contributes to better lives and helps people make better predictions and more informed decisions”. The AI revolution is fueled by data-driven innovation, which requires large amounts of high-quality data to make better predictions and improve decision-making processes. In Tomorrow’s Lawyers, Richard Susskind predicts that the legal industry will be disrupted by what he calls “legal technologies” such as document automation, e-learning, legal open-sourcing, closed legal communities, workflow and project management, document analysis, machine prediction, and legal question answering.

We believe that contract review is ripe for disruption by AI and data-driven innovation. The successful design, development, and deployment of accurate, trustworthy AI systems for the legal industry depend on the quality of labeled and annotated data. Only the legal industry, with its domain expertise, can solve this problem by labeling, annotating, and curating high-quality datasets. The Atticus Project is determined to stand at the frontier of the legal AI industry by filling this gap with its crowdsourced domain expertise. We believe that lawful, ethical, and robust AI systems can help improve speed, efficiency, and consistency in repetitive or mechanical tasks usually undertaken by junior associates in law firms or in-house legal departments. Expensive legal resources can then be freed from these tasks to focus on the critical analysis and strategic advisory services that clients need.

The Atticus Project: Labels, Datasets, and Documentation

The Atticus Project aims to accelerate the development and deployment of AI systems for the legal industry. It will encourage best practices and industry standards regarding openness, algorithmic transparency, control, accountability, and explicability to address AI governance concerns. Within this context, the first stage of The Atticus Project focuses on curating, labeling, and open-sourcing a set of high-quality crowdsourced contract clauses and datasets of legal contracts.

CUAD (Contract Understanding Atticus Dataset) includes the following:

Atticus Labels: The Atticus Labels comprise a list of provisions that attorneys look for in a contract during contract review. Attorneys naturally look for different contract provisions for different purposes. The published Atticus Labels focus on provisions that our legal experts deem most relevant in a due diligence review associated with a transaction, such as an M&A deal, a venture financing, or an initial public offering (IPO). The Atticus Labels are divided into the following three categories:

  • General information includes terms such as party names, document names, dates and renewal terms. It may also include the number of options for option agreements or scope of license for IP licenses.

  • Restrictive covenants include terms such as most-favored-nation and exclusivity clauses. These are considered some of the most troublesome provisions because they restrict the buyer’s or the company’s ability to operate the business after a transaction.

  • Revenue risks include terms such as minimum commitment or change of control clauses that may require the buyer or the company to incur additional cost or take remedial measures after the closing of a transaction.

The Atticus Labels are open-sourced under CC BY 4.0 and free to the public. The Atticus Project aims to leverage AI to automate the "finding-needles-in-a-haystack" portion of the contract review process. We believe that the task of the AI is to find these provisions in a large volume of contracts, whereas analyzing the provisions and conducting risk assessment should be done by human attorneys who are familiar with the business and transactional context.

Contract Understanding Atticus Dataset (CUAD): CUAD is a corpus of legal contracts together with the provisions in each contract that correspond to the Atticus Labels. The provisions are labeled or tagged by Atticus volunteers at leading law schools and quality-checked by experienced attorneys at leading in-house legal departments and law firms. The dataset includes an answer-key file in CSV, JSON, and XLSX formats and the underlying contracts as PDF and TXT files. CUAD v1 includes 510 contracts with over 13,000 labels. Future releases will include more contracts and additional types of contract provisions.
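
As a rough illustration of how the JSON answer key might be consumed, the sketch below tallies labeled provisions per label. It assumes a SQuAD-style layout (data, paragraphs, qas, answers) and an id convention of contract title plus label name, neither of which is specified in this paper. The sample record and the count_labels helper are hypothetical, so the field names should be checked against the released files before relying on them.

```python
import json
from collections import Counter

# Hypothetical, hand-written sample standing in for the JSON answer key.
# The SQuAD-style field names are an assumption, not taken from this paper.
SAMPLE = json.loads("""
{
  "data": [
    {
      "title": "ExampleDistributorAgreement",
      "paragraphs": [
        {
          "context": "This Agreement is exclusive. Term renews annually.",
          "qas": [
            {"id": "ExampleDistributorAgreement__Exclusivity",
             "answers": [{"text": "This Agreement is exclusive.", "answer_start": 0}]},
            {"id": "ExampleDistributorAgreement__Renewal Term",
             "answers": [{"text": "Term renews annually.", "answer_start": 29}]},
            {"id": "ExampleDistributorAgreement__Most Favored Nation",
             "answers": []}
          ]
        }
      ]
    }
  ]
}
""")


def count_labels(answer_key):
    """Count labeled spans per label name across all contracts."""
    counts = Counter()
    for contract in answer_key["data"]:
        for paragraph in contract["paragraphs"]:
            for qa in paragraph["qas"]:
                # Assumed id convention: "<contract title>__<label name>".
                label = qa["id"].split("__", 1)[1]
                counts[label] += len(qa["answers"])
    return counts


counts = count_labels(SAMPLE)
for label, n in sorted(counts.items()):
    print(f"{label}: {n}")
```

A label with no tagged span in a contract still appears in the tally with a count of zero. This makes it easy to see which provisions are absent from a contract, not just which are present.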

CUAD consists solely of contracts from the Electronic Data Gathering, Analysis, and Retrieval system (“EDGAR”) maintained by the U.S. Securities and Exchange Commission (SEC). Publicly traded and other reporting companies are required by SEC rules to file certain types of contracts with the SEC through EDGAR. Access to EDGAR documents is free and open to the public. We fully acknowledge that EDGAR is not an ideal source of legal contracts because it is not representative of the overall population of legal contracts. Because only material contracts are required to be filed with the SEC, EDGAR contracts are more complicated and more heavily negotiated than the general population of legal contracts. We are aware of this inherent bias. However, EDGAR contracts have the advantage of containing a large sample of provisions that are hard to find in the general population. One company may have only one or two contracts that contain exclusivity clauses, whereas EDGAR contracts may have hundreds of them. EDGAR contracts also have the benefit of being publicly available. The Atticus Dataset, when completed by the end of 2020, will be open-sourced and free to the public.

Contract review is an integral part of highly confidential transactions such as M&A, IPOs, or financings. Accordingly, The Atticus Project has a great interest in promoting the development of trustworthy AI tools that protect confidentiality and privacy while enhancing efficiency and accuracy.

Trustworthy AI for the Legal Industry
The European Union High-Level Expert Group on Artificial Intelligence (EU-HLEG) released its Ethics Guidelines for Trustworthy AI. The EU-HLEG suggests that trustworthy AI ought to meet three criteria: 1) AI should be lawful, complying with all applicable laws and regulations; 2) AI should be ethical, complying with ethical principles; and 3) AI should be technically and socially robust to avoid causing unintentional harm. Based on these recommendations, the EU-HLEG advises that investment in AI systems should be oriented towards the development of trustworthy AI. Similarly, the OECD released a set of recommendations for the design, development, and deployment of AI systems.
The Atticus Project wants to actively take part in this global effort by developing a set of protocols to help build trustworthy AI systems for the legal industry. The Atticus protocols aim to promote the openness, transparency, and portability that are key to attracting top-notch AI talent to the legal AI market and supercharging innovation.

Building Data Trust: A Proposed Data Governance Framework for Legal Tech
Every AI system that processes the Atticus Dataset to develop a contract review tool ought to strive to follow GDPR, CCPA, and other relevant privacy regulations to guarantee that Atticus and its members retain control over their data. Data ownership poses one of the major challenges of AI and data-driven innovation. These issues may be relatively easy to address at this stage because CUAD is limited to publicly available contracts, but they will be harder to address if the needs of the project escalate quickly and require the use of non-public datasets.
The use of non-public datasets will pose many technical and legal challenges. However, we are convinced that data anonymization or de-identification ought to create a path for data or algorithm portability to encourage innovation, competition, and the development of new LegalTech services for the legal industry. We also believe that portability is not incompatible with ownership of non-public datasets, insofar as the participants have devoted their resources and their in-house legal departments’ expertise to carefully building their legal documents and internal processes over time.
We consider that the combination of different IP and data protection tools could allow Atticus to create portable datasets and to retain and assert ownership over them and their derivatives (models and algorithms developed by the AI vendor that processes or mines CUAD). We believe, furthermore, that this data governance framework could give Atticus full control over its datasets and their derivatives and help prevent IP disputes with AI vendors or amongst participants, present or future.

Ethics by Design and by Default
Mistrust of AI systems is flagged by many in-house counsel as one of the major obstacles to their adoption in the legal industry. According to the Thomson Reuters report on AI, the legal industry considers that the lack of transparency, explainability, accountability, and control over AI systems casts doubt on how these systems process confidential information, how they explain the way they produce an outcome, and who should be held accountable for mistakes made by the machine.
The EU-HLEG on AI suggests that Ethical AI "(…) can improve individual flourishing and collective wellbeing by generating prosperity, value creation and wealth maximization. It can contribute to achieving a fair society, by helping to increase citizens' health and well-being in ways that foster equality in the distribution of economic, social and political opportunity". To fulfill these ideals, the EU-HLEG on AI indicates that trustworthy AI should be rooted upon four ethical principles: 1) Respect for human autonomy, 2) Prevention of harm, 3) Fairness, and 4) Explicability.
These four ethical principles are critical for the design, development, and deployment of AI systems for the legal industry. The Atticus Project considers that these concerns can be addressed by including four minimum ethical requirements in the Atticus Protocols, with which any AI system that processes Atticus datasets ought to comply by design and by default:

  • Transparency.

  • Accountability.

  • Control.

  • Explicability.

In implementing these ethical requirements, AI vendors should provide the Atticus Project with information about:

  • The logic involved and envisaged consequences of the AI system.

  • General functionality of the AI system, model and its specifications.

  • Nature and extent of human intervention in the automated decision-making process.

  • Rationale, factors, reasons, and circumstances of a specific automated decision.
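
The four disclosure items above could be captured in a simple structured record. The sketch below is purely illustrative and is not part of the Atticus Protocols: the VendorDisclosure class, its field names, and the missing_fields check are hypothetical. The oversight modes are taken from the terms used later in this paper.

```python
from dataclasses import dataclass
from enum import Enum


class HumanOversight(Enum):
    """Oversight modes named later in this paper."""
    HUMAN_IN_THE_LOOP = "human-in-the-loop"
    HUMAN_ON_THE_LOOP = "human-on-the-loop"
    HUMAN_IN_COMMAND = "human-in-command"


@dataclass
class VendorDisclosure:
    """Hypothetical record with one field per disclosure item."""
    logic_and_consequences: str
    general_functionality: str
    human_oversight: HumanOversight
    decision_rationale: str

    def missing_fields(self):
        """Return the names of free-text fields that were left blank."""
        return [name for name, value in vars(self).items()
                if isinstance(value, str) and not value.strip()]


# Example values are invented for illustration only.
disclosure = VendorDisclosure(
    logic_and_consequences="Span extraction; flags clauses for attorney review.",
    general_functionality="",
    human_oversight=HumanOversight.HUMAN_IN_THE_LOOP,
    decision_rationale="Clause text matched the 'Exclusivity' label.",
)
print(disclosure.missing_fields())
```

A record like this would let a reviewer see at a glance which disclosures a vendor has not yet supplied. How the information is actually collected is left open here.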

The black box explanation problem, usually framed as an intelligibility problem that defies human intuition (how does the machine work, and why does it deliver counterintuitive outputs?), remains unresolved. The Atticus Project won’t be able to solve it, but we believe these measures could help bring some practical insights to the conversation:
A. Participants will have access to the data lineage and how it has been curated by expert in-house legal departments.
B. Participants will be aware of the degree of human intervention during the AI system’s cycle.
C. Participants will have a pragmatic understanding of the logic involved in the model.
D. Upon request, participants may be able to know the rationale, reasons, and circumstances of a specific decision.
This is critical for developing an ethical and trustworthy AI system and for improving the accuracy and recall of dynamic models. Atticus might not be in a position to solve AI's intelligibility problem right now (how does the machine work?) but it will certainly be in a strategic position to encourage best ethical practices that may lead to solving the explainability problem in Legal Tech (how did the machine reach a decision? Where is the data coming from? Is there a human involved and to what extent (human-in-the-loop, human-on-the-loop, or human-in-command)? Who should be held accountable? How can we improve the model? and so on).
In other words, we suggest that the combination of high-quality crowdsourced Atticus datasets and ethics by design and by default requirements should allow The Atticus Project to obtain a reasonable explanation of the decisions reached by the AI systems that process CUAD. We will carefully document this process and update this paper to share what we learn with the general public.

Participation in the Atticus Project
Participation in the Atticus Project means an opportunity to pioneer a transformation in the legal industry by shaping the development of AI tools for legal contract review. Refer to “How you can help” for more information.


  1. Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake our World (2015); Michael I. Jordan, Artificial Intelligence: The Revolution that hasn’t Happened Yet, 1 Harvard Data Science Review (2019); Greg Shaw, The Future Computed: AI & Manufacturing, Microsoft Corporation (2019).

  2. OECD, Artificial Intelligence in Society, OECD Publishing, Paris (2019).

  3. OECD, Data-driven Innovation: Big Data for Growth and Well-being, OECD Publishing, 10 (2014); HM Treasury, The Economic Value of Data: Discussion Paper, Crown (2018); McKinsey Global Institute, Artificial Intelligence: The Next Digital Frontier?, Discussion Paper (2017).

  4. Thomson Reuters, “Ready or Not: Artificial Intelligence and Corporate Legal Departments”, Legal Departments 2025 series.

  5. Richard Susskind, Tomorrow’s Lawyers: An Introduction to Your Future, Oxford (2nd ed., 2019).

  6. Ken Goldberg & Vinod Kumar, Cognitive Diversity: AI & The Future of Work, Tata Communications (2018).

  7. OECD, Recommendations of the Council on Artificial Intelligence, OECD/LEGAL/0449; High-Level Expert Group on Artificial Intelligence, Ethics Guidelines for Trustworthy AI, 9 (2019); High-Level Expert Group on Artificial Intelligence, Policy and Investment Recommendations for Trustworthy AI (2019).

  8. Thomson Reuters, “Ready or Not: Artificial Intelligence and Corporate Legal Departments”, Legal Departments 2025 series.

  9. High-Level Expert Group on Artificial Intelligence, Policy and Investment Recommendations for Trustworthy AI (2019).

  10. OECD, Recommendations of the Council on Artificial Intelligence, OECD/LEGAL/0449.

  11. GDPR Recital 26.

  12. Cal. Civ. Code §§ 1798.140(a), (h), (o), (r), and 1798.145(a)(5).

  13. Article 29 Data Protection Working Party, Opinion 05/2014 on Anonymization Techniques, 0829/14/EN WP216 (2014).

  14. Vid. HM Treasury, The Economic Value of Data: Discussion Paper, Crown (2018).

  15. Thomson Reuters, “Ready or Not: Artificial Intelligence and Corporate Legal Departments”, Legal Departments 2025 series.

  16. High-Level Expert Group on Artificial Intelligence, Ethics Guidelines for Trustworthy AI, 9 (2019).

  17. Ibid. at 11.

  18. OECD, Artificial Intelligence in Society, OECD Publishing, Paris (2019); OECD, Recommendations of the Council on Artificial Intelligence, OECD/LEGAL/0449.

  19. Vid. Article 29 Data Protection Working Party, Guidelines on Automated Individual Decision-Making and Profiling for the Purposes of Regulation 2016/679, 17/EN WP251rev.01 (Feb. 6, 2018); Sandra Wachter, Brent Mittelstadt & Luciano Floridi, Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation, 7 Int’l Data Privacy L. 76 (2017); Andrew D. Selbst & Solon Barocas, The Intuitive Appeal of Explainable Machines, 87 Fordham L. Rev. 1085 (2018); Margot E. Kaminski, The Right to Explanation, Explained, 34 Berkeley Tech. L. J. 189 (2019).
