ContractKen AI tech: Under the hood
Updated: Aug 25, 2022
In this article, we’ll talk about the technology under the ContractKen hood.
Let’s start by opening a sample contract in MS Word:
Here is a Merger Agreement between two media companies, covering a wide variety of issues, transactions, etc. It is a massive document spanning 82+ pages, not including a large number of exhibits & schedules.
Typically, the execution of such a contract is the result of months, if not years, of contracting work between all parties involved. It is easy to imagine the amount of drafting, review, and iteration that such a large agreement takes.
Reviewing such a large contract is surely not for the weak-hearted or the impatient! This is where an area of AI called Natural Language Processing (NLP) steps in.
Challenges of using NLP for contract reviews
Contracts are unstructured, unstandardized, and use nuanced legal language. Take a look at the example below of two clauses with very similar language but diametrically opposite meanings and implications:
During the Term and for a period of two years thereafter, or for a period of seven years from the date of creation of the Records (whichever is longer) the Supplier shall keep full, true and accurate Records to show compliance with its obligations under this Agreement together with any other records that are required by any professional rules of any regulatory body which apply to the activities of the Supplier or as may from time to time be agreed in writing between the Company and the Supplier.
During the Term and for a period of two years thereafter, or for a period of seven years from the date of creation of the Records (whichever is longer) the Supplier shall keep full, true and accurate Records to show compliance with its obligations under this Agreement together with such other records as may from time to time be agreed in writing between the Company and the Supplier.
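To make the difference between the two clauses concrete: a word-level diff surfaces exactly the span that appears only in the first clause. The sketch below (using Python’s standard-library difflib, with the clauses abbreviated) isolates that span — it is this kind of small textual variation with a large legal impact that makes keyword-based review so error-prone.

```python
import difflib

# Abbreviated versions of the two clauses above; only the tail differs.
clause_a = ("the Supplier shall keep full, true and accurate Records ... "
            "together with any other records that are required by any "
            "professional rules of any regulatory body which apply to the "
            "activities of the Supplier or as may from time to time be "
            "agreed in writing between the Company and the Supplier.")
clause_b = ("the Supplier shall keep full, true and accurate Records ... "
            "together with such other records as may from time to time be "
            "agreed in writing between the Company and the Supplier.")

# Word-level diff: collect the spans that appear only in the first clause.
matcher = difflib.SequenceMatcher(None, clause_a.split(), clause_b.split())
only_in_a = [" ".join(clause_a.split()[i1:i2])
             for op, i1, i2, j1, j2 in matcher.get_opcodes()
             if op in ("replace", "delete")]
print(only_in_a)
```

The diff shows that clause A additionally obliges the Supplier to keep any records required by applicable regulatory bodies — an obligation entirely absent from clause B.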
Contracts exist to guard against rare and potentially catastrophic occurrences, so the tolerance for false negatives and false positives is almost nil.
A contract document is not a simple collection of paragraphs of text whose individual inferences can be summed up into an overall understanding. Instead, it is a carefully constructed instrument of risk management, where the implications of concepts, terms & clauses depend on other concepts, terms & clauses, or even other contracts.
Experts review pieces of a contract, cross-reference, triage, and then conclude what risks a clause presents. Traditional ML algorithms process a document (or a piece of it) in isolation and are not suited to this iterative, interlinked way of assessing risk.
How ContractKen's AI-assisted process helps speed up contract review by up to 50%, with zero errors or oversights
Identifying key clauses present in the contract so that you can focus your attention on the language of those clauses instead of spending time searching for keywords or phrases
Alerting the user to missing clauses or key terms in the document
Similarity scoring for each of the detected clauses on a scale of 1-10 (compared to your organization’s standards for that clause)
Enabling use of a contract review playbook within MS Word - this is the coolest part of our tech, which enables organizations to customize their contract playbook and use it right within Word
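The similarity scoring mentioned above can be illustrated with a toy sketch. Our production system scores detected clauses against your organization’s standard language using learned representations; the pure-Python version below substitutes a simple bag-of-words cosine similarity (an assumption for illustration only) and maps it onto the same 1-10 scale.

```python
from collections import Counter
import math

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between simple bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def similarity_score(clause: str, standard: str) -> int:
    """Map cosine similarity (0..1) onto the article's 1-10 scale."""
    return max(1, round(cosine_similarity(clause, standard) * 10))

# Hypothetical organizational standard vs. a detected clause.
standard = "This Agreement shall be governed by the laws of the State of Delaware."
detected = "This Agreement is governed by and construed under the laws of Delaware."
print(similarity_score(detected, standard))
```

An identical clause scores 10; the further a detected clause drifts from the standard language, the lower its score, flagging it for closer human review.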
All of this functionality relies on multiple NLP models working in unison in the background. Broadly, two types of algorithms are deployed - Pattern Recognition & Deep Learning.
Let's take a look under each one's hood.
We use algorithms like K-Nearest Neighbors (KNN) to recognize patterns in training data. The following diagrams (oversimplified to 3 dimensions) show how a pattern-recognition algorithm solves the (relatively) easier problem of identifying contract metadata.
The cluster-discovery behaviour described here is 'Unsupervised Learning' - i.e. the machine automatically detects patterns of similarity or dissimilarity (across n dimensions) and sorts the data points into various 'clusters'. (Strictly speaking, KNN itself is a supervised method that labels a new point from its nearest labelled neighbours; automatic cluster discovery is characteristic of clustering algorithms such as K-Means.) In this example, after our data pipelines pre-process and tokenize the data in the training documents dataset and feed it into the algorithm, the model creates 3 distinct clusters, corresponding to the key terms ‘Governing Law’, ‘Effective Date’ & ‘Expiry Date’.
When a new data point is fed into the system (in production use), the model calculates its distance (in n-dimensional space) from the center of each cluster the model has identified, and assigns the new data point to the nearest cluster.
This is an oversimplified example of how basic pattern-recognition algorithms can be deployed to detect contract terms on the basis of their meaning, rather than through a keyword-search type of approach.
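The assign-to-nearest-cluster step can be sketched in a few lines. The 3-D centroids below are hypothetical stand-ins for the three clusters in the diagram (in production these would be learned, high-dimensional representations):

```python
import math

# Hypothetical 3-D cluster centres for the three metadata terms above.
# In production these would be high-dimensional vectors learned from data.
centroids = {
    "Governing Law":  (0.9, 0.1, 0.2),
    "Effective Date": (0.1, 0.8, 0.3),
    "Expiry Date":    (0.2, 0.3, 0.9),
}

def assign_cluster(point):
    """Assign a new data point to the nearest cluster centre (Euclidean distance)."""
    return min(centroids, key=lambda name: math.dist(point, centroids[name]))

# A new data point close to the 'Governing Law' centre gets that label.
print(assign_cluster((0.85, 0.15, 0.25)))
```

This nearest-centre assignment is exactly the production-time step described above, just collapsed to three dimensions.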
There are broadly 2 types of models being used here:
Clause Detection & Extraction
This model detects the presence of key contract clauses and identifies their location in the document. We leverage the SQuAD approach to fine-tune several pre-trained language models using the HuggingFace Transformers library. Because the prediction task resembles extractive question answering, we use the QuestionAnswering models in the Transformers library; each ‘Question’ identifies the label category (clause) under consideration. In ML, this technique is called ‘Transfer Learning’.
Take these sentences, for example: 1) “I like to play football” and 2) “I am watching the Julius Caesar play”. The word ‘play’ has a different meaning in each. These models use neural networks as their foundation and consider the semantics of the text, so they can tell such usages apart.
The model returns the precise location of the clauses (ones which are detected) in the document (starting position and length), which is then used by our word add-in to highlight the relevant text.
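Turning the model’s output into a highlight is then a matter of slicing the document at the returned offsets. The sketch below uses a hypothetical (start, length) pair and bracket markers in place of the Word add-in’s visual highlighting:

```python
def highlight(document: str, start: int, length: int) -> str:
    """Mark the detected clause span, as the Word add-in would highlight it."""
    end = start + length
    return document[:start] + "[[" + document[start:end] + "]]" + document[end:]

doc = "This Agreement shall be governed by the laws of Delaware. Notices must be in writing."
# Hypothetical model output: the 'Governing Law' clause starts at 0, length 57.
print(highlight(doc, 0, 57))
```

In the add-in, the same offsets drive Word’s native text highlighting rather than inline markers.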
We’re using a Transformers-based deep learning algorithm to detect the presence and location of many key commercial clauses and terms. To understand more about Transformers, the following article is perhaps the best out there: https://jalammar.github.io/illustrated-transformer/
Primary task formulation: the model predicts which substrings of the contract document correspond to each clause label category, learning the start and end token positions of each substring. This formulation is built from the SQuAD 2.0 setup.
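In the SQuAD-style formulation, the model emits a start score and an end score for every token, and the predicted clause is the valid span (start before end) maximising the combined score. The toy example below, with hypothetical tokens and logits, shows that span-selection step in isolation:

```python
# Toy SQuAD-style span prediction: pick the valid (start <= end) token span
# that maximises start_score + end_score. Scores here are made-up logits.
tokens = ["This", "Agreement", "is", "governed", "by", "Delaware", "law", "."]
start_scores = [0.1, 0.2, 2.5, 0.3, 0.1, 0.2, 0.1, 0.0]
end_scores   = [0.0, 0.1, 0.2, 0.4, 0.3, 0.2, 2.8, 0.1]

best = max(((s, e) for s in range(len(tokens)) for e in range(s, len(tokens))),
           key=lambda se: start_scores[se[0]] + end_scores[se[1]])
predicted_span = " ".join(tokens[best[0]:best[1] + 1])
print(predicted_span)
```

A real fine-tuned model produces these scores per token from the (question, document) pair; everything downstream of the scores works as sketched here.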
The algorithm that we’re using is BERT (short for Bidirectional Encoder Representations from Transformers), introduced in the original BERT paper from Google Research. We have used multiple variations of BERT to optimize the overall Precision & Recall scores, and we continue to test variations of simple algorithms, new data, and model parameters to get higher coverage (i.e. more terms/clauses getting predicted), better accuracy, and superior inference performance.
Named Entity Recognition (NER)
This model identifies key business entities in the contract document - e.g. party names, financial values, etc. At ContractKen, we’ve deployed multiple variants of NER models for specific commercial entities.
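The input/output shape of an entity recogniser can be illustrated with a tiny rule-based stand-in for one entity type. ContractKen’s production NER models are learned, not regex-based; the sketch below only shows what "recognising financial values" means mechanically - each match comes back with its text and character offsets:

```python
import re

# Rule-based stand-in for one NER entity type: monetary values.
# Production NER models are learned; this only illustrates the output shape.
MONEY = re.compile(r"\$\s?\d[\d,]*(?:\.\d+)?(?:\s?(?:million|billion))?")

def find_money(text):
    """Return (entity_text, start, end) for each monetary value found."""
    return [(m.group(), m.start(), m.end()) for m in MONEY.finditer(text)]

sample = "The Purchase Price shall be $4.2 billion, payable in cash, plus $150,000 in fees."
print(find_money(sample))
```

As with clause detection, the character offsets are what the Word add-in uses to highlight each detected entity in place.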
This is a fast-changing domain with ever larger and better language models coming into the open source domain every month. At ContractKen, we’re excited and committed to deploying the best-in-class technology to solve a wide variety of challenging problems with the document review process.
Would love to hear your thoughts and questions in the comments section.