AI for Contract Drafting and Review

36+ Months of Development

How ContractKen's AI Has Evolved

ContractKen started building contract AI before ChatGPT launched. The system has gone through four distinct phases, each adding a layer of capability on top of the last.

Phase 1: 2022 - 2023

Foundation - Pattern Recognition & Classification

The initial system focused on teaching machines to identify and classify contract clauses. We fine-tuned BERT-based models using a SQuAD-style question-answering formulation: "Where is the arbitration clause?" and "How similar is this indemnification language to our benchmark?" K-Nearest Neighbors (KNN) similarity matching handled standard clause recognition across large contract sets. Named Entity Recognition (spaCy + custom models) extracted parties, dates, monetary values, and defined terms.
Key techniques: fine-tuned BERT, SQuAD-style Q&A, KNN similarity matching, NER (spaCy), entity extraction.
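The nearest-neighbor matching from this phase can be sketched in a few lines. The bag-of-words cosine scoring below is a toy stand-in (the production system used learned representations), and the clause library here is invented for illustration:

```python
import math
import re
from collections import Counter

def bow_vector(text):
    """Toy bag-of-words vector; the production system used learned embeddings."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest_clause_types(query, library, k=1):
    """Return the clause types of the k most similar library clauses."""
    qv = bow_vector(query)
    ranked = sorted(library, key=lambda c: cosine(qv, bow_vector(c["text"])), reverse=True)
    return [c["type"] for c in ranked[:k]]

# Invented mini-library for demonstration
library = [
    {"type": "arbitration", "text": "Any dispute shall be settled by binding arbitration."},
    {"type": "indemnification", "text": "Vendor shall indemnify Client against all losses."},
    {"type": "termination", "text": "Either party may terminate this agreement upon notice."},
]

print(nearest_clause_types("Vendor shall indemnify Client for losses arising from breach.", library))
# → ['indemnification']
```

Nearest-neighbor lookup like this is cheap to run at scale, which is why it suited the early pattern-recognition phase.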
Phase 2: 2023 - 2024

Intelligence - NLI, Fine-Tuning & Multi-Model Architecture

The system moved from pattern matching to reasoning. We adopted DeBERTa for Natural Language Inference (NLI) - the ability to determine whether a contract clause entails, contradicts, or is neutral relative to a playbook standard. This became the backbone of playbook compliance checking. The architecture evolved into a multi-model system with task-specific routing: clause classification routed to DeBERTa, entity extraction to the NER pipeline, risk scoring to specialized classifiers. Each model was fine-tuned on legal corpora, evaluated on domain-specific benchmarks (F1, precision, recall).
Key techniques: DeBERTa (NLI), fine-tuning, model routing, task-specific classifiers, domain evaluation.
Phase 3: 2024 - 2025

Generation - LLM Integration with RAG & the Moderation Layer

Large language models added the ability to explain issues, suggest mitigations, and generate redline text. But feeding raw contract text to LLMs was a non-starter for legal confidentiality. We built the Moderation Layer - an architectural privacy control that masks confidential information (party names, deal values, proprietary terms) before any text reaches an LLM. Retrieval-Augmented Generation (RAG) grounded every LLM output against the organization's clause library, playbooks, and precedents. The AI stopped hallucinating because it was forced to cite its sources.
Key techniques: LLM integration, RAG, Moderation Layer, semantic chunking, source attribution.
Phase 4: 2025 - Present

Orchestration - The Compound AI System

Today, a single contract review triggers a coordinated pipeline of specialized models. The system parses the document structure, segments clauses semantically, classifies each clause via NLI, extracts entities, scores risks against playbook positions, retrieves relevant knowledge (clauses, precedents, standards), generates analysis and redlines via LLM, and post-processes everything into clean Word tracked changes. Each step uses the right model for the job. This is a compound AI system with multiple specialized components - the opposite of a single LLM call.
Key capabilities: compound AI system, pipeline orchestration, playbook enforcement, precedent-based drafting, analytics & benchmarking.
What Happens When You Click "Review"

The Contract Review Pipeline

A single contract review triggers a coordinated sequence of specialized models and processing steps. Here is what happens under the hood.

Why this matters: A production-grade contract review system is a compound AI system with multiple specialized components working in sequence. A single LLM call cannot parse document structure, classify 100+ clause types, extract entities, check playbook compliance, retrieve relevant precedents, AND generate accurate redlines. Each step requires a different model optimized for a different task.
Step 1: Document Parsing & Structure Extraction (Rule-based + ML)

The contract is parsed into its structural components: recitals, definitions, substantive provisions, general provisions, schedules, and signature blocks. Section numbering, heading hierarchy, and cross-reference targets are identified. This structural understanding is critical because a limitation of liability clause may depend on terms defined elsewhere in the document.
Handles DOCX, PDF, RTF, and image formats. OCR applied where needed. The parser respects document hierarchy rather than treating the contract as flat text.
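The structural pass described above can be illustrated with a minimal, hypothetical parser that recognizes decimal section numbering and infers nesting depth (real parsing also handles headings, schedules, and OCR output):

```python
import re

# Hypothetical minimal parser: detect decimal section numbering and infer depth.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")

def parse_structure(lines):
    """Return (section_number, depth, title) for each numbered heading line."""
    sections = []
    for line in lines:
        m = HEADING.match(line.strip())
        if m:
            number, title = m.groups()
            depth = number.count(".") + 1   # "4.2" -> depth 2
            sections.append((number, depth, title))
    return sections

doc = [
    "1 DEFINITIONS",
    '1.3 "Confidential Information" means ...',
    "4 LIMITATION OF LIABILITY",
    "4.2 Cap on Damages",
]
print(parse_structure(doc))
```

The inferred depth is what lets later steps know that 4.2 is subordinate to 4, and that a reference to "Section 1.3" points into the definitions.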
Step 2: Clause Segmentation & Classification (DeBERTa NLI)

Each provision is segmented into semantic clause units and classified using DeBERTa-based Natural Language Inference. The model determines the clause type (indemnification, limitation of liability, termination, IP assignment, etc.) across 100+ categories. Classification uses NLI rather than keyword matching - the model understands that "neither party shall be liable for incidental damages" is a consequential damages exclusion even though it never uses that phrase.
Fine-tuned on legal corpora. Evaluated on domain-specific benchmarks (F1, precision, recall). See the NLI Deep Dive below for how this works.
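The NLI-as-classifier framing works roughly like this sketch: each candidate clause type becomes a hypothesis, and the type with the highest entailment score wins. The `toy_score` function is a purely illustrative stand-in for the fine-tuned DeBERTa cross-encoder:

```python
CLAUSE_TYPES = ["indemnification", "limitation of liability", "termination"]

def classify_clause(clause, entail_score):
    """Pick the clause type whose hypothesis scores highest for entailment.
    entail_score(premise, hypothesis) stands in for a DeBERTa NLI cross-encoder."""
    return max(CLAUSE_TYPES, key=lambda t: entail_score(clause, f"This clause is about {t}."))

def toy_score(premise, hypothesis):
    # Purely illustrative stand-in; the real model scores semantics, not keywords.
    cues = {"liable": "limitation of liability",
            "indemnify": "indemnification",
            "terminate": "termination"}
    return max((1.0 for cue, label in cues.items()
                if cue in premise.lower() and label in hypothesis), default=0.0)

print(classify_clause("Neither party shall be liable for incidental damages.", toy_score))
# → limitation of liability
```

Swapping `toy_score` for a real NLI model keeps the same control flow while gaining the semantic understanding the paragraph above describes.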
Step 3: Entity Extraction (spaCy + Custom NER Models)

Named Entity Recognition identifies and extracts structured data from unstructured text: party names, dates, monetary values, defined terms, jurisdiction references, and regulatory citations. Custom entity types extend standard NER categories for legal-specific patterns (e.g., notice periods, renewal terms, cap multipliers).
Extracted entities feed into multiple downstream processes: the Moderation Layer uses them for anonymization, the risk assessment uses them for quantitative checks (e.g., "is this cap below our minimum?"), and the formatting checker uses them for consistency validation.
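A regex-only sketch of the kinds of legal entity patterns described here (the actual pipeline uses spaCy with trained custom components; these patterns and labels are illustrative):

```python
import re

# Regex stand-ins for a few legal entity types; production uses spaCy + custom models.
PATTERNS = {
    "MONEY": re.compile(r"\$[\d,]+(?:\.\d{2})?"),
    "NOTICE_PERIOD": re.compile(r"\b\d+\s+days?'?\s+(?:prior\s+)?(?:written\s+)?notice\b", re.I),
    "DEFINED_TERM": re.compile(r'\("([^"]+)"\)'),
}

def extract_entities(text):
    """Return all matches per entity type."""
    return {label: pat.findall(text) for label, pat in PATTERNS.items()}

clause = ('Acme Corp ("Buyer") may terminate upon 30 days\' prior written notice '
          "and shall pay a fee of $4,750,000.")
print(extract_entities(clause))
```

In the real system these extractions feed the Moderation Layer (for masking), the risk checks (for quantitative comparisons), and the formatting checker, as noted above.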
Step 4: Risk Assessment & Playbook Scoring (NLI + Scoring Models)

Each classified clause is scored against the organization's playbook positions. The NLI model determines whether the clause language entails, contradicts, or is neutral relative to each playbook position (preferred, fallback, walkaway). Clauses below walkaway are flagged as high risk. Clauses between fallback and walkaway are medium risk. Missing clause types required by the playbook are identified through gap analysis.
Severity ranking is configurable per organization. A clause at "fallback" level may be acceptable for routine vendor agreements but flagged as high risk for high-value M&A transactions.
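The position-based scoring logic might be sketched like this, assuming the NLI model has already returned a label per playbook position (severity names follow the description above; the exact rules are configurable per organization):

```python
# Hypothetical scoring rules: map NLI verdicts against the three playbook
# positions to a severity level, following the thresholds described above.
def severity(nli_labels):
    """nli_labels: dict of position -> NLI label from the model,
    where 'entailment' means the clause satisfies that position."""
    if nli_labels.get("preferred") == "entailment":
        return "ok"
    if nli_labels.get("fallback") == "entailment":
        return "low"
    if nli_labels.get("walkaway") == "entailment":
        return "medium"       # between fallback and walkaway
    return "high"             # below walkaway

print(severity({"preferred": "contradiction", "fallback": "contradiction",
                "walkaway": "entailment"}))     # → medium
print(severity({"preferred": "contradiction", "fallback": "contradiction",
                "walkaway": "contradiction"}))  # → high
```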
Step 5: Moderation Layer (Privacy Gate; NER + Regex + Custom Rules)

Before any text reaches an external LLM, the Moderation Layer intercepts it. Using the entities extracted in Step 3 plus configurable regex patterns and customer-defined dictionaries, confidential information is replaced with opaque tokens: party names become [PARTY_A], monetary values become [AMOUNT], proprietary terms become [TERM_1]. A mapping table is maintained so originals can be restored in the output. The raw text never leaves the client environment unprotected.
Organizations can configure which entity types to mask. The system supports custom dictionaries for trade names, project codes, and internal terminology. Full technical details on the Moderation Layer page.
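A minimal sketch of the mask / de-mask round trip, using hand-supplied dictionaries in place of NER output (the token format and entity names here are illustrative, not ContractKen's actual scheme):

```python
def mask(text, dictionaries):
    """Replace known confidential terms with opaque tokens and keep a mapping
    table for later de-masking. `dictionaries` maps entity type -> list of
    terms (supplied by hand here; production values come from NER, regex
    patterns, and customer dictionaries)."""
    mapping, counters = {}, {}
    for etype, terms in dictionaries.items():
        for term in terms:
            n = counters[etype] = counters.get(etype, 0) + 1
            token = f"[{etype}_{n}]"
            mapping[token] = term
            text = text.replace(term, token)
    return text, mapping

def unmask(text, mapping):
    """Restore original values in the AI output."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

masked, table = mask("Acme Corp shall pay $4,750,000 to GlobalTech.",
                     {"PARTY": ["Acme Corp", "GlobalTech"], "AMOUNT": ["$4,750,000"]})
print(masked)                 # → [PARTY_1] shall pay [AMOUNT_1] to [PARTY_2].
print(unmask(masked, table))  # round-trips to the original text
```

The key property is the round trip: the LLM only ever sees the masked string, while the mapping table stays inside the client environment.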
Step 6: Retrieval-Augmented Generation (RAG; Embeddings + Vector Search)

The system retrieves relevant context from multiple knowledge sources before generating any analysis. For a flagged indemnification clause, RAG pulls: the organization's preferred indemnification language from the clause library (all 3 positions), the playbook guidance note for this clause type, relevant precedent language from prior deals, and industry benchmark data. This context is injected into the LLM prompt so every output is grounded in the organization's own standards.
ContractKen uses semantic section-aware chunking rather than fixed-size text splits. Each chunk carries metadata about its position in the contract hierarchy, related definitions, and cross-references. See the RAG Architecture section below.
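The chunk metadata described here might look like the following sketch (field names are assumptions for illustration, not ContractKen's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """One retrieval unit plus the hierarchy metadata described above."""
    text: str
    section: str                                      # e.g. "9.2"
    parent: str                                       # e.g. "9 Limitation of Liability"
    definitions: list = field(default_factory=list)   # defined terms this clause uses
    cross_refs: list = field(default_factory=list)    # sections this clause cites

chunk = Chunk(
    text="Vendor's aggregate liability shall not exceed the Fees (Section 1.4).",
    section="9.2",
    parent="9 Limitation of Liability",
    definitions=["Fees"],
    cross_refs=["1.4"],
)
print(chunk.cross_refs)   # → ['1.4']
```

Carrying `definitions` and `cross_refs` on each chunk is what lets retrieval pull in the definition of "Fees" alongside the liability cap, rather than returning the cap in isolation.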
Step 7: Analysis & Redline Generation (LLM with RAG Context)

With the anonymized clause text, risk scores, playbook positions, and retrieved context assembled, the LLM generates three outputs: (1) an explanation of why the clause was flagged, (2) a mitigation strategy referencing the playbook, and (3) specific redline text using language from the clause library. The model is constrained to cite its sources - every suggestion links back to a playbook position, a clause library entry, or a precedent.
Model routing directs different tasks to different LLMs based on the requirements. Extended-reasoning models handle complex multi-clause analysis. Faster models handle straightforward substitutions. The routing layer selects the optimal model per task.
Step 8: Post-Processing & Word Integration (Office.js + Formatting)

The Moderation Layer's mapping table restores original party names and values in the output. Redlines are formatted as standard Word tracked changes using the Office.js API. Explanatory comments are inserted alongside each redline. The output is indistinguishable from manual edits - the counterparty sees normal tracked changes with no formatting artifacts or AI indicators.
The post-processor also handles defined term consistency, cross-reference validation, and numbering checks. Results appear in the ContractKen sidebar organized by severity (high risk first), with one-click navigation to each clause location in the document.
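One of the deterministic checks, cross-reference validation, can be sketched as follows (a simplified stand-in that only handles decimal section numbering):

```python
import re

def check_cross_references(text):
    """Flag 'Section X.Y' citations that don't correspond to any heading in
    the document. A simplified stand-in for the post-processor's check."""
    headings = set(re.findall(r"^(\d+(?:\.\d+)*)\s", text, flags=re.M))
    cited = set(re.findall(r"Section\s+(\d+(?:\.\d+)*)", text))
    return sorted(cited - headings)

doc = """1 Definitions
4 Limitation of Liability
4.2 The cap in Section 4.2 is subject to Section 7.1."""
print(check_cross_references(doc))   # → ['7.1'] - cited but no such heading exists
```

Checks like this must be rule-based precisely because a wrong answer is unacceptable: either Section 7.1 exists or it does not.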
The Core Technique

Natural Language Inference for Clause Compliance

ContractKen uses DeBERTa-based Natural Language Inference to determine whether a contract clause complies with, deviates from, or contradicts a playbook standard. This is fundamentally different from keyword matching.

How NLI Works in Practice

Natural Language Inference classifies the relationship between two text segments as entailment (A supports B), contradiction (A conflicts with B), or neutral (no clear relationship).

For contract review, the premise is the playbook standard and the hypothesis is the contract clause. The model determines whether the clause satisfies, violates, or partially addresses the standard.

This is critical because contracts express the same concepts in vastly different language. A limitation of liability might say "aggregate liability shall not exceed" or "total exposure is capped at" or "cumulative damages are limited to" - all expressing the same idea. Keyword matching fails here. NLI understands the semantic relationship.

ContractKen uses DeBERTa (Decoding-enhanced BERT with disentangled attention) for NLI because its disentangled attention mechanism handles long, complex legal sentences more effectively than standard BERT. The models are fine-tuned on legal corpora and evaluated using domain-specific benchmarks.

Example: Indemnification Compliance Check
Premise (Playbook)
"Vendor shall indemnify Client for breach, IP infringement, and willful misconduct, including reasonable attorneys' fees."
Hypothesis (Contract)
"Vendor shall indemnify Client against all losses arising from Vendor's negligence."
CONTRADICTION - Scope limited to negligence only. Missing: breach, IP infringement, willful misconduct, fee recovery.
Example: IP Ownership Check
Premise (Playbook)
"All work product and deliverables shall be owned by Client."
Hypothesis (Contract)
"All intellectual property created in the performance of Services shall be the sole and exclusive property of Client, including all copyrights, patents, and trade secrets therein."
ENTAILMENT - Contract clause meets and exceeds playbook standard. No action needed.
Example: Force Majeure Check
Premise (Playbook)
"Force majeure clause must include pandemic, epidemic, and government-mandated lockdowns as qualifying events."
Hypothesis (Contract)
"Neither party shall be liable for delays caused by acts of God, war, terrorism, or natural disasters."
NEUTRAL - Traditional force majeure language present but does not address pandemic/epidemic events. Recommend expanding.

Why NLI Over Keyword Matching?

Keyword-based contract analysis looks for specific words ("indemnify", "limitation", "terminate"). It breaks when contracts use synonyms, passive constructions, or nested references. NLI understands meaning at the sentence level. It can determine that "the aggregate exposure of the service provider under this instrument shall be constrained to a sum equal to the consideration received" means the same thing as "vendor liability is capped at fees paid" - even though the two sentences share almost no keywords.

Playbook Compliance

Each clause checked against preferred, fallback, and walkaway positions using entailment/contradiction scoring.

Clause Classification

Identifying clause types across 100+ categories, even when the language is non-standard or jurisdiction-specific.

Gap Detection

Determining which required clause types are absent from a contract by checking the full document against the playbook's required provisions.
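Once classification has run, gap detection reduces to a set difference; a minimal sketch with invented clause types:

```python
def find_gaps(classified_clause_types, required_types):
    """Required clause types the playbook expects but the contract lacks."""
    return sorted(set(required_types) - set(classified_clause_types))

found = ["indemnification", "termination", "governing law"]
required = ["indemnification", "limitation of liability", "termination", "force majeure"]
print(find_gaps(found, required))   # → ['force majeure', 'limitation of liability']
```

The hard part, of course, is the NLI classification that produces `found` reliably; the gap analysis itself is deliberately simple and auditable.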

Grounding AI in Your Standards

RAG Architecture for Legal Documents

Standard Retrieval-Augmented Generation fails for contracts because it ignores document structure, cross-references, and the dependency relationships between clauses. ContractKen's RAG is built for legal documents specifically.

Standard RAG

Where Generic RAG Breaks Down

  • Fixed-size chunking splits clauses mid-sentence or separates a provision from its carve-outs
  • No awareness that "as defined in Section 1.3" creates a dependency on another part of the document
  • Retrieves text by cosine similarity alone, missing structurally related provisions
  • No distinction between recitals, definitions, operative clauses, and schedules
  • Embedding models trained on general text miss legal-specific semantic relationships
ContractKen RAG

How ContractKen Handles It

  • Semantic section-aware chunking that respects clause boundaries and contract hierarchy
  • Cross-reference resolution: when a clause references "Section 4.2", that section is pulled automatically
  • Metadata enrichment: each chunk carries its parent section, related definitions, and document position
  • Multiple retrieval sources activated per task (clause library + playbook + precedents)
  • Every AI output cites which source document or playbook position it drew from
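The cross-reference expansion behavior can be sketched as follows: rank chunks by a similarity score (term overlap here as a toy stand-in for embedding similarity), then pull in any sections the top hits cite:

```python
def retrieve(query_terms, chunks, k=1):
    """Score chunks by term overlap, then expand the result set with any
    sections the top chunks cross-reference. A sketch of the behavior
    described above, not ContractKen's actual retriever."""
    def score(c):
        return len(set(query_terms) & set(c["text"].lower().split()))
    by_section = {c["section"]: c for c in chunks}
    top = sorted(chunks, key=score, reverse=True)[:k]
    result, seen = [], set()
    for c in top:
        for sec in [c["section"]] + c.get("cross_refs", []):
            if sec in by_section and sec not in seen:
                seen.add(sec)
                result.append(by_section[sec])
    return result

chunks = [
    {"section": "1.4", "text": "Fees means the amounts payable under Schedule A.", "cross_refs": []},
    {"section": "9.2", "text": "aggregate liability shall not exceed the fees", "cross_refs": ["1.4"]},
]
hits = retrieve(["liability", "fees"], chunks)
print([c["section"] for c in hits])   # → ['9.2', '1.4']
```

Pure cosine retrieval would return only 9.2; the metadata-driven expansion is what also surfaces the definition of Fees in 1.4.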

Knowledge Layers Retrieved Per Review

Clause Library
700+ pre-drafted clauses in 3 negotiation positions. When a deviation is flagged, the system retrieves the appropriate position's language as a suggested replacement.
Playbooks
The organization's defined positions (preferred, fallback, walkaway) for each clause type, along with guidance notes and negotiation reasoning.
Precedents
Prior contracts from the organization's deal history. When drafting, the system retrieves structurally similar precedents to inform clause language and deal terms.
Industry Standards
Benchmark data on market-standard positions by clause type, contract category, and jurisdiction. Used for Comprehensive Review when no playbook is configured.

Example: RAG in Action for an Indemnification Clause

Flagged Clause: "Vendor shall indemnify Client against losses arising from negligence." Classified as indemnification, scored as below fallback.
Retrieval (Playbook): Fetches the playbook rule for indemnification: preferred = breach + IP + misconduct + fees; fallback = breach + misconduct; walkaway = negligence + misconduct.
Retrieval (Clause Library): Fetches the preferred indemnification clause text (full language with IP carve-out, fee recovery, and survival provision).
Retrieval (Guidance Note): "Always push for IP carve-out in software deals. Concede fee recovery before conceding IP. Reference 2023 CloudTech precedent."
LLM Output (Grounded): Generates an explanation citing the specific deviation, suggests redline text from the clause library's preferred position, and includes a comment referencing the playbook guidance note.
Privacy by Architecture

The Moderation Layer

Confidential contract text is masked before it reaches any external AI model. This is an architectural control, enforced at the system level.

How It Works (Summary)

The Moderation Layer sits between the contract text and the AI processing layer. It intercepts outbound text, identifies confidential entities using the NER models from the extraction pipeline, applies configurable masking rules, and maintains a mapping table for de-masking the output.

  1. NER-based entity detection identifies party names, monetary values, dates, proprietary terms, and custom entity types
  2. Regex pattern matching catches structured data (email addresses, phone numbers, account numbers) that NER may miss
  3. Custom dictionaries allow organizations to define their own sensitive terms (trade names, project codes, internal product names)
  4. A mapping table maintains the relationship between masked tokens and original values for de-masking on return
  5. Configurable per organization - each team controls which entity types are masked and which custom terms are protected

A full technical deep dive is available on the Moderation Layer page.

Example: What the AI Sees

Original Contract Text
"Acme Corporation ("Buyer") shall pay GlobalTech Solutions ("Seller") the sum of $4,750,000 upon completion of the Phase 2 deliverables described in Schedule B of the Master Services Agreement dated January 15, 2026."
After Moderation Layer
"[PARTY_A] ("Buyer") shall pay [PARTY_B] ("Seller") the sum of [AMOUNT_1] upon completion of the [PROJECT_REF] deliverables described in Schedule B of the Master Services Agreement dated [DATE_1]."
AI Analysis (on masked text)
The AI analyzes contract structure, clause compliance, and risk using the masked version. It never sees "Acme Corporation", "$4,750,000", or "Phase 2". When findings are returned, the mapping table restores the original values in the output.
The Right Model for Each Task

Model Routing & Orchestration

Different tasks have different requirements. Clause classification needs precision. Entity extraction needs speed. Risk analysis needs reasoning. ContractKen routes each task to the model best suited for it.

Task | Model Type | Optimized For | Why This Model
Clause Classification | DeBERTa (NLI) | Precision | Clause type identification requires high-precision classification across 100+ categories. DeBERTa's disentangled attention handles long legal sentences where standard BERT struggles.
Entity Extraction | spaCy + Custom NER | Speed + Coverage | Entity extraction runs on every sentence in the document. It needs to be fast and comprehensive. spaCy's pipeline architecture with custom legal entity types provides both.
Playbook Compliance | DeBERTa (NLI) + Scoring Rules | Accuracy | NLI determines entailment/contradiction against each playbook position. Scoring rules map NLI output to severity levels (above preferred, below preferred, below fallback, below walkaway).
Risk Analysis & Explanation | LLM (Extended Reasoning) | Reasoning Depth | Explaining why a clause is risky and how to mitigate it requires multi-step reasoning. Extended-reasoning LLMs handle the nuance of "this clause creates risk because of its interaction with Section 4 and the definition of 'Material Adverse Change' in Section 1.2."
Redline Generation | LLM with RAG | Quality + Source Fidelity | Redline text is generated by the LLM but constrained by RAG-retrieved clause library language. The model selects from existing approved language rather than inventing new phrasing.
Formatting & Proofing | Rule-based / ML Hybrid | Determinism | Defined term consistency, cross-reference validation, and numbering checks require deterministic correctness. Rule-based checks handle structural validation; ML handles semantic checks (is this term being used consistently across contexts?).
Document Summarization | LLM (Fast) | Speed | Contract summaries need to be generated quickly for preview and triage. Faster LLMs handle summarization while heavier models are reserved for detailed analysis.

Why Multiple Models?

A single LLM cannot be simultaneously optimized for speed (entity extraction on thousands of sentences), precision (clause classification across 100+ types), reasoning depth (multi-clause risk analysis), and determinism (formatting validation). Routing tasks to specialized models means each component operates at peak performance for its specific job.

Model Swappability

The routing architecture is model-agnostic at each decision point. When a new model outperforms the current one on a specific task, it can be swapped in without rebuilding the pipeline. This is how ContractKen has evolved through four generations of AI in three years - the architecture stays stable while individual components improve.
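A swappable routing registry of this kind can be sketched in a few lines (the task names and toy "models" below are illustrative):

```python
# A model-agnostic routing registry: each task maps to a callable, and a
# better model can be swapped in without touching the rest of the pipeline.
class ModelRouter:
    def __init__(self):
        self._models = {}

    def register(self, task, model_fn):
        """Registering a task again replaces the previous model for it."""
        self._models[task] = model_fn

    def run(self, task, *args):
        return self._models[task](*args)

router = ModelRouter()
router.register("summarize", lambda text: text[:40] + "...")        # fast baseline
router.register("summarize", lambda text: "SUMMARY: " + text[:20])  # swapped-in upgrade

print(router.run("summarize", "This Agreement is made between the parties."))
```

The pipeline only ever calls `router.run("summarize", ...)`, so replacing the underlying model is invisible to every other component.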

Continuous Improvement

Fine-Tuning & the Human Feedback Loop

ContractKen's models improve over time through domain-specific fine-tuning on legal corpora and a structured feedback loop that incorporates lawyer corrections into future model behavior.

What Fine-Tuning Means in Practice

Pre-trained transformer models (BERT, DeBERTa) have broad language understanding but lack the precision needed for legal clause classification out of the box. Fine-tuning retrains the model's upper layers on labeled legal data - thousands of annotated clause examples across 100+ categories - while preserving the foundational language understanding in the lower layers.

The result: a model that understands general language structure AND recognizes that "the aggregate exposure of the service provider" means "vendor liability cap" in a contract context.

For organization-specific adaptation, fine-tuning can incorporate a firm's own contract corpus. A firm that negotiates IP-heavy technology agreements will have different clause patterns than one focused on commercial real estate leases. The fine-tuned model reflects these domain-specific patterns.

How Lawyer Feedback Improves the System

Every time a lawyer accepts, modifies, or rejects a ContractKen suggestion, that decision becomes a training signal. Accepted suggestions validate the model's judgment. Modified suggestions show where the model was directionally correct but needed refinement. Rejected suggestions identify areas where the model's reasoning diverged from the lawyer's expertise.

These signals are aggregated, reviewed, and periodically incorporated into model updates through supervised fine-tuning. The system does not retrain in real-time on individual interactions - it accumulates feedback and retrains in controlled cycles with human review of the training data.
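Aggregating feedback signals into per-clause-type accept rates, one of the patterns mentioned above, can be sketched as follows (signal format invented for illustration):

```python
from collections import Counter

def accept_rates(feedback):
    """feedback: list of (clause_type, action) pairs with action in
    {'accepted', 'modified', 'rejected'}. Returns accept rate per type."""
    totals, accepted = Counter(), Counter()
    for clause_type, action in feedback:
        totals[clause_type] += 1
        if action == "accepted":
            accepted[clause_type] += 1
    return {t: accepted[t] / totals[t] for t in totals}

signals = [("indemnification", "accepted"), ("indemnification", "modified"),
           ("termination", "accepted"), ("termination", "accepted")]
print(accept_rates(signals))   # → {'indemnification': 0.5, 'termination': 1.0}
```

Low accept rates for a clause type are exactly the pattern that flags it as a candidate for the next fine-tuning cycle.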

The Improvement Cycle
Step 1: Model Generates Output

Clause classifications, risk assessments, and redline suggestions produced for a contract review.

Step 2: Lawyer Reviews & Acts

Accepts, modifies, or rejects each suggestion. These actions are logged as feedback signals.

Step 3: Feedback Aggregated

Signals collected across reviews. Patterns identified: which clause types have high accept rates? Where does the model consistently need correction?

Step 4: Supervised Retraining

Curated training data (with human review) used to fine-tune models in controlled cycles. No automatic retraining on raw user data.

Step 5: Evaluation & Deployment

Updated models evaluated against held-out test sets (F1, precision, recall) before deployment. Performance must improve or remain stable on all benchmarks.

How We Measure Model Performance

F1 Score
Harmonic mean of precision and recall. The primary metric for clause classification accuracy across all 100+ types.
Precision
Of all clauses the model flagged as a specific type, what percentage were correct? High precision reduces false positives.
Recall
Of all actual instances of a clause type, what percentage did the model identify? High recall ensures nothing is missed.
Accept Rate
Percentage of AI suggestions accepted by lawyers without modification. A practical, production-level quality indicator.
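The three model metrics compute directly from per-clause-type counts; a quick sketch with invented counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true positive / false positive /
    false negative counts for one clause type."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# e.g. 90 correctly flagged indemnification clauses, 10 false flags, 30 missed
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(p, 2), round(r, 2), round(f1, 2))   # → 0.9 0.75 0.82
```

Because F1 is the harmonic mean, a model cannot buy a high score by over-flagging (hurting precision) or under-flagging (hurting recall).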
Security & Trust

Built for Enterprise Legal Teams

Legal AI handles some of the most sensitive documents in an organization. The architecture reflects that responsibility at every level.

Data Privacy

Confidential information is protected by architectural controls at the system level.

  • Moderation Layer masks entities before AI processing
  • No customer data used for model training
  • Data isolation per organization
  • Customer-configurable anonymization rules
  • Mapping tables discarded after processing

Infrastructure Security

Enterprise-grade security infrastructure with certification and compliance.

  • SOC 2 Type II certification in progress
  • AES-256 encryption at rest
  • TLS 1.2+ encryption in transit
  • AWS hosting with regional data residency options
  • 99.5% uptime SLA

Compliance & Governance

Designed for regulated industries and attorney-client privilege requirements.

  • GDPR and CCPA compliant
  • ISO 27001/27701 aligned
  • Customers own all AI outputs
  • Audit logging for all AI interactions
  • Role-based access controls

The Lawyer Remains the Final Decision Maker

ContractKen generates suggestions. The lawyer accepts, modifies, or rejects each one. Nothing is applied to the document without explicit human approval. Every tracked change and comment can be reviewed before the contract goes to the counterparty. AI assists the judgment - it does not replace it.

Technical FAQ

How is ContractKen different from using ChatGPT or Claude directly?

Claude / ChatGPT is a single general-purpose LLM. ContractKen is a compound AI system with multiple specialized models working in a coordinated pipeline. Clause classification uses fine-tuned DeBERTa (NLI). Entity extraction uses spaCy NER. Risk scoring uses playbook-specific classifiers. The LLM is one component in an 8-step pipeline, constrained by RAG context and the Moderation Layer. A single LLM call cannot reliably parse document structure, classify 100+ clause types, check playbook compliance, AND generate accurate redlines. Each task requires a model optimized for that specific job.
Which AI models does ContractKen use?

ContractKen uses multiple models, each selected for a specific task. DeBERTa (fine-tuned on legal corpora) handles clause classification and playbook compliance via Natural Language Inference. spaCy with custom entity types handles Named Entity Recognition. Large language models handle analysis, explanation, and redline generation, grounded by RAG context. Rule-based systems handle formatting validation and cross-reference checking. The routing layer selects the optimal model for each task in the pipeline.
How does ContractKen prevent LLM hallucinations?

Three controls. First, the LLM is constrained by RAG - it retrieves specific clause library language, playbook positions, and precedent text before generating any output, and is required to cite its sources. Second, upstream NLI classification provides an independent check on clause identification - the LLM does not decide what type of clause it's looking at; DeBERTa already classified it. Third, the human-in-the-loop design means every suggestion is reviewed by the lawyer before it is applied to the document. The LLM generates proposals, not final outputs.
Can ContractKen be customized to our organization?

Yes, at multiple levels. Playbooks codify your organization's specific negotiation positions and risk thresholds. The clause library can include your proprietary clause language alongside the 700+ pre-built clauses. The RAG knowledge layer retrieves from your precedents and standards. The Moderation Layer's anonymization rules are configurable per organization. For enterprise deployments, fine-tuning on your contract corpus is available to improve classification accuracy for your specific clause patterns and drafting conventions.
How is confidential contract data protected?

Contract text passes through the Moderation Layer before reaching any external AI model. The Moderation Layer masks confidential entities (party names, amounts, proprietary terms) using NER and configurable rules. The masked text is processed by the AI pipeline. The mapping table restores original values in the output. No customer data is used for model training. Data is encrypted at rest (AES-256) and in transit (TLS 1.2+).
How accurate are the models?

ContractKen evaluates clause classification using F1 score, precision, and recall on held-out legal test sets. The fine-tuned DeBERTa models are trained on thousands of annotated clause examples across 100+ categories. Published research on similar architectures (Legal-BERT on construction contracts) has achieved F1 scores above 0.93. ContractKen's production models are evaluated against domain-specific benchmarks before every deployment, and performance must remain stable or improve to proceed.
What is Natural Language Inference, and why does it matter here?

Natural Language Inference (NLI) determines the logical relationship between two text segments: entailment (one supports the other), contradiction (they conflict), or neutral (no clear relationship). In contract review, NLI compares each clause against the playbook standard to determine compliance. This is more reliable than keyword matching because contracts express the same concepts in vastly different language across jurisdictions, drafting styles, and industries. DeBERTa's disentangled attention mechanism handles the long, complex sentence structures common in legal drafting.
How long has ContractKen been building this technology?

ContractKen has been building contract AI since 2022, through four phases. Phase 1 (2022-2023) built the NLP foundation: fine-tuned BERT for clause detection, KNN-based matching for standard clause recognition, and NER for entity extraction. Phase 2 (2023-2024) added intelligence: DeBERTa for NLI-based compliance checking, multi-model routing, and domain-specific fine-tuning. Phase 3 (2024-2025) integrated LLMs with RAG grounding and the Moderation Layer for privacy. Phase 4 (2025-present) orchestrates the full compound AI system with playbook enforcement, precedent-based drafting, and analytics. The architecture has remained stable while individual components have been upgraded through four generations.

ContractKen uses a compound AI system for contract review and drafting inside Microsoft Word. The pipeline includes DeBERTa-based Natural Language Inference for clause classification and playbook compliance, spaCy NER for entity extraction, Retrieval-Augmented Generation grounded against clause libraries and playbooks, and large language model integration for analysis and redline generation. The Moderation Layer masks confidential information before AI processing using NER, regex patterns, and customer-configurable dictionaries. The system has evolved through four phases since 2022, progressing from fine-tuned BERT models to a multi-model orchestration architecture with task-specific routing.