Automating Lease Abstract Extraction with Python

A single 90-page commercial lease can hold every input a CAM reconciliation needs — base rent, the tenant’s pro-rata share, the CAM inclusion and exclusion list, an expense cap, a fiscal-year definition, and audit rights — scattered across nested clauses, cross-references, and exhibit tables. Abstracting those fields by hand is the slow, error-prone step that stalls reconciliation before a single number is allocated. This page is a focused implementation recipe for turning a text-layer lease PDF into a typed, confidence-scored record that loads straight into the lease abstraction database, the parent stage this workflow feeds.

Figure: Three-stage lease-abstraction pipeline. Every lease produces a scored record, and the confidence score, not a hard pass/fail, decides whether it auto-commits, waits for review, or falls back and flags.

Context & When to Use This Approach

Reach for a Python abstraction pipeline when you are onboarding a portfolio faster than an abstractor can key it — a new acquisition of forty tenancies, an annual re-abstraction cycle, or a migration off spreadsheets into a queryable schema. The concrete triggers are consistent across a CRE portfolio:

Born-digital leases with a real text layer. pdfplumber reads the embedded characters and their coordinates directly, so a lease exported from a document-management system extracts cleanly. If the document is a photographed or faxed scan with no text layer, it belongs on the OCR path first, then rejoins this pipeline.
Repeated clause shapes you can pattern-match. Base rent, pro-rata percentages, cap language, and fiscal-year definitions recur with enough regularity that regex plus named-entity extraction captures the operative values without a bespoke parser per landlord.
Ambiguous Triple Net (NNN) language. Phrases like “all expenses incurred by Landlord” with no explicit carve-out are where a wrong inclusion silently inflates every tenant’s recovery. Here you want a confidence score and a review queue, not a blind default.

If your leases are clean, single-column, and few, manual abstraction is faster than building this. The pipeline earns its keep at volume, or wherever the same lease must be re-abstracted every reconciliation cycle and you need a reproducible, auditable record instead of a fresh interpretation each year. The extraction mechanics here are a close cousin of coordinate-aware PDF extraction with pdfplumber on the invoice side of the platform — same library, different document grammar.

Step-by-Step Implementation

1. Segment the lease into logical sections

Raw lease text is unusable until it is cut into the sections a CAM reconciliation cares about — Rent, CAM, Exclusions, Pro-Rata, Audit Rights. Use pdfplumber to pull each page’s text with layout preserved, then split on the lease’s own article headings rather than on page breaks, which fall mid-clause.

import re
import pdfplumber

# Article headings vary by landlord; match the common numbered/keyword forms.
SECTION_HEADINGS = {
    "rent": re.compile(r"(?im)^\s*(article|section)?\s*\d*\.?\s*(base\s+rent|rent)\b"),
    "cam": re.compile(r"(?im)^\s*(article|section)?\s*\d*\.?\s*(common\s+area|operating\s+expenses|cam)\b"),
    "exclusions": re.compile(r"(?im)^\s*exclusions?\s+from\s+(operating\s+expenses|cam)\b"),
    "pro_rata": re.compile(r"(?im)^\s*(tenant'?s?\s+)?(pro\s*[- ]?rata|proportionate)\s+share\b"),
    "audit": re.compile(r"(?im)^\s*(audit|inspection)\s+rights?\b"),
}


def load_sections(pdf_path: str) -> dict[str, str]:
    """Return lease text keyed by reconciliation section, preserving layout order."""
    with pdfplumber.open(pdf_path) as pdf:
        full_text = "\n".join(
            page.extract_text(layout=True) or "" for page in pdf.pages
        )

    hits: list[tuple[int, str]] = []
    for name, pattern in SECTION_HEADINGS.items():
        for match in pattern.finditer(full_text):
            hits.append((match.start(), name))
    hits.sort()

    sections: dict[str, str] = {}
    for idx, (start, name) in enumerate(hits):
        end = hits[idx + 1][0] if idx + 1 < len(hits) else len(full_text)
        # Keep the first occurrence; later duplicates are usually cross-references.
        sections.setdefault(name, full_text[start:end].strip())
    return sections

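extract_text(layout=True) pads its output with whitespace to approximate each page’s spatial layout, which helps a two-column definitions exhibit stay aligned instead of collapsing into one run of text. pdfplumber documents this mode as experimental, so spot-check multi-column pages. Splitting on headings instead of pages means a CAM clause that runs across a page boundary stays whole.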
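To see the heading-based split without a PDF on hand, the same logic can be exercised on an in-memory string. This is a simplified sketch: `DEMO_HEADINGS`, `split_on_headings`, and the sample lease text are hypothetical, and only two section patterns are included.

```python
import re

# Hypothetical, trimmed-down heading patterns for illustration only.
DEMO_HEADINGS = {
    "rent": re.compile(r"(?im)^\s*(?:article|section)?\s*\d*\.?\s*base\s+rent\b"),
    "cam": re.compile(r"(?im)^\s*(?:article|section)?\s*\d*\.?\s*operating\s+expenses\b"),
}


def split_on_headings(full_text: str) -> dict[str, str]:
    """Cut text at each heading match; the first occurrence of a section wins."""
    hits = sorted(
        (match.start(), name)
        for name, pattern in DEMO_HEADINGS.items()
        for match in pattern.finditer(full_text)
    )
    sections: dict[str, str] = {}
    for idx, (start, name) in enumerate(hits):
        end = hits[idx + 1][0] if idx + 1 < len(hits) else len(full_text)
        sections.setdefault(name, full_text[start:end].strip())
    return sections


# A page break (form feed) falls in the middle of the CAM clause.
sample = (
    "Article 3. Base Rent\n"
    "Tenant shall pay $12,500.00 per month.\n"
    "Article 5. Operating Expenses\n"
    "Tenant shall pay its share of all costs\n"
    "\f"
    "incurred in operating the Building.\n"
)
demo_sections = split_on_headings(sample)
print(demo_sections["cam"])
```

Because the cut points are heading offsets rather than page boundaries, the second half of the CAM sentence stays attached to its section even though a page break sits inside it.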

2. Extract the operative values from each clause

With sections isolated, pull the numbers a reconciliation actually consumes: the pro-rata percentage, the base rent, and any expense cap. Monetary and share values feed downstream math, so parse them into Decimal immediately — never float, which drifts by fractions of a cent and fails audit ties.

from decimal import Decimal, InvalidOperation

PCT = re.compile(r"(\d{1,3}(?:\.\d+)?)\s*%")
MONEY = re.compile(r"\$\s*([\d,]+(?:\.\d{2})?)")
CAP = re.compile(r"(?i)(cap|ceiling|shall not increase by more than)\D{0,40}?(\d{1,3}(?:\.\d+)?)\s*%")


def parse_decimal_pct(text: str) -> Decimal | None:
    """Convert a percentage string to a fractional Decimal share (e.g. '4.75%' -> 0.0475)."""
    match = PCT.search(text)
    if not match:
        return None
    try:
        return (Decimal(match.group(1)) / Decimal("100")).quantize(Decimal("0.000001"))
    except InvalidOperation:
        return None


def parse_money(text: str) -> Decimal | None:
    match = MONEY.search(text)
    if not match:
        return None
    try:
        return Decimal(match.group(1).replace(",", ""))
    except InvalidOperation:
        return None

Build each Decimal from the cleaned string, never from an intermediate float, so $1,234.56 becomes Decimal("1234.56") exactly. A pro-rata share stored as Decimal("0.0475") multiplies cleanly against a recoverable pool later without rounding surprises.
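The CAP pattern is defined above but never exercised. One way it might be wrapped is sketched below, assuming the cap is kept as a percentage figure (5 rather than 0.05); `parse_cap_pct` and the sample clauses are hypothetical. Note that the pattern matches "cap" without a word boundary, so a word such as "capital" followed by a nearby percentage would also match.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Same cap pattern as above, repeated so this sketch runs on its own.
CAP = re.compile(
    r"(?i)(cap|ceiling|shall not increase by more than)\D{0,40}?(\d{1,3}(?:\.\d+)?)\s*%"
)


def parse_cap_pct(text: str) -> Optional[Decimal]:
    """Return the cap as a percentage figure (e.g. '5%' -> Decimal('5')), or None."""
    match = CAP.search(text)
    if not match:
        return None
    try:
        # group(2) is the numeric part; group(1) is the trigger phrase.
        return Decimal(match.group(2))
    except InvalidOperation:
        return None


clause = (
    "Controllable Operating Expenses shall not increase by more than "
    "five percent (5%) per calendar year."
)
cap = parse_cap_pct(clause)
no_cap = parse_cap_pct("Tenant shall pay its Proportionate Share of 4.75%.")
```

The second call returns None because the clause contains a percentage but no cap language, which is the behaviour you want when a pro-rata figure and a cap sit in neighbouring sentences.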

3. Score confidence on ambiguous NNN language

An NNN CAM clause that lists inclusions but no exclusions is the classic trap: taken literally it sweeps capital improvements and structural repairs into the recoverable pool. Rather than guess, attach a confidence score to every extracted clause and let the score decide its fate. A Pydantic model gives you validation and a typed contract in one step, and pre-compiled regular expression patterns keep the matching fast across a high-volume batch.

from pydantic import BaseModel, Field


class CAMClause(BaseModel):
    clause_text: str
    category: str
    inclusion_flag: bool
    confidence_score: float = Field(ge=0.0, le=1.0)


EXCLUSION_PATTERNS = [
    re.compile(r"(?i)(excluding|except for|shall not include)"),
    re.compile(r"(?i)(capital improvements|structural repairs|tenant improvements)"),
]
INCLUSION_PATTERNS = [
    re.compile(r"(?i)(operating expenses|common area maintenance|pass-through)"),
]


def evaluate_nnn_clause(text: str) -> CAMClause:
    """Flag a CAM clause as included/excluded and score how sure we are."""
    has_exclusion = any(p.search(text) for p in EXCLUSION_PATTERNS)
    has_inclusion = any(p.search(text) for p in INCLUSION_PATTERNS)

    if has_inclusion and has_exclusion:
        confidence = 0.95   # explicit inclusions AND carve-outs: unambiguous
    elif has_inclusion:
        confidence = 0.60   # inclusions with no exclusions: the NNN trap
    else:
        confidence = 0.40   # no operative language matched
    return CAMClause(
        clause_text=text,
        category="NNN_Operating_Expense",
        inclusion_flag=has_inclusion,
        confidence_score=confidence,
    )

Confidence is a model-quality signal, not money, so float is appropriate here. Reserve Decimal for anything that reaches a reconciliation calculation. The scoring rules are deliberately conservative: a clause that names inclusions but no exclusions scores low precisely because that is where landlords and tenants most often disagree. Mapping the specific carve-outs a clause does contain is the job of a dedicated CAM expense exclusion tracking pass.
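One limitation of the keyword check above is that a bare mention of "capital improvements" counts as an exclusion even when the clause sweeps them in. A possible refinement, sketched here, requires a carve-out verb to be followed closely by a named item. `CARVE_OUT`, `has_explicit_carve_out`, and the 80-character window are assumptions for illustration, not part of the pipeline above.

```python
import re

# Hypothetical refinement: a carve-out verb must be followed, within a short
# window, by one of the named cost items. The 80-character window is an assumption.
CARVE_OUT = re.compile(
    r"(?is)\b(excluding|except\s+for|shall\s+not\s+include)\b"
    r".{0,80}?"
    r"\b(capital\s+improvements|structural\s+repairs|tenant\s+improvements)\b"
)


def has_explicit_carve_out(text: str) -> bool:
    """True only when a named item is actually carved out, not merely mentioned."""
    return CARVE_OUT.search(text) is not None


carved = "Operating Expenses means all costs, excluding capital improvements."
swept_in = "Operating Expenses include all costs, including capital improvements."

print(has_explicit_carve_out(carved))
print(has_explicit_carve_out(swept_in))
```

A check like this could replace the `has_exclusion` test, so that only clauses with a genuine carve-out earn the high-confidence branch. It is still pattern matching, and a clause that lists its exclusions in a separate section will need the cross-reference handling described under Gotchas.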

4. Route each record by its confidence threshold

Never let extraction hard-fail on a bad scan or an unusual clause. Route on the score so every lease produces some auditable outcome:

≥ 0.90: auto-commit to the abstraction database.
≥ 0.70 and < 0.90: send to a human-in-the-loop review queue with the source page and clause boundaries pre-highlighted.
< 0.70: apply a portfolio-standard fallback (for example the building-average pro-rata share) and flag the record for post-reconciliation adjustment.

Every routing decision logs its reason, the source page, and a timestamp, so an auditor can reconstruct exactly why a value was auto-committed, corrected, or defaulted.
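The routing rules can be sketched as a small pure function that returns the status together with its audit fields. `RoutingDecision` and `route_record` are hypothetical names; the status strings follow the ones used in the schema in the next step, and the thresholds are the ones listed above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

AUTO_COMMIT_MIN = 0.90   # thresholds taken from the routing rules above
REVIEW_MIN = 0.70


@dataclass(frozen=True)
class RoutingDecision:
    status: str            # auto_approved | review_pending | fallback_applied
    reason: str            # human-readable explanation for the audit trail
    source_page: int       # page the clause was extracted from
    decided_at: datetime   # timezone-aware UTC timestamp


def route_record(confidence: float, source_page: int) -> RoutingDecision:
    """Map a confidence score to a status and record why it was chosen."""
    if confidence >= AUTO_COMMIT_MIN:
        status = "auto_approved"
        reason = f"confidence {confidence:.3f} meets auto-commit floor {AUTO_COMMIT_MIN:.2f}"
    elif confidence >= REVIEW_MIN:
        status = "review_pending"
        reason = f"confidence {confidence:.3f} is in the review band"
    else:
        status = "fallback_applied"
        reason = f"confidence {confidence:.3f} is below review floor {REVIEW_MIN:.2f}"
    return RoutingDecision(status, reason, source_page, datetime.now(timezone.utc))


decision = route_record(0.60, source_page=14)
print(decision.status, "|", decision.reason)
```

Using half-open comparisons means a score such as 0.895 lands in review instead of falling between two bands.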

5. Persist to the abstraction schema

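The output feeds a normalized schema that separates lease metadata, CAM definitions, and allocation rules so reconciliation queries can JOIN cleanly and amendments can be tracked over time. Store money and shares as SQL Numeric, which round-trips Decimal exactly on backends with a native decimal type such as PostgreSQL. SQLite has no native decimal type, so SQLAlchemy warns there and values may pass through floating point.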

from decimal import Decimal
from sqlalchemy import Numeric, String, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class LeaseAbstract(Base):
    __tablename__ = "lease_abstracts"

    id: Mapped[int] = mapped_column(primary_key=True)
    lease_id: Mapped[str] = mapped_column(String, index=True)
    clause_text: Mapped[str] = mapped_column(String)
    pro_rata_share: Mapped[Decimal] = mapped_column(Numeric(9, 6))   # fractional share
    cap_pct: Mapped[Decimal | None] = mapped_column(Numeric(7, 4), nullable=True)  # percentage figure; room for 100.0000
    confidence: Mapped[float] = mapped_column()
    status: Mapped[str] = mapped_column(String)  # auto_approved | review_pending | fallback_applied


def commit_extraction(record: dict[str, object], engine_url: str) -> int:
    """Persist one abstracted lease record and return its primary key."""
    engine = create_engine(engine_url)
    Base.metadata.create_all(engine)
    with Session(engine) as session:
        row = LeaseAbstract(**record)
        session.add(row)
        session.commit()
        return row.id

To keep line items comparable across properties, resolve each extracted category against a controlled vocabulary before it lands here — the discipline covered in standardizing CAM taxonomies across portfolios, which stops “HVAC Maintenance” and “Mechanical Systems Servicing” from fracturing into two categories.
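A minimal sketch of that resolution step follows, assuming a small in-memory lookup. `CANONICAL_CATEGORIES`, its entries, and `normalize_category` are hypothetical; a real vocabulary would more likely live in a reference table.

```python
from typing import Optional

# Hypothetical controlled vocabulary keyed by lower-cased, whitespace-collapsed labels.
CANONICAL_CATEGORIES = {
    "hvac maintenance": "HVAC",
    "mechanical systems servicing": "HVAC",
    "landscaping": "Grounds",
    "snow removal": "Grounds",
}


def normalize_category(raw_label: str) -> Optional[str]:
    """Return the canonical category, or None so unknown labels go to review."""
    key = " ".join(raw_label.lower().split())
    return CANONICAL_CATEGORIES.get(key)


print(normalize_category("HVAC Maintenance"))
print(normalize_category("  Mechanical   Systems Servicing "))
print(normalize_category("Roof Replacement"))
```

Returning None for an unmapped label, instead of passing the raw string through, keeps a new spelling from silently creating a new category.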

Gotchas & Known Limitations

No text layer, no extraction. pdfplumber reads embedded characters, not pixels. A scanned or photographed lease returns empty strings; detect the empty-text case and route the document to OCR before it reaches Stage 1, or you will silently abstract nothing.
Cross-references defeat single-section matching. A CAM clause that says “subject to the exclusions in Section 7.3” needs both sections resolved together. Keep the whole section text with the record so a reviewer can follow the reference; do not extract on the CAM paragraph alone.
Percentage vs. rentable-square-foot shares. Some leases state a fixed pro-rata percentage; others define it as tenant RSF over building RSF and expect you to compute it. Detect which form the clause uses — a stray % match on an unrelated escalation figure will otherwise poison the share.
Amendments override the original. The base lease and a later amendment can both define CAM. Version the schema on effective_date and let the most recent controlling document win, or reconciliation will use stale terms.
Regex is a first pass, not the whole answer. Pattern matching gets you to a scored draft; low-confidence records are meant for human review, not auto-commit. Treat the 0.70 threshold as a floor to tune against your own false-positive rate, not gospel.
Money as float breaks audit ties. Any value that reaches a reconciliation calculation must be Decimal. A cap parsed as 0.05 float will not tie to the penny against a landlord’s own worksheet.
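Three of these gotchas lend themselves to small guard functions. The sketch below is illustrative only: `needs_ocr`, `share_from_rsf`, `controlling_document`, the 20-character cutoff, and the sample values are all assumptions, not part of the pipeline above.

```python
from datetime import date
from decimal import Decimal


def needs_ocr(page_texts: list[str], min_chars: int = 20) -> bool:
    """True when no page yields meaningful text, i.e. there is probably no text layer."""
    return all(len((text or "").strip()) < min_chars for text in page_texts)


def share_from_rsf(tenant_rsf: Decimal, building_rsf: Decimal) -> Decimal:
    """Compute a fractional pro-rata share from rentable square feet."""
    if building_rsf <= 0:
        raise ValueError("building RSF must be positive")
    return (tenant_rsf / building_rsf).quantize(Decimal("0.000001"))


def controlling_document(versions: list[tuple[date, str]], as_of: date) -> str:
    """Return the label of the latest version effective on or before as_of."""
    eligible = [version for version in versions if version[0] <= as_of]
    if not eligible:
        raise LookupError("no document in effect on that date")
    return max(eligible)[1]


# A scan with no text layer yields empty page strings.
print(needs_ocr(["", "   ", ""]))

# 4,750 RSF in a 100,000 RSF building.
print(share_from_rsf(Decimal("4750"), Decimal("100000")))

versions = [
    (date(2019, 1, 1), "base lease"),
    (date(2022, 7, 1), "first amendment"),
]
print(controlling_document(versions, as_of=date(2023, 12, 31)))
```

Running `needs_ocr` on the per-page strings before segmentation gives the empty-text case an explicit route instead of an empty abstract.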

Verification

Confirm correctness before a record is trusted, using assertions that mirror the lease math rather than the parser’s own output:

from decimal import Decimal


def verify_abstract(record: dict[str, object]) -> None:
    share = record["pro_rata_share"]
    assert isinstance(share, Decimal), "share must be Decimal, not float"
    # A pro-rata share is a fraction of the building; it can never exceed 1.
    assert Decimal("0") < share <= Decimal("1"), f"implausible share: {share}"

    cap = record.get("cap_pct")
    if cap is not None:
        assert Decimal("0") < cap <= Decimal("100"), f"cap out of range: {cap}"

    # Low-confidence records must never carry an auto_approved status.
    if record["confidence"] < 0.90:
        assert record["status"] != "auto_approved", "under-confident record was auto-committed"


verify_abstract({
    "pro_rata_share": Decimal("0.0475"),
    "cap_pct": Decimal("5.0"),
    "confidence": 0.95,
    "status": "auto_approved",
})

Beyond per-record assertions, spot-check the pipeline against a hand-abstracted control set: run ten leases an analyst has already keyed, and confirm the auto-committed shares reconstruct the analyst’s figures to six decimal places. Where they diverge, the divergence is your signal to tighten a pattern or raise the auto-commit threshold, not to loosen the assertion. Cross-check the extracted shares against a full pro-rata allocation run: the tenant shares in a building should sum to 1 (or to the leased-occupancy fraction under a gross-up), which catches a missed or double-counted tenancy the per-record checks cannot see.
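The building-level cross-check can be sketched as a sum with a small tolerance. `check_building_shares`, the suite labels, and the one-millionth tolerance are hypothetical; pass a different `expected_total` when a gross-up applies.

```python
from decimal import Decimal


def check_building_shares(
    shares: dict[str, Decimal],
    expected_total: Decimal = Decimal("1"),
    tolerance: Decimal = Decimal("0.000001"),
) -> Decimal:
    """Sum tenant shares for one building and raise if they miss the expected total."""
    total = sum(shares.values(), Decimal("0"))
    if abs(total - expected_total) > tolerance:
        raise ValueError(
            f"shares sum to {total}, expected {expected_total} (tolerance {tolerance})"
        )
    return total


building = {
    "Suite 100": Decimal("0.475000"),
    "Suite 200": Decimal("0.325000"),
    "Suite 300": Decimal("0.200000"),
}
total = check_building_shares(building)

# Dropping a tenancy makes the building fall short, which the check reports.
missing_one = {k: v for k, v in building.items() if k != "Suite 300"}
try:
    check_building_shares(missing_one)
except ValueError as exc:
    print(exc)
```

Starting the sum from `Decimal("0")` keeps the arithmetic in Decimal throughout, consistent with the rule that no reconciliation figure passes through float.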

Once records clear these checks they load into the lease abstraction database as the reconciliation’s source of truth; for the clause-level mapping that decides which line items are recoverable in the first place, continue to mapping NNN lease clauses to CAM categories.

Related

Building a Lease Abstraction Database: the parent stage this extractor feeds, and the normalized schema that holds every abstracted field.
How to Map NNN Lease Clauses to CAM Categories: turning the clauses you extract here into recoverable-vs-excluded category assignments.
Best Practices for CAM Expense Exclusion Tracking: capturing the carve-outs that determine a clause’s confidence score.
