PDF Invoice Extraction with Python and pdfplumber

Commercial real estate CAM reconciliation demands deterministic data ingestion, and that determinism is first won or lost at the moment a vendor’s PDF is turned into structured rows. This page covers the extraction stage of the Automated Invoice Parsing & Data Ingestion pipeline: it converts digitally generated HVAC, landscaping, janitorial, and utility statements into candidate line items that later stages can type, validate, map, and reconcile. Invoices keyed in manually or parsed with brittle regular expressions introduce reconciliation drift, so a coordinate-aware extraction layer built on pdfplumber establishes a repeatable foundation across multi-tenant portfolios. The design prioritizes geometry-driven table detection, exact decimal money handling, and deterministic schema enforcement to minimize manual review cycles and accelerate month-end close.

Prerequisites & Data Contracts

Coordinate-aware extraction does not run in isolation; it consumes a stored document and emits a candidate row set that downstream stages have already agreed to accept. Three contracts must exist before this stage executes.

An acquired, immutable source file. The upstream acquisition stage of the ingestion pipeline writes each vendor PDF to write-once storage and computes a SHA-256 fingerprint. Extraction receives a file path plus that source_sha256, and it binds every row it produces back to the hash. Extraction never mutates the original bytes — a prerequisite for the audit trail that a CAM reconciliation must reproduce on demand.
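
As a minimal sketch of the fingerprint this contract relies on, the helper below computes a SHA-256 digest by reading the file in blocks. The function name `fingerprint_pdf` and the block size are assumptions made for illustration; the acquisition stage described above may compute the hash differently.

```python
import hashlib
from pathlib import Path


def fingerprint_pdf(pdf_path: Path, block_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks.

    Hypothetical helper: the name and block size are illustrative only.
    """
    digest = hashlib.sha256()
    # Open read-only so the source bytes are never mutated.
    with pdf_path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

The resulting hex string is what would be passed as `source_sha256` alongside the file path.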

A digitally generated (vector) PDF, not a scan. pdfplumber reads the text and vector-drawing objects embedded in a PDF; it has no OCR of its own. Statements from national property-service vendors are almost always vector PDFs and are the target of this stage. Photographed or scanned receipts contain no extractable text layer and must be routed to the OCR path described in the parent pipeline before they reach code here. The extractor’s job is to fail loudly on a text-less page rather than emit empty rows silently.

A downstream schema to fill. Extraction converges on one canonical record shape whose monetary fields are Decimal and whose service_date is kept distinct from invoice_date. Both fields matter because GAAP period-matching assigns a cost to the reconciliation year in which the work was performed, not the year it was billed. The full type battery — cross-field rules, quarantine dispositions, and fixtures — is owned by schema validation for parsed expense data; this stage produces rows that are shaped correctly so that gate has something concrete to check. The vocabulary those rows are eventually sorted into is the expense taxonomy described in defining CAM expense categories in commercial leases.

The data contract in one line: extraction takes (path, source_sha256) and returns a list of candidate rows, each a dictionary of raw string cells plus page/coordinate metadata — untyped by design, because coercion and business rules belong to the validation gate, not the parser.

Table-Detection Algorithm & Extraction Geometry

pdfplumber excels at extracting tabular data by analyzing PDF drawing objects and text positioning. For CRE invoices, vendor layouts vary significantly across HVAC, landscaping, and utility providers. Relying solely on string matching fails when column headers shift, line items wrap, or vendors use sparse grid lines. The reliable approach is geometric: reconstruct the table from the page’s ruled lines and the coordinates of each word, rather than guessing structure from whitespace.

Two detection strategies dominate. The lines strategy builds the cell grid from the invoice’s actual ruled borders and their intersection points, which suits vendors that render a bordered expense table. The text strategy infers column boundaries from the alignment of word bounding boxes and is the fallback for borderless statements. Two tolerance settings decide correctness: the intersection tolerances (how close a horizontal and a vertical edge must come to count as a crossing) and the snap tolerance (the maximum distance at which two nearly coincident parallel lines are snapped onto a single gridline).
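
The choice between the two strategies can be sketched as a pair of settings dictionaries and a selector. The keys are pdfplumber table-settings names, and the lines-strategy values mirror the ones used in the implementation on this page. The text-strategy values, the function `choose_table_settings`, and the `min_ruled_lines` threshold are assumptions for illustration and would need tuning per vendor template.

```python
from typing import Dict, Union

Setting = Union[str, int]

# Bordered tables: build the grid from ruled lines.
LINES_SETTINGS: Dict[str, Setting] = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "intersection_x_tolerance": 8,
    "intersection_y_tolerance": 8,
    "snap_tolerance": 4,
}

# Borderless statements: infer boundaries from word positions.
# These numbers are illustrative, not tuned values.
TEXT_SETTINGS: Dict[str, Setting] = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "text_x_tolerance": 3,
    "text_y_tolerance": 3,
    "min_words_vertical": 3,
}


def choose_table_settings(
    ruled_line_count: int, min_ruled_lines: int = 4
) -> Dict[str, Setting]:
    """Pick the lines strategy when the page has enough ruled edges."""
    if ruled_line_count >= min_ruled_lines:
        return LINES_SETTINGS
    return TEXT_SETTINGS
```

One plausible way to obtain the count is `len(page.lines) + len(page.rects)` on a pdfplumber page, although the right threshold depends on how a given vendor draws its grid.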

Assigning a word to a column is a coordinate-comparison problem. Each word carries a bounding box $(x_0,\ x_1)$; its horizontal midpoint

$$x_{\text{mid}} = \frac{x_0 + x_1}{2}$$

falls into the column whose left and right gridlines bracket it, $c_j \le x_{\text{mid}} < c_{j+1}$. Getting the tolerances right is what keeps a right-aligned amount from bleeding into the description column or a wrapped line item from splitting across two rows.
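
The midpoint rule can be sketched with a binary search over sorted gridline positions. The function name `column_index` and the sample coordinates are hypothetical; pdfplumber performs its own cell assignment internally, so this only illustrates the comparison described above.

```python
from bisect import bisect_right
from typing import List, Optional


def column_index(x0: float, x1: float, gridlines: List[float]) -> Optional[int]:
    """Return j such that gridlines[j] <= x_mid < gridlines[j + 1].

    `gridlines` must be sorted ascending. Returns None when the midpoint
    falls outside the grid.
    """
    x_mid = (x0 + x1) / 2
    j = bisect_right(gridlines, x_mid) - 1
    if j < 0 or j >= len(gridlines) - 1:
        return None
    return j


# Hypothetical column edges: description | quantity | amount.
edges = [50.0, 300.0, 400.0, 500.0]
# A word spanning 290..330 straddles the 300 gridline; its midpoint (310)
# places it in column 1 rather than the description column.
straddling_column = column_index(290.0, 330.0, edges)
```

Because the interval is half-open, a midpoint that lands exactly on a gridline belongs to the column on its right, and a midpoint on the last gridline is treated as outside the table.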

Money is the one field where representation, not just position, must be exact. A parsed amount is only trustworthy once it is an exact base-10 value quantized to whole cents:

$$\text{gross\_amount} = \big(\text{net} + \text{tax}\big)\ \text{quantized to } \$0.01$$

Binary float cannot represent most cent values exactly, so a pipeline that sums hundreds of float line items drifts by fractions of a cent and produces a recoverable pool that fails to tie out. Every monetary value therefore flows through Python’s decimal module from the instant it leaves the PDF cell — a rule this stage shares with the pro rata share calculation under BOMA standards that eventually consumes these amounts.
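
A small demonstration of the drift, together with a sketch of the quantization rule. The helper name `gross_amount` and the `ROUND_HALF_UP` rounding mode are assumptions; the rounding convention a given lease or ledger requires may differ.

```python
from decimal import ROUND_HALF_UP, Decimal

CENT = Decimal("0.01")


def gross_amount(net: Decimal, tax: Decimal) -> Decimal:
    """Add net and tax, then quantize to whole cents.

    ROUND_HALF_UP is an assumed convention, not one stated by the pipeline.
    """
    return (net + tax).quantize(CENT, rounding=ROUND_HALF_UP)


# Accumulate ten dimes one addition at a time, once in float and once in Decimal.
float_total = 0.0
decimal_total = Decimal("0.00")
for _ in range(10):
    float_total += 0.10
    decimal_total += Decimal("0.10")

float_ties_out = float_total == 1.0                  # False: binary drift
decimal_ties_out = decimal_total == Decimal("1.00")  # True: exact base-10
```

The float comparison fails after only ten additions, which is why an exact tie-out against a printed subtotal is only meaningful when the amounts never pass through `float`.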

Python Implementation

The following implementation uses page geometry to isolate the primary expense table, extract raw rows, and normalize headers into a structured dictionary. It filters out footers, single-column text blocks, and empty rows, and it tags every record with page and source metadata so lineage survives into persistence.

from __future__ import annotations

import logging
from decimal import Decimal
from typing import Dict, List, Optional

import pdfplumber

logger = logging.getLogger(__name__)

TABLE_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "intersection_y_tolerance": 8,
    "intersection_x_tolerance": 8,
    "snap_tolerance": 4,
}


def extract_cam_invoice_tables(
    pdf_path: str,
    source_sha256: str = "",
    min_cols: int = 4,
) -> List[Dict[str, object]]:
    """Extract line items from a vector CAM invoice via geometry-driven tables.

    Reconstructs the expense grid from the PDF's ruled lines, normalizes the
    header row into stable snake_case keys, and returns one dict per line item.
    Values are left as raw strings on purpose: coercion and business rules are
    the validation gate's job, not the parser's. Every row is bound to the
    source hash so lineage survives into the reconciliation ledger.
    """
    extracted: List[Dict[str, object]] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            if not page.extract_text():
                # No text layer => scanned page; hand off to the OCR path.
                logger.warning("no text layer on page %d of %s", page_num, pdf_path)
                continue

            for table in page.extract_tables(table_settings=TABLE_SETTINGS):
                if not table or len(table) < 2:
                    continue

                headers = [
                    h.strip().lower().replace(" ", "_") if h else f"col_{i}"
                    for i, h in enumerate(table[0])
                ]
                if len(headers) < min_cols:
                    continue  # not the expense table (likely a header/footer block)

                for row in table[1:]:
                    if not any(cell and cell.strip() for cell in row):
                        continue
                    record: Dict[str, object] = dict(zip(headers, row))
                    record["_meta_page"] = page_num
                    record["_meta_file"] = pdf_path
                    record["_meta_source_sha256"] = source_sha256
                    extracted.append(record)

    return extracted


def normalize_amount(raw: Optional[str]) -> Decimal:
    """Parse a currency string from a PDF cell into an exact Decimal.

    Handles thousands separators, currency symbols, and parenthesized
    negatives (credits). Raises ValueError on anything it cannot parse so the
    record is quarantined rather than silently zeroed into the recoverable pool.
    """
    if raw is None:
        raise ValueError("missing monetary value")
    cleaned = raw.strip().replace("$", "").replace(",", "")
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1].strip()  # drop only a balanced pair of parentheses
    if not cleaned:
        raise ValueError(f"empty monetary value: {raw!r}")
    try:
        value = Decimal(cleaned)
    except ArithmeticError as exc:
        # decimal.InvalidOperation is an ArithmeticError, not a ValueError.
        raise ValueError(f"unparseable monetary value: {raw!r}") from exc
    if not value.is_finite():
        raise ValueError(f"non-finite monetary value: {raw!r}")
    return -value if negative else value

Coordinate extraction alone does not guarantee financial accuracy — it guarantees the right cells were read, not that their contents make sense. Parsed values must pass strict validation before entering the reconciliation ledger, which is the subject of the next section.

Validation Rules & Edge Cases

Extraction failure modes on real vendor PDFs are specific and repetitive, and each one has a mitigation that belongs at parse time rather than being discovered during a tenant dispute. The recurring hazards:

Right-aligned amounts merging into the description column when snap_tolerance is too loose. Tighten it and assert that the amount column parses cleanly for every row before accepting the table.
Wrapped line items where one logical expense spans two visual rows. Detect a row whose amount cell is blank but whose description continues, and merge it upward before coercion.
Subtotal and tax rows captured as line items. These inflate the recoverable pool if summed blindly; filter them by label (subtotal, tax, total) and reconcile them separately as a checksum.
Parenthesized credits — (1,250.00) — that a naive parser reads as a positive or drops entirely. normalize_amount handles the sign explicitly.
Text-less (scanned) pages silently yielding zero rows. The extractor logs and skips them so they surface as a missing-page exception, not a quiet undercount.
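
Two of the mitigations above, merging wrapped rows upward and setting summary rows aside, can be sketched as plain list transformations. The column keys `description` and `amount`, the label pattern, and both function names are assumptions for illustration; real vendor headers and labels vary.

```python
import re
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]

# Naive label match for summary rows; real templates may need more patterns.
_SUMMARY_RE = re.compile(r"(subtotal|tax|total)\b", re.IGNORECASE)


def merge_wrapped_rows(
    rows: List[Row], desc_key: str = "description", amount_key: str = "amount"
) -> List[Row]:
    """Fold a row with a blank amount into the description of the row above."""
    merged: List[Row] = []
    for row in rows:
        amount = (row.get(amount_key) or "").strip()
        desc = (row.get(desc_key) or "").strip()
        if not amount and desc and merged:
            previous = dict(merged[-1])
            previous[desc_key] = f"{(previous.get(desc_key) or '').strip()} {desc}".strip()
            merged[-1] = previous
            continue
        merged.append(dict(row))
    return merged


def split_summary_rows(
    rows: List[Row], desc_key: str = "description"
) -> Tuple[List[Row], List[Row]]:
    """Separate line items from subtotal/tax/total rows kept as checksums."""
    lines: List[Row] = []
    summaries: List[Row] = []
    for row in rows:
        label = (row.get(desc_key) or "").strip()
        (summaries if _SUMMARY_RE.match(label) else lines).append(row)
    return lines, summaries
```

Merging runs before coercion so that the continuation text is attached to the expense it describes, and the summary rows are kept rather than discarded so they can serve as the checksum.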

Once cells are read correctly, the record is coerced and gated against business rules. A Pydantic model makes that schema executable; construction failure is the signal to quarantine.

from __future__ import annotations

from datetime import date
from decimal import Decimal

from pydantic import BaseModel, field_validator, model_validator

CAPITALIZATION_THRESHOLD = Decimal("5000.00")


class ExtractedLineItem(BaseModel):
    """One coerced expense line from a vendor invoice.

    Monetary fields are Decimal to preserve cent-level precision. service_date
    (when work was performed) is distinct from invoice_date so GAAP
    period-matching can assign the cost to the correct reconciliation year.
    """

    property_id: str
    invoice_number: str
    invoice_date: date
    service_date: date
    description_raw: str
    net_amount: Decimal
    tax_amount: Decimal
    is_credit: bool = False
    recon_year: int
    source_sha256: str

    @field_validator("net_amount", "tax_amount")
    @classmethod
    def finite_and_scaled(cls, v: Decimal) -> Decimal:
        if not v.is_finite():  # rejects NaN and +/-Infinity before the exponent check
            raise ValueError("amount is not finite")
        if v.as_tuple().exponent < -2:
            raise ValueError("amount carries sub-cent precision")
        return v

    @model_validator(mode="after")
    def business_rules(self) -> "ExtractedLineItem":
        if self.net_amount < 0 and not self.is_credit:
            raise ValueError("negative net amount without credit flag")
        if self.service_date.year != self.recon_year:
            raise ValueError(
                f"service_date {self.service_date} outside recon year {self.recon_year}"
            )
        return self

    @property
    def needs_capital_review(self) -> bool:
        """Amounts at or above the threshold may be capital, not expense."""
        return self.net_amount >= CAPITALIZATION_THRESHOLD

Classifying failures — rather than treating every one the same — is what lets a reviewer clear a close-window quarantine in hours instead of re-parsing documents from scratch.

| Failure mode | Trigger at extraction | Disposition |
| --- | --- | --- |
| `NO_TEXT_LAYER` | `extract_text()` returns empty on a page | Route to OCR path; flag missing page |
| `COLUMN_BLEED` | Amount cell contains non-numeric text | Quarantine; retune snap tolerance for vendor template |
| `WRAPPED_ROW` | Blank amount with continued description | Merge upward, then re-coerce |
| `SUBTOTAL_ROW` | Label matches subtotal/tax/total | Exclude from pool; use as checksum |
| `PERIOD_MISMATCH` | `service_date` outside `recon_year` | Prior-year accrual review |
| `CAP_REVIEW` | `net_amount` ≥ capitalization threshold | Hold for expense/capital decision |
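
The taxonomy can be made executable with an enumeration plus a row-level classifier for the three modes that are detectable from raw cells alone. The codes come from the table above. The `classify_row` function, the column keys, and both regular expressions are assumptions sketched for illustration.

```python
import re
from enum import Enum
from typing import Dict, Optional


class FailureMode(str, Enum):
    NO_TEXT_LAYER = "NO_TEXT_LAYER"
    COLUMN_BLEED = "COLUMN_BLEED"
    WRAPPED_ROW = "WRAPPED_ROW"
    SUBTOTAL_ROW = "SUBTOTAL_ROW"
    PERIOD_MISMATCH = "PERIOD_MISMATCH"
    CAP_REVIEW = "CAP_REVIEW"


# Illustrative patterns; vendor templates may need broader ones.
_SUMMARY_RE = re.compile(r"(subtotal|tax|total)\b", re.IGNORECASE)
_AMOUNT_RE = re.compile(r"\(?-?\$?\d[\d,]*(\.\d+)?\)?")


def classify_row(
    row: Dict[str, Optional[str]],
    desc_key: str = "description",
    amount_key: str = "amount",
) -> Optional[FailureMode]:
    """Return the failure mode visible in a raw row, or None if it looks clean."""
    desc = (row.get(desc_key) or "").strip()
    amount = (row.get(amount_key) or "").strip()
    if _SUMMARY_RE.match(desc):
        return FailureMode.SUBTOTAL_ROW
    if not amount:
        return FailureMode.WRAPPED_ROW if desc else None
    if not _AMOUNT_RE.fullmatch(amount):
        return FailureMode.COLUMN_BLEED
    return None
```

The remaining modes are raised elsewhere: `NO_TEXT_LAYER` at the page level, and `PERIOD_MISMATCH` and `CAP_REVIEW` only after the record has been coerced.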

When pdfplumber cannot resolve a table at all — heavily rasterized grids, non-vectorized borders, or exotic layouts — the pragmatic move is a hybrid extractor rather than an endless tolerance chase. That fallback, using a lattice/stream engine and pandas, is documented in parsing complex CAM invoices with Tabula and pandas.

Integration Points

The output of this stage is deliberately narrow — a list of candidate rows — but it feeds four distinct downstream consumers, and the seams between them are where reconciliation integrity is preserved.

Validation gate. Candidate rows flow straight into schema validation for parsed expense data, which coerces types, enforces cross-field rules, and diverts failures to a quarantine queue. Extraction’s contract is only that the right cells were read; validation decides whether their contents are trustworthy.

GL routing. Validated records carry a raw description like HVAC Filter Replacement or Parking Lot Sealcoating. The GL code mapping for CAM expenses layer translates those into standardized accounts (for example 5010 - HVAC Maintenance or 5045 - Common Area Utilities) using synonym dictionaries and rule-based routing, eliminating manual journal-entry adjustments.
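
A minimal sketch of synonym-dictionary routing, assuming simple case-insensitive substring matching. The two account labels are the examples named above. The keyword lists and the function `route_to_gl` are hypothetical, and a production mapping layer would likely use a richer rule set.

```python
from typing import Dict, Optional, Tuple

# Account labels come from the examples above; keyword lists are hypothetical.
GL_SYNONYMS: Dict[str, Tuple[str, ...]] = {
    "5010 - HVAC Maintenance": ("hvac", "air handler", "filter replacement"),
    "5045 - Common Area Utilities": ("electric", "water", "utility"),
}


def route_to_gl(description: str) -> Optional[str]:
    """Return the first account whose keywords appear in the description.

    Returns None for unmapped descriptions so they can be reviewed rather
    than forced into a default account.
    """
    text = description.lower()
    for account, keywords in GL_SYNONYMS.items():
        if any(keyword in text for keyword in keywords):
            return account
    return None
```

Substring matching is deliberately naive here: a keyword such as `water` would also match `stormwater`, which is the kind of collision rule-based routing has to resolve.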

Reconciliation math. Once mapped, the sum of validated line amounts forms the recoverable pool that the allocation engine distributes across tenants. The extraction stage never computes shares, but the integrity of the pro rata result is bounded by the integrity of these amounts:

$$\text{recoverable\_pool} = \sum_{i} \text{line\_amount}_i \qquad\Rightarrow\qquad \text{tenant\_share} = \frac{\text{tenant\_rsf}}{\text{total\_rsf}} \times \text{recoverable\_pool}$$
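
The two formulas translate directly into Decimal arithmetic. This is a sketch under stated assumptions: the function names are hypothetical, the sample figures are invented for illustration, and `ROUND_HALF_UP` to whole cents is an assumed convention rather than something the allocation engine is known to use.

```python
from decimal import ROUND_HALF_UP, Decimal
from typing import Iterable


def recoverable_pool(line_amounts: Iterable[Decimal]) -> Decimal:
    """Sum validated line amounts exactly."""
    return sum(line_amounts, Decimal("0.00"))


def tenant_share(tenant_rsf: Decimal, total_rsf: Decimal, pool: Decimal) -> Decimal:
    """Pro rata share of the pool, quantized to whole cents (assumed rounding)."""
    if total_rsf <= 0:
        raise ValueError("total_rsf must be positive")
    share = tenant_rsf / total_rsf * pool
    return share.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Illustrative figures only.
pool = recoverable_pool([Decimal("4200.00"), Decimal("815.50"), Decimal("-1250.00")])
share = tenant_share(Decimal("2500"), Decimal("10000"), pool)
```

With a pool of 3765.50 and a tenant occupying 2,500 of 10,000 rentable square feet, the unrounded share is 941.375, which the assumed rounding mode turns into 941.38.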

Scale and persistence. At portfolio scale, extraction is dispatched under async batch processing for high-volume invoices. Because pdfplumber is CPU-bound, it must be offloaded to a ProcessPoolExecutor — a thread pool cannot parallelize it because of the GIL:

import asyncio
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
from typing import Dict, List

_EXECUTOR = ProcessPoolExecutor()


async def process_invoice_async(pdf_path: Path) -> List[Dict[str, object]]:
    """Run one CPU-bound extraction off the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        _EXECUTOR, extract_cam_invoice_tables, str(pdf_path)
    )


async def run_batch_processing(
    invoice_dir: Path, concurrency: int = 10
) -> List[List[Dict[str, object]]]:
    """Extract a close-window batch with a fixed concurrency ceiling."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded_task(pdf_path: Path) -> List[Dict[str, object]]:
        async with semaphore:
            return await process_invoice_async(pdf_path)

    tasks = [bounded_task(f) for f in invoice_dir.glob("*.pdf")]
    return await asyncio.gather(*tasks)

Statements that run to hundreds of pages can exhaust memory when every page’s parsed objects are held at once, so large statements are processed page by page rather than materialized in full. The page-chunking strategy for statements that span dozens or hundreds of pages is detailed in handling multi-page commercial invoices in Python.

from typing import Iterator, List, Optional

import pdfplumber

Table = List[List[Optional[str]]]  # extract_tables() returns None for empty cells


def stream_extract_large_pdf(
    pdf_path: str, chunk_size: int = 50
) -> Iterator[List[Table]]:
    """Yield extracted tables in chunks of at most chunk_size tables."""
    buffer: List[Table] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                if table:
                    buffer.append(table)
                    if len(buffer) >= chunk_size:
                        yield buffer
                        buffer = []
            # Release this page's cached layout objects before moving on.
            page.flush_cache()
    if buffer:
        yield buffer

Testing & Verification

Extraction correctness is not an opinion; it is checkable against a known invoice. The verification strategy has three layers.

Golden-file fixtures. Commit a small set of representative vendor PDFs alongside a hand-verified JSON of the rows they should produce. A regression test asserts that extraction reproduces the golden rows exactly — this catches the day a library upgrade or a tolerance tweak silently shifts a column.
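
A golden-file comparison can be sketched as a helper that loads the hand-verified JSON and compares it row by row. The function name `assert_matches_golden` is hypothetical, and ignoring `_meta_file` is an assumption made here because absolute paths differ between machines.

```python
import json
from pathlib import Path
from typing import Dict, List, Sequence


def assert_matches_golden(
    rows: List[Dict[str, object]],
    golden_path: Path,
    ignore_keys: Sequence[str] = ("_meta_file",),
) -> None:
    """Fail if extracted rows differ from the hand-verified golden rows."""
    golden = json.loads(golden_path.read_text(encoding="utf-8"))

    def strip(row: Dict[str, object]) -> Dict[str, object]:
        # Drop machine-specific metadata before comparing.
        return {k: v for k, v in row.items() if k not in ignore_keys}

    actual = [strip(row) for row in rows]
    expected = [strip(row) for row in golden]
    if len(actual) != len(expected):
        raise AssertionError(
            f"row count mismatch: {len(actual)} extracted, {len(expected)} expected"
        )
    for index, (got, want) in enumerate(zip(actual, expected)):
        if got != want:
            raise AssertionError(f"row {index} differs: {got!r} != {want!r}")
```

In a regression test this would be called with the output of `extract_cam_invoice_tables` for each committed fixture PDF and the path to its golden JSON.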

Checksum reconciliation. For every extracted table, the sum of line amounts must equal the printed subtotal after credits, within a strict tolerance. Because amounts are Decimal, the tolerance is Decimal("0.00") — an exact tie-out — not a floating-point epsilon.

from decimal import Decimal
from typing import List


def assert_pool_ties_out(
    line_amounts: List[Decimal], printed_subtotal: Decimal
) -> None:
    """Fail fast if extracted line items do not sum to the invoice subtotal.

    Uses exact Decimal equality: cent-accurate reconciliation math must tie
    out to zero, not to a floating-point tolerance.
    """
    computed = sum(line_amounts, Decimal("0.00"))
    if computed != printed_subtotal:
        raise AssertionError(
            f"pool mismatch: extracted {computed} != printed {printed_subtotal}"
        )


def test_credit_line_parses_negative() -> None:
    assert normalize_amount("(1,250.00)") == Decimal("-1250.00")


def test_subtotal_ties_out() -> None:
    lines = [Decimal("4200.00"), Decimal("815.50"), Decimal("-1250.00")]
    assert_pool_ties_out(lines, Decimal("3765.50"))

Property-based edge coverage. Fuzz normalize_amount with generated currency strings — varied separators, symbols, sign conventions, and whitespace — to prove it either returns an exact Decimal or raises, and never returns a silently wrong value. This is where the parenthesized-credit and thousands-separator regressions get caught before a vendor’s month-end statement does.
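
A rough stand-in for a property-based test, using the standard library's `random` module instead of a dedicated library such as Hypothesis. It renders known cent values in varied formats and checks that a parser recovers the exact Decimal. The helpers `render_amount` and `fuzz_amount_parser` are hypothetical, and the generator only covers well-formed inputs, so it checks the round-trip property but not the raise-on-garbage one.

```python
import random
from decimal import Decimal
from typing import Callable, List, Tuple


def render_amount(cents: int, rng: random.Random) -> str:
    """Format a cent value with randomly chosen separators, symbol, and sign style."""
    value = Decimal(abs(cents)) / 100
    body = f"{value:,.2f}" if rng.random() < 0.5 else f"{value:.2f}"
    if rng.random() < 0.5:
        body = "$" + body
    if cents < 0:
        # Credits appear either parenthesized or with a leading minus.
        body = f"({body})" if rng.random() < 0.5 else "-" + body
    pad = " " * rng.randint(0, 2)
    return pad + body + pad


def fuzz_amount_parser(
    parse: Callable[[str], Decimal], trials: int = 1000, seed: int = 0
) -> List[Tuple[str, Decimal]]:
    """Return (input, expected) pairs the parser got wrong or crashed on."""
    rng = random.Random(seed)
    failures: List[Tuple[str, Decimal]] = []
    for _ in range(trials):
        cents = rng.randint(-10_000_000, 10_000_000)
        text = render_amount(cents, rng)
        expected = Decimal(cents) / 100
        try:
            ok = parse(text) == expected
        except Exception:
            ok = False
        if not ok:
            failures.append((text, expected))
    return failures
```

Calling `fuzz_amount_parser(normalize_amount)` and asserting that the result is empty turns the sketch into a regression test. The fixed seed keeps any counterexample reproducible.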

Frequently Asked Questions

When should I use pdfplumber instead of an OCR engine? Use pdfplumber for digitally generated (vector) PDFs, which carry an extractable text layer and precise glyph coordinates — the format most national property-service vendors issue. Reserve OCR for scanned or photographed receipts that have no text layer. Mixing them wastes cycles: OCR on a clean vector PDF is slower and less accurate than reading its embedded text directly.

Why does the parser leave amounts as raw strings instead of numbers? Separation of concerns. Extraction’s only guarantee is that the correct cells were read; coercing to Decimal and enforcing business rules is the validation gate’s job. Keeping candidate rows untyped means a malformed amount is quarantined with full context instead of being silently zeroed or crashing the parse.

How do I stop right-aligned totals from bleeding into the description column? Tighten snap_tolerance and the intersection tolerances so nearly collinear ruled lines are not merged, then assert that every row’s amount column parses cleanly before accepting the table. If a vendor template still bleeds, treat it as a COLUMN_BLEED quarantine and retune tolerances for that template rather than loosening them globally.

What happens when pdfplumber cannot detect the table at all? Fall back to a hybrid lattice/stream extractor with pandas rather than chasing tolerances indefinitely. Heavily rasterized borders and non-vectorized grids are exactly the case the Tabula-and-pandas approach was built for.

Do I need Decimal if the invoice amounts look like clean two-decimal numbers? Yes. The risk is not the individual value but the sum: adding hundreds of float cent-values accumulates drift that makes the recoverable pool fail to tie out to the printed subtotal. Decimal stores base-10 values exactly and keeps the pool penny-accurate and auditable.

From Extraction to Reconciliation

Deterministic PDF extraction is the backbone of accurate CAM reconciliation: it decides, at the earliest possible point, whether every downstream number rests on the exact cells and exact cents of the source document. By combining geometry-driven table detection, decimal-precise money handling, explicit failure taxonomies, and streaming batch processing, property management teams eliminate reconciliation drift, shorten month-end close, and keep an audit-ready trail from PDF to recoverable pool. From here the candidate rows move into schema validation for parsed expense data, get routed by GL code mapping for CAM expenses, and scale under async batch processing for high-volume invoices — the rest of the Automated Invoice Parsing & Data Ingestion pipeline this stage feeds.

Related

Handling multi-page commercial invoices in Python — page-chunking and streaming strategies for statements that span dozens of pages.
Parsing complex CAM invoices with Tabula and pandas — the hybrid fallback when pdfplumber cannot resolve a rasterized or borderless table.
Schema validation for parsed expense data — the Pydantic gate and quarantine queue that judge these candidate rows.
GL code mapping for CAM expenses — routing normalized descriptions to the chart of accounts.
Async batch processing for high-volume invoices — running extraction concurrently across a close-window batch.
