Handling Multi-Page Commercial Invoices in Python

A single facilities vendor statement can run twenty or thirty pages, with one logical CAM (Common Area Maintenance) charge table broken across page boundaries, repeated column headers on every continuation sheet, and a “Page 3 of 12” footer that a naive parser happily ingests as a line item. This page tackles that one specific edge case — stitching a multi-page commercial invoice into a single, penny-accurate record — as a focused extension of coordinate-aware table extraction with pdfplumber, the extraction stage of the broader Automated Invoice Parsing & Data Ingestion pipeline. Get the page-stitching wrong and you either double-count a repeated header row or drop the continuation lines that carry half the recoverable expense, and both defects survive silently until a tenant disputes the year-end statement.

The extraction loop streams the statement one page at a time: pdfplumber loads a single page, its rows are yielded to the stitcher, then the page is closed and freed before the next loads — so a 30-page portfolio scan never holds more than one page of objects in memory.

Context & When to Use This Approach

Reach for a stateful, cross-page parser the moment a vendor’s PDF stops being one self-contained table per document. The concrete triggers are specific and recognizable in a CRE feed:

A janitorial or landscaping vendor bills a whole portfolio on one statement, and the line-item table wraps from page 1 onto pages 2 through 5, repeating its Description | Qty | Unit | Amount header on each sheet.
The invoice carries a running “Subtotal carried forward” value at each page break, and the true grand total appears only on the final page.
Structural noise (“Page X of Y” footers, remit-to boilerplate, and stray page numbers) sits inside the same coordinate band the table detector picks up, so it leaks into the extracted rows.
The file is large enough (multi-megabyte scans of many properties) that loading every page’s objects into memory at once during a month-end close batch risks garbage-collection stalls or an out-of-memory kill.

When a document is genuinely one page, or one clean table per page with no logical continuation, you do not need any of this — treat each page independently and move on. The techniques below are specifically for the case where several physical pages compose one logical invoice whose totals must reconcile end to end. For scanned or rotated tables the isolation strategy differs; that belongs to parsing complex CAM invoices with Tabula and Pandas rather than here.

Assembling one logical invoice from N physical pages: the column header is fingerprinted once on page 1 and skipped on every continuation sheet, real line rows accumulate in a single running buffer with a Decimal subtotal, and the grand total declared on the final page is reconciled against that subtotal within a one-cent tolerance.

Step-by-Step Implementation

The core pattern is a generator that streams one page at a time, a header fingerprint that lets continuation pages skip their repeated headers, and a stitcher that accumulates rows into one invoice while reconciling the carried subtotal. Every monetary value is parsed through Python’s decimal module — never float — so that summing hundreds of line rows across pages does not drift the invoice out of tolerance for reasons unrelated to the data.
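As a minimal illustration of the float problem (not part of the pipeline; the amounts are made up), the sketch below sums ten ten-cent lines both ways. The float error is tiny, but it is enough to break an exact comparison against a printed total.

```python
from decimal import Decimal

# Ten hypothetical line amounts of $0.10 each.
float_total = sum([0.10] * 10)
decimal_total = sum([Decimal("0.10")] * 10, Decimal("0.00"))

# 0.10 has no exact binary representation, so the float sum misses 1.0.
float_matches = float_total == 1.0

# Decimal stores exact base-ten cents, so the sum is exactly 1.00.
decimal_matches = decimal_total == Decimal("1.00")
```

Starting the Decimal sum from Decimal("0.00") keeps the result in Decimal throughout, rather than beginning from the integer 0 that sum() uses by default.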

Step 1 — Stream pages instead of materializing the whole document

Open the PDF once and yield structured rows page by page. pdfplumber parses a page’s objects lazily, on first access, and then caches them on the page object, so that cache has to be released once the rows are extracted (pdfplumber documents page.close() for this) to keep peak memory bounded to roughly one page rather than the whole statement. This is what keeps a 30-page portfolio scan from crowding out the rest of a batch worker’s heap.

from __future__ import annotations

from dataclasses import dataclass, field
from decimal import Decimal, InvalidOperation
from typing import Iterator, Optional

import pdfplumber

# Header cells that mark the start of the line-item table on any page.
HEADER_FINGERPRINT = ("description", "qty", "unit", "amount")


@dataclass
class RawPage:
    """Rows extracted from one physical page, before stitching."""

    page_number: int
    rows: list[list[str]]


def iter_pages(pdf_path: str) -> Iterator[RawPage]:
    """Yield one page's table rows at a time, bounding memory to a single page.

    Each yielded RawPage is fully self-contained, so a downstream stitcher can
    hold only a rolling buffer of line items rather than the whole document.
    """
    with pdfplumber.open(pdf_path) as pdf:
        for index, page in enumerate(pdf.pages, start=1):
            table = page.extract_table() or []
            # Normalize None cells (empty pdfplumber cells) to empty strings.
            rows = [[(cell or "").strip() for cell in row] for row in table]
            # Release this page's cached layout objects before moving on;
            # otherwise every visited page stays cached until the PDF closes.
            page.close()
            yield RawPage(page_number=index, rows=rows)

Step 2 — Fingerprint the header so continuation pages skip it

A multi-page vendor table repeats its column header on every sheet. Detect that header once by matching its cells against a known fingerprint, then drop any row on a later page that matches the same fingerprint. This is the single check that prevents the classic double-count, where the header Description | Qty | Unit | Amount is ingested as a phantom expense line on pages 2 onward.

def is_header_row(row: list[str]) -> bool:
    """True when a row is a repeated column header, not a real expense line."""
    lowered = [cell.lower() for cell in row]
    return all(token in " ".join(lowered) for token in HEADER_FINGERPRINT)


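import re

# Matches "Page 3 of 12" footers without catching a real description that
# merely contains "of", such as "Replacement of HVAC filters".
PAGE_MARKER = re.compile(r"\bpage\s+\d+\s+of\s+\d+\b")


def is_structural_noise(row: list[str]) -> bool:
    """Filter page footers, 'Page X of Y' markers, and carried-forward rows."""
    joined = " ".join(row).lower()
    if not joined.strip() or PAGE_MARKER.search(joined):
        return True
    noise_markers = ("carried forward", "continued", "remit to")
    return any(marker in joined for marker in noise_markers)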

Step 3 — Parse money defensively into two-place Decimals

Continuation pages are where OCR-adjacent noise creeps in: a currency symbol here, a stray thousands separator there, a subtotal marker fused to an amount. Coerce every monetary cell through one function so the arithmetic that reconciles the invoice never sees a raw string or a binary float.

CENT = Decimal("0.01")


def parse_money(cell: str) -> Optional[Decimal]:
    """Parse a currency cell to a two-place Decimal, or None if it is not money.

    Strips $ and thousands separators; returns None for blank or non-numeric
    cells so the caller can distinguish 'no amount here' from 'zero dollars'.
    """
    cleaned = cell.replace("$", "").replace(",", "").strip()
    if not cleaned:
        return None
    try:
        return Decimal(cleaned).quantize(CENT)
    except InvalidOperation:
        return None
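
One case parse_money does not cover is a credit printed in accounting notation: "(125.00)" fails Decimal parsing and returns None, so the stitcher would skip the row as non-money. The sketch below is a hypothetical variant, here called parse_signed_money, that accepts parenthesized and trailing-minus credits. Whether a given vendor uses either convention is an assumption to check against real statements.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional

CENT = Decimal("0.01")


def parse_signed_money(cell: str) -> Optional[Decimal]:
    """Hypothetical variant of parse_money that also accepts credit notation."""
    cleaned = cell.replace("$", "").replace(",", "").strip()
    negative = False
    if cleaned.startswith("(") and cleaned.endswith(")"):
        # Accounting style: (125.00) means a credit of 125.00.
        negative, cleaned = True, cleaned[1:-1].strip()
    elif cleaned.endswith("-"):
        # Trailing-minus style: 125.00- means a credit of 125.00.
        negative, cleaned = True, cleaned[:-1].strip()
    if not cleaned:
        return None
    try:
        value = Decimal(cleaned)
        if not value.is_finite():
            return None  # Decimal() accepts "NaN" and "Infinity"; money does not
        value = value.quantize(CENT)
    except InvalidOperation:
        return None
    return -value if negative else value
```

With credits carried as negative Decimals, the running subtotal nets them out without any special casing in the stitcher.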

Step 4 — Stitch pages into one logical invoice

Now walk the page stream, appending real expense rows to a single buffer while skipping headers and noise. Track the running subtotal in Decimal so it can be reconciled against whatever grand total the final page declares. The stitcher holds only the accumulated line items and a single scalar subtotal — never more than one page of raw objects at a time.

@dataclass
class LineItem:
    description: str
    amount: Decimal


@dataclass
class StitchedInvoice:
    line_items: list[LineItem] = field(default_factory=list)
    running_subtotal: Decimal = Decimal("0.00")
    pages_seen: int = 0


def stitch_invoice(pdf_path: str) -> StitchedInvoice:
    """Assemble one logical invoice from all physical pages of a PDF."""
    invoice = StitchedInvoice()
    for raw in iter_pages(pdf_path):
        invoice.pages_seen += 1
        for row in raw.rows:
            if is_header_row(row) or is_structural_noise(row):
                continue
            amount = parse_money(row[-1]) if row else None
            if amount is None:
                continue  # not an expense line (label row, spacer, etc.)
            description = row[0]
            invoice.line_items.append(LineItem(description, amount))
            invoice.running_subtotal += amount
    return invoice

Step 5 — Reconcile the stitched total against the declared grand total

The final page of a well-formed statement declares a grand total. Extract it and confirm the sum of the stitched line items matches within a one-cent tolerance — the same reconciliation identity the pipeline enforces in schema validation for parsed expense data, applied here at the extraction seam where cross-page defects actually originate.

def reconcile(invoice: StitchedInvoice, declared_total: Decimal) -> bool:
    """True when the stitched line items reconcile to the invoice's grand total."""
    return abs(invoice.running_subtotal - declared_total) <= CENT
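
The step above assumes a declared_total is already in hand. One way to obtain it, sketched here under the assumption that the final page labels the figure "Grand Total" or "Total Due", is a regular expression over the page text. The function name find_declared_total and the label list are illustrative, not part of the pipeline; real vendor wording varies.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Assumed labels; extend for vendors that print "Amount Due", "Balance Due", etc.
TOTAL_PATTERN = re.compile(
    r"(?:grand\s+total|total\s+due)\s*:?\s*\$?\s*([\d,]+\.\d{2})",
    re.IGNORECASE,
)


def find_declared_total(page_text: str) -> Optional[Decimal]:
    """Return the last labeled total found in a page's text, or None."""
    matches = TOTAL_PATTERN.findall(page_text)
    if not matches:
        return None
    try:
        # Take the last match: a grand total normally follows any earlier figure.
        return Decimal(matches[-1].replace(",", ""))
    except InvalidOperation:
        return None
```

With pdfplumber, the input would typically be the final page's extract_text() result, guarded with `or ""` in case nothing is returned. If the grand-total row sits inside the extracted table itself, it also needs to be filtered out of the stitched rows so it is not summed as a line item.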

Gotchas & Known Limitations

Multi-page stitching fails in predictable, vendor-specific ways. Treat this as a pre-flight checklist before trusting a parser against a new statement template:

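Repeated headers counted as line items. If the header fingerprint is too loose (or absent), page 2’s header row becomes a phantom charge. Fingerprint on the exact column set, and assert that the header appears exactly once per table-bearing page; cover sheets and remit-to pages with no table contribute none, so the expected count can be lower than pages_seen.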
“Carried forward” subtotals summed twice. Vendors often print a running subtotal at each page break. If parse_money sees that value in the amount column, it is added on top of the lines it already summarizes. Filter carried-forward rows explicitly, as is_structural_noise does — do not rely on position alone.
extract_table() returns None on a text-only page. Cover sheets and remit-to pages have no table; the or [] guard prevents a TypeError, but confirm those pages contribute zero line items rather than being silently skipped when they do hold data.
Column drift between pages. A vendor may shift a column by one on later pages. Reading row[-1] for the amount is more robust than a fixed index, but validate that the last cell is actually money via parse_money returning non-None, not by trusting position.
A single logical row split across a page break. A long description can wrap so the amount lands on the next physical page. Detect an expense row whose amount cell is empty and buffer its description until the continuation supplies the figure.
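float anywhere in the money path. A single float(cell) upstream reintroduces binary representation error: amounts such as 0.10 have no exact binary form, so summed rows stop comparing equal to the printed total, and a difference of exactly one cent can land on the wrong side of the tolerance check. Keep the entire amount path in Decimal.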
Unbounded concurrency on large batches. Streaming bounds one file’s memory, but opening hundreds of multi-page PDFs at once does not. Cap concurrency with the bounded worker pool from async batch processing for high-volume invoices.
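
The split-row case in the checklist above can be handled with a small buffering pass between filtering and summing. This is a sketch under a strong assumption: headers, noise, and label rows have already been removed, so any row with an empty last cell is the head of a wrapped line item and the very next row carries its amount. The name merge_split_rows is hypothetical.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def merge_split_rows(rows: Iterable[list[str]]) -> Iterator[list[str]]:
    """Join a description-only row with the next row that carries its amount.

    Assumption: headers, noise, and label rows were already filtered out, so a
    row with an empty last cell can only be the head of a wrapped line item.
    """
    pending = ""  # description text still waiting for its amount
    for row in rows:
        if len(row) < 2:
            continue  # too short to hold both a description and an amount
        description = f"{pending} {row[0]}".strip()
        if row[-1].strip():
            yield [description, *row[1:]]
            pending = ""
        else:
            pending = description
    if pending:
        # The amount never arrived: surface it instead of dropping the line.
        raise ValueError(f"no amount found for wrapped row: {pending!r}")
```

Raising on a dangling description is a deliberate choice in this sketch: a line item with no amount usually means a page is missing, and failing loudly is cheaper than a statement that silently under-recovers.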

Verification

Confirm correctness by asserting on the two properties that page-stitching most often violates: the header count and the reconciliation identity. Build fixtures from real multi-page vendor PDFs whose grand total you know, then prove that stitching reproduces it exactly.

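from decimal import Decimal

# stitch_invoice and reconcile are the functions defined in Steps 4 and 5;
# import them from wherever your project keeps them.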


def test_stitched_invoice_reconciles_to_grand_total() -> None:
    invoice = stitch_invoice("fixtures/janitorial_5page.pdf")
    # Grand total printed on the final page of the source statement.
    assert reconcile(invoice, declared_total=Decimal("18452.16"))
    assert invoice.pages_seen == 5


def test_repeated_header_is_not_counted_as_a_line() -> None:
    invoice = stitch_invoice("fixtures/landscaping_3page.pdf")
    # The header text must never appear as an extracted expense description.
    assert all("description" not in li.description.lower() for li in invoice.line_items)


def test_carried_forward_subtotal_is_excluded() -> None:
    invoice = stitch_invoice("fixtures/hvac_4page_carryforward.pdf")
    # If a carried-forward subtotal leaked in, the total would roughly double.
    assert reconcile(invoice, declared_total=Decimal("9310.00"))
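
The fixture tests above need real PDFs. The filtering and summing logic can also be exercised without one by feeding rows directly. The sketch below is a compact stand-in, not the pipeline's stitcher: stitch_rows, its exact-match header check, and the sample rows are all illustrative assumptions. It also counts header rows, which makes the header-count check from the checklist testable.

```python
from __future__ import annotations

import re
from decimal import Decimal, InvalidOperation

HEADER = ("description", "qty", "unit", "amount")
NOISE = re.compile(r"page\s+\d+\s+of\s+\d+|carried forward|continued|remit to")


def stitch_rows(pages: list[list[list[str]]]) -> tuple[Decimal, int]:
    """Sum expense rows across in-memory pages; also count header rows seen."""
    subtotal, headers_seen = Decimal("0.00"), 0
    for rows in pages:
        for row in rows:
            cells = tuple(cell.strip().lower() for cell in row)
            if cells == HEADER:
                headers_seen += 1
                continue
            if NOISE.search(" ".join(cells)):
                continue
            try:
                amount = Decimal(cells[-1].replace("$", "").replace(",", ""))
            except (InvalidOperation, IndexError):
                continue  # blank or non-money last cell: not an expense line
            if not amount.is_finite():
                continue  # ignore "nan" / "infinity" text in the amount cell
            subtotal += amount.quantize(Decimal("0.01"))
    return subtotal, headers_seen


# Made-up two-page statement with a carried-forward row and a page footer.
pages = [
    [["Description", "Qty", "Unit", "Amount"],
     ["Nightly janitorial", "22", "day", "$4,400.00"],
     ["Subtotal carried forward", "", "", "4,400.00"]],
    [["Description", "Qty", "Unit", "Amount"],
     ["Replacement of lobby mats", "3", "ea", "285.50"],
     ["Page 2 of 2", "", "", ""]],
]
subtotal, headers_seen = stitch_rows(pages)
```

The second page includes a description containing "of" on purpose: a noise filter that matches the bare word would drop that line and the subtotal would come up 285.50 short.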

The reconciliation assertion is the one that matters most: a stitched invoice that ties to its printed grand total, page count and all, is one you can hand to the next stage with confidence. Any mismatch localizes the defect to extraction rather than letting it masquerade as an allocation error three stages downstream, where the pro rata allocation algorithms would spread a phantom charge across every tenant. A rising reconciliation-failure rate for one vendor is an early signal that their multi-page template changed — cheaper to catch here than in a failed year-end tie-out.

Once a stitched invoice reconciles, hand it on to schema validation for parsed expense data for typed schema enforcement; if your statements arrive as scans with rotated or borderless tables, continue with parsing complex CAM invoices with Tabula and Pandas.

Related

Coordinate-aware table extraction with pdfplumber: the parent extraction stage this page extends for the multi-page case.
Parsing complex CAM invoices with Tabula and Pandas: the sibling approach for scanned, rotated, or borderless tables.
Async batch processing for high-volume invoices: bounds concurrency when stitching hundreds of multi-page PDFs in a single close window.
