Async Batch Processing for High-Volume Invoices

A single mid-size commercial portfolio can receive several thousand vendor invoices in a reconciliation month, and every one of them has to be extracted, validated, coded, and posted before a CAM statement can close. Processing those documents one at a time — open PDF, parse, wait on the database, repeat — turns a mechanical task into the critical-path bottleneck of the entire close. This is the concurrency layer of the Automated Invoice Parsing & Data Ingestion pipeline: it takes the same per-document logic used elsewhere in the ingestion stack and runs it in parallel without corrupting tenant-level allocations, exhausting memory, or dropping documents on the floor when a vendor’s PDF is malformed. The goal is deterministic, auditable throughput — not raw speed for its own sake, but a batch that finishes inside the close window and can prove exactly what it did to each document.

Prerequisites & Data Contracts

Async batch processing is an orchestration layer, not a parser in its own right. It assumes the per-document building blocks already exist and are individually correct; concurrency only multiplies whatever behavior those blocks have. Before wiring up a batch runner, three upstream contracts must be in place.

First, a deterministic single-document parser. The batch layer calls the coordinate-aware extraction described in PDF invoice extraction with Python and pdfplumber — or the Tabula-and-pandas path in parsing complex CAM invoices with Tabula and pandas — as a pure function of one file path. If a parser mutates shared state or writes to a fixed temp path, running two copies concurrently will race; the batch layer cannot fix that, only expose it.

Second, a validated record schema. Every document that survives parsing must resolve to the same typed structure enforced by schema validation for parsed expense data. The batch runner treats validation as a hard gate: a record either conforms to the contract and continues, or it is routed to a quarantine path. It never posts an unvalidated payload.

Third, an idempotency key per document. Because retries and at-least-once queue delivery mean the same invoice can be handed to a worker more than once, each document needs a stable key — typically a SHA-256 of the source bytes combined with the vendor and invoice number — so that duplicate processing collapses to a single ledger posting. Downstream, GL code mapping for CAM expenses and the reconciliation engine both rely on that key to deduplicate.
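
One way such a key might be derived is sketched below. The helper name make_idempotency_key, the ":" separator, and the lowercase/strip normalization are assumptions made for illustration; the source only states that the key combines a SHA-256 of the bytes with the vendor and invoice number.

```python
import hashlib


def make_idempotency_key(source_bytes: bytes, vendor: str, invoice_number: str) -> str:
    """Build a stable key from the raw PDF bytes plus vendor and invoice number.

    Hypothetical helper: the normalization and separator are assumptions.
    """
    content_hash = hashlib.sha256(source_bytes).hexdigest()
    # Normalize the text fields so "ACME " and "acme" map to the same key.
    vendor_part = vendor.strip().lower()
    invoice_part = invoice_number.strip().lower()
    return f"{content_hash}:{vendor_part}:{invoice_part}"
```

Because the hash covers the source bytes, a re-delivered copy of the same file yields the same key, while a corrected reissue of the invoice (different bytes) yields a new one.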

The data contract the batch layer moves through the system looks like this:

from __future__ import annotations

from dataclasses import dataclass
from decimal import Decimal
from datetime import date
from enum import Enum


class DocStatus(str, Enum):
    PENDING = "PENDING"
    PARSED = "PARSED"
    VALIDATED = "VALIDATED"
    POSTED = "POSTED"
    QUARANTINED = "QUARANTINED"


@dataclass(frozen=True)
class BatchDocument:
    """One unit of work flowing through the async batch runner."""
    idempotency_key: str          # sha256(source_bytes) + vendor + invoice_number
    source_path: str              # local or object-store path to the raw PDF
    property_id: str              # which asset this invoice belongs to
    service_period: date          # GAAP period-matching, not the billed date
    status: DocStatus = DocStatus.PENDING

Note that money never appears as float anywhere in this pipeline. Line-item amounts are carried as Decimal from the moment they leave the parser, because summing thousands of float cent values drifts the recoverable pool by fractions of a cent — enough, across a portfolio, to make a reconciliation fail to tie out.
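
The drift is easy to reproduce on a tiny scale. The snippet below is only an illustration with made-up amounts: it accumulates ten charges of 0.10 as float and as Decimal and compares each total with the expected value.

```python
from decimal import Decimal

# Accumulate ten charges of 0.10 with binary floats.
float_total = 0.0
for _ in range(10):
    float_total += 0.10

# Accumulate the same ten charges with Decimal.
decimal_total = Decimal("0.00")
for _ in range(10):
    decimal_total += Decimal("0.10")

print(float_total == 1.0)                 # False: 0.10 has no exact binary representation
print(decimal_total == Decimal("1.00"))   # True: Decimal addition of cent values is exact
```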

Concurrency Model & Throughput Design

The core design decision is how many documents to process at once. Too few and the batch never finishes inside the close window; too many and workers contend for CPU, exhaust memory, or overwhelm the database and ERP with connections. The right number is a sizing problem, not a guess.

Invoice processing is a mix of two workloads with opposite optimal strategies. PDF parsing is CPU-bound — pdfplumber interprets PDF drawing operators in pure Python, so it is limited by cores and blocked by the GIL. Database writes and ERP posting are I/O-bound — they spend almost all their wall-clock time waiting on the network. A single concurrency limit cannot serve both, which is why the architecture separates them: a ProcessPoolExecutor sized to the CPU count for parsing, and an asyncio event loop with a much larger semaphore for I/O.

For the I/O tier, the useful relationship is Little’s Law, which ties the average number of in-flight requests to arrival rate and latency:

$$L = \lambda \times W$$

where $L$ is the number of concurrent requests, $\lambda$ is the target throughput in documents per second, and $W$ is the average per-document latency in seconds. To hit a target throughput given a known posting latency, size the semaphore to:

$$N_{\text{io}} = \left\lceil \lambda_{\text{target}} \times W_{\text{post}} \right\rceil$$

If year-end volume demands 40 postings per second and each ERP write averages 250 ms, roughly $\lceil 40 \times 0.25 \rceil = 10$ concurrent connections keep the pipeline saturated without piling up. The CPU tier is sized independently to the number of physical cores, because adding parse workers beyond that only causes context-switch thrashing:

$$N_{\text{cpu}} = \min\!\left(\text{cores},\ \left\lceil \frac{\text{batch size}}{\text{docs per worker}} \right\rceil\right)$$

Memory is the third constraint and the one that silently kills long batches. Peak resident memory scales with the number of documents parsed simultaneously, not the batch size, so the process-pool width is also a memory budget:

$$\text{peak memory} \approx N_{\text{cpu}} \times \text{avg parsed doc size}$$

A 40-page utility statement can hold tens of megabytes of extracted drawing objects in flight, and eight parallel copies of that is a real memory spike. Fanning the whole batch out with a naive asyncio.gather removes any deliberate ceiling on how much work is in flight at once; bounding concurrency with a semaphore, together with the pool width, is what keeps that spike flat.
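
The three sizing formulas above translate directly into a few lines of arithmetic. This is a sketch only: the function names are hypothetical, and the 30 MB per-document figure in the usage lines is an assumed stand-in for "tens of megabytes".

```python
import math


def io_concurrency(target_per_sec: float, post_latency_sec: float) -> int:
    """Little's Law: in-flight requests = throughput x latency, rounded up."""
    return math.ceil(target_per_sec * post_latency_sec)


def cpu_concurrency(cores: int, batch_size: int, docs_per_worker: int) -> int:
    """Never more parse workers than cores, and never more than the batch can fill."""
    return min(cores, math.ceil(batch_size / docs_per_worker))


def peak_parse_memory_mb(n_cpu: int, avg_parsed_doc_mb: float) -> float:
    """Approximate peak memory: one parsed document resident per active worker."""
    return n_cpu * avg_parsed_doc_mb


n_io = io_concurrency(40, 0.25)              # the worked example from the text
n_cpu = cpu_concurrency(8, 3000, 100)        # 8 cores, 3,000 documents
budget_mb = peak_parse_memory_mb(n_cpu, 30.0)  # assumed 30 MB per parsed document
```

Note that the memory estimate depends on n_cpu and not on the batch size, which is the property the load test later in this article checks.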

Python Implementation

The runner has three cooperating pieces: a semaphore-bounded dispatcher on the event loop, a process pool for the CPU-bound parse, and a single async path that carries each document from parse through validation to posting. The dispatcher never blocks the loop — CPU work is offloaded with run_in_executor, and the semaphore caps how many documents are ever resident at once.

import asyncio
from concurrent.futures import ProcessPoolExecutor
from decimal import Decimal
from typing import Sequence

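# With no arguments the pool defaults to the machine's logical CPU count;
# pass max_workers to pin it to physical cores. The width is also the
# memory budget for parallel parses.
_PARSE_POOL = ProcessPoolExecutor()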


async def process_document(
    doc: BatchDocument,
    parse_gate: asyncio.Semaphore,
    post_gate: asyncio.Semaphore,
) -> BatchResult:
    """Carry one invoice from raw PDF to posted GL entry.

    parse_gate bounds CPU-bound parsing (memory budget); post_gate
    bounds I/O-bound ERP writes (connection budget). Both are honored
    per document so a single bad file never aborts the batch.
    """
    loop = asyncio.get_running_loop()
    try:
        # CPU-bound: offload to the process pool under the parse budget.
        async with parse_gate:
            line_items = await loop.run_in_executor(
                _PARSE_POOL, extract_line_items_sync, doc.source_path
            )

        # In-process, cheap: enforce the data contract before any write.
        record = validate_invoice(doc, line_items)  # raises on contract breach

        # I/O-bound: post under the connection budget, keyed for idempotency.
        async with post_gate:
            await post_to_erp(record, idempotency_key=doc.idempotency_key)

        return BatchResult(doc.idempotency_key, DocStatus.POSTED, error=None)

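    except ValidationError as exc:
        # Contract breach: quarantine, never post, keep the batch alive.
        return BatchResult(doc.idempotency_key, DocStatus.QUARANTINED, error=str(exc))
    except Exception as exc:
        # Unparseable PDF or a post that exhausted its retries: record the
        # failure as a result so it cannot propagate out of gather().
        return BatchResult(doc.idempotency_key, DocStatus.QUARANTINED, error=repr(exc))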


async def run_batch(
    docs: Sequence[BatchDocument],
    parse_concurrency: int = 4,
    post_concurrency: int = 10,
) -> list[BatchResult]:
    """Dispatch a whole batch under bounded concurrency.

    return_exceptions is unnecessary because process_document already
    converts per-document failure into a QUARANTINED result, so one
    malformed vendor PDF cannot surface as an exception that discards
    its siblings' results.
    """
    parse_gate = asyncio.Semaphore(parse_concurrency)
    post_gate = asyncio.Semaphore(post_concurrency)
    tasks = [
        process_document(doc, parse_gate, post_gate)
        for doc in docs
    ]
    return await asyncio.gather(*tasks)
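
The listing above returns BatchResult objects without showing their definition. A minimal sketch consistent with how the listing constructs them follows; the field layout is inferred from the call sites, and the summarize helper is a hypothetical addition for reporting. DocStatus is repeated here only so the block runs on its own.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class DocStatus(str, Enum):
    PENDING = "PENDING"
    PARSED = "PARSED"
    VALIDATED = "VALIDATED"
    POSTED = "POSTED"
    QUARANTINED = "QUARANTINED"


@dataclass(frozen=True)
class BatchResult:
    """Outcome of one document; shape inferred from the runner's call sites."""
    idempotency_key: str
    status: DocStatus
    error: Optional[str] = None


def summarize(results: Sequence[BatchResult]) -> dict:
    """Count results per status so an operator sees partial success at a glance."""
    return dict(Counter(r.status.value for r in results))
```

A summary such as {"POSTED": 2998, "QUARANTINED": 2} is the kind of output an operator would review after a run, rather than a stack trace.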

The CPU-bound extraction that the pool executes is an ordinary synchronous function — it must be, because a process pool pickles and runs plain callables, not coroutines. Keeping it pure and path-based (rather than passing in-memory bytes) avoids duplicating large buffers across the process boundary:

import pdfplumber
from decimal import Decimal


def extract_line_items_sync(pdf_path: str) -> list[LineItemRaw]:
    """CPU-bound extraction, executed inside a ProcessPoolExecutor worker.

    Runs in its own process to sidestep the GIL; opens the file by path
    so the large parsed page objects live and die inside this worker
    rather than crossing back to the event loop.
    """
    items: list[LineItemRaw] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for row in page.extract_table() or []:
                if row and _is_expense_row(row):
                    items.append(
                        LineItemRaw(
                            description=str(row[0]).strip(),
                            amount=Decimal(_clean_money(row[-1])),
                        )
                    )
            page.flush_cache()  # release per-page objects immediately
    return items

Streaming at the queue layer keeps whole documents out of the event-loop process. aiofiles reads source PDFs in fixed chunks when staging them into the broker, so the loop process never holds a full multi-megabyte document in memory. The actual parse happens later, in a pool worker, against the staged path:

import aiofiles
from typing import AsyncIterator


async def stream_to_broker(
    file_path: str, chunk_size: int = 8192
) -> AsyncIterator[bytes]:
    """Yield raw PDF bytes in fixed chunks for queue-level streaming.

    Decouples ingestion from parsing: the event loop stages bytes without
    ever materializing the whole document, keeping loop-process memory flat.
    """
    async with aiofiles.open(file_path, mode="rb") as fh:
        while chunk := await fh.read(chunk_size):
            yield chunk

Validation Rules & Edge Cases

Concurrency turns rare, tolerable single-document failures into systematic ones, so the batch layer has to defend against a specific set of failure modes that only appear at volume.

A single malformed PDF must not abort the batch. The most common beginner mistake is asyncio.gather(*tasks) where any coroutine can raise: the first exception propagates straight to the caller, the results of every other document are discarded, and the siblings still in flight are left running unobserved (or are cancelled when the event loop shuts down), so a 3,000-document batch effectively dies on document 1,742. The implementation above avoids this by catching failures inside process_document and converting them to a QUARANTINED result, so partial success is the normal, expected outcome: the operator reviews the quarantine set, not a stack trace.
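
The isolation pattern can be shown in a self-contained form with stubs. Everything here is illustrative: parse_stub stands in for the real parse, and the plain tuples stand in for BatchResult.

```python
import asyncio
from typing import Optional


async def parse_stub(name: str) -> str:
    """Stand-in for the real parse; fails on one deliberately bad file name."""
    await asyncio.sleep(0)
    if name == "corrupt.pdf":
        raise ValueError("unreadable PDF")
    return name.upper()


async def guarded(name: str) -> tuple:
    """Convert any failure into a (name, status, error) result value."""
    error: Optional[str] = None
    try:
        await parse_stub(name)
        status = "POSTED"
    except Exception as exc:
        status, error = "QUARANTINED", str(exc)
    return (name, status, error)


async def demo() -> list:
    names = ["a.pdf", "corrupt.pdf", "b.pdf"]
    # No coroutine can raise, so gather returns one result per document,
    # in input order.
    return await asyncio.gather(*(guarded(n) for n in names))


results = asyncio.run(demo())
```

Because the failure is a value rather than an exception, the caller always receives a result for every document it submitted.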

Transient failures need bounded retries with backoff — permanent ones do not. A network blip posting to the ERP should retry; a schema violation should not, because retrying it will fail identically forever and only clog the queue. Separate the two by exception type and back off exponentially so a struggling ERP is not hammered by a thundering herd of retries:

import tenacity


@tenacity.retry(
    stop=tenacity.stop_after_attempt(4),
    wait=tenacity.wait_exponential(multiplier=1, min=2, max=20)
        + tenacity.wait_random(0, 1),          # jitter avoids synchronized retries
    retry=tenacity.retry_if_exception_type((ConnectionError, TimeoutError)),
    reraise=True,
)
async def post_to_erp(record: ParsedInvoice, *, idempotency_key: str) -> None:
    """Post one validated invoice; retries transient I/O only.

    The idempotency_key makes at-least-once delivery safe: a retry after
    a partially-applied write collapses to the same single GL posting.
    """
    await _async_http_post(
        "/api/v1/gl/posting",
        payload=record.model_dump(mode="json"),
        headers={"Idempotency-Key": idempotency_key},
    )
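
For readers who want to see the mechanics the decorator hides, here is a hand-rolled sketch of the same idea using only the standard library. It is an illustration, not tenacity's implementation and not its exact wait schedule; the name retry_transient and its parameters are hypothetical.

```python
import asyncio
import random


async def retry_transient(call, *, attempts: int = 4,
                          base_delay: float = 2.0, max_delay: float = 20.0):
    """Retry `call` on transient errors only, with capped exponential backoff.

    Any other exception type (for example a schema violation) is not caught
    and therefore propagates on the first attempt without a retry.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                raise  # retries exhausted: the caller decides what happens next
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads retries out so workers do not retry in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
```

Usage is `await retry_transient(lambda: some_async_post(record))`, where some_async_post is whatever coroutine function performs the write.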

Duplicate delivery must collapse to one posting. Message brokers deliver at least once; a retry can fire after a write actually succeeded but before its acknowledgment landed. Without the idempotency key threaded through to the ERP, that double-posts an expense into the recoverable pool and inflates every tenant’s share. The key is what makes at-least-once delivery safe.

Documents that exhaust retries go to a dead-letter path, not oblivion. After the final attempt, a failing document is written to a dead-letter queue with its raw payload, exception trace, and idempotency key attached, so it surfaces on a reconciliation dashboard rather than vanishing. In CRE, a silently dropped invoice is an under-recovery that a tenant will never flag and an auditor eventually will.
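
A dead-letter record might look like the sketch below. The field names, the JSON-lines file, and the choice to store a path to the raw payload instead of the bytes themselves are all assumptions made for illustration; a real deployment would more likely write to the broker's own dead-letter queue.

```python
import json
import traceback
from pathlib import Path


def build_dead_letter_record(idempotency_key: str, source_path: str,
                             exc: BaseException) -> dict:
    """Bundle what a reviewer needs: the key, the payload location, the trace."""
    return {
        "idempotency_key": idempotency_key,
        "source_path": source_path,
        "error_type": type(exc).__name__,
        "trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }


def append_dead_letter(dlq_file: Path, record: dict) -> None:
    """Append one record as a JSON line; existing lines are never rewritten."""
    with dlq_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```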

Ordering guarantees are per-property, not global. Two invoices for different assets can process in any order, but corrections and credits against the same property should apply in receipt order. Partition the queue by property_id so that within a property the sequence is preserved, while distinct properties still run fully in parallel.
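
In-process, the same partitioning idea can be sketched as one sequential coroutine per property, with the properties themselves run concurrently. This is a simplified illustration: the (property_id, doc) tuples and the function names are hypothetical, and a production system would typically rely on the broker's partition key instead.

```python
import asyncio
from collections import defaultdict


async def apply_in_order(property_id: str, docs: list, applied: list) -> None:
    """Process one property's documents strictly one after another."""
    for doc in docs:
        await asyncio.sleep(0)  # stand-in for parse, validate, post
        applied.append((property_id, doc))


async def run_partitioned(docs: list) -> list:
    """Group by property, keep receipt order within a group, run groups in parallel."""
    partitions = defaultdict(list)
    for property_id, doc in docs:  # input is assumed to be in receipt order
        partitions[property_id].append(doc)

    applied: list = []
    await asyncio.gather(
        *(apply_in_order(pid, items, applied) for pid, items in partitions.items())
    )
    return applied
```

The interleaving between properties is unspecified, but within any one property the applied sequence matches the receipt sequence.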

Integration Points

The batch runner sits in the middle of the ingestion stack, and its output has to hand off cleanly to three downstream consumers.

The validated records it emits are the exact contract the reconciliation engine expects, so posting is a direct write — no reshaping between the batch layer and GL code mapping for CAM expenses, which assigns each normalized line item to the chart of accounts and splits it into recoverable versus non-recoverable pools. From there the coded expenses feed pro rata allocation, where each tenant’s share is computed against the recoverable pool the batch just populated.

The quarantine and dead-letter sets feed the review workflow described in automating vendor invoice classification; a document the batch could not validate or post is exactly a document a human needs to see, with its source and error already attached.

The audit trail is a first-class output, not a byproduct. Each document’s idempotency key, source hash, status transitions, and final posting are written to an append-only log, so an auditor can trace any recoverable-pool figure back to the specific invoice and the specific batch run that posted it. That chain of custody is what makes an automated close defensible.

Testing & Verification

Concurrency bugs are non-deterministic by nature, so the test strategy has to force the conditions that expose them rather than hope they appear.

Assert on isolation, not just totals. The key correctness property is that a poisoned document does not take down its siblings. Seed a batch with one deliberately malformed PDF among valid ones and assert that every valid document reaches POSTED while exactly the bad one lands in QUARANTINED:

import pytest


@pytest.mark.asyncio
async def test_one_bad_doc_does_not_abort_batch() -> None:
    docs = make_valid_docs(count=20) + [make_corrupt_doc()]

    results = await run_batch(docs, parse_concurrency=4, post_concurrency=8)

    posted = [r for r in results if r.status is DocStatus.POSTED]
    quarantined = [r for r in results if r.status is DocStatus.QUARANTINED]
    assert len(posted) == 20
    assert len(quarantined) == 1
    assert all(r.status is not DocStatus.PENDING for r in results)  # nothing stuck

Prove idempotency directly. Deliver the same document twice and assert the ERP saw one effective posting. A fake ERP that records idempotency keys makes this deterministic without a live system:

@pytest.mark.asyncio
async def test_duplicate_delivery_posts_once(fake_erp) -> None:
    doc = make_valid_docs(count=1)[0]

    await run_batch([doc, doc], parse_concurrency=2, post_concurrency=2)

    assert fake_erp.distinct_postings == 1  # duplicate collapsed by the key
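
The fake_erp fixture is not shown in this article. A minimal in-memory stand-in might look like the following; the class name, the post method, and the first-write-wins rule are assumptions, and only the distinct_postings attribute is taken from the test above.

```python
import asyncio


class FakeErp:
    """In-memory ERP double that honors idempotency keys."""

    def __init__(self) -> None:
        self._postings: dict = {}
        self.calls = 0

    async def post(self, payload: dict, *, idempotency_key: str) -> None:
        self.calls += 1
        # First write wins; a repeat with the same key changes nothing.
        self._postings.setdefault(idempotency_key, payload)

    @property
    def distinct_postings(self) -> int:
        return len(self._postings)


async def _deliver_twice(erp: FakeErp) -> None:
    await erp.post({"amount": "100.00"}, idempotency_key="k1")
    await erp.post({"amount": "100.00"}, idempotency_key="k1")


erp = FakeErp()
asyncio.run(_deliver_twice(erp))
```

Tracking calls separately from distinct_postings lets a test assert both that the duplicate really was delivered and that it collapsed to one posting.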

Verify money with Decimal, never float. Reconciliation tests must assert on exact quantized values; a tolerance-based float comparison hides exactly the cent drift the pipeline exists to prevent. Sum the posted line items and assert equality against a Decimal computed by hand:

from decimal import Decimal


def test_recoverable_pool_ties_out(posted_records) -> None:
    total = sum((li.amount for r in posted_records for li in r.line_items),
                start=Decimal("0.00"))
    assert total == Decimal("184920.57")  # exact, penny-accurate

Load-test the concurrency ceiling. Run a batch larger than the process pool with a memory probe to confirm peak resident memory tracks N_cpu × avg doc size and does not scale with batch size — the proof that the semaphore is actually bounding parallel parses rather than letting gather fan out unbounded.
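
Before reaching for a real memory profiler, a cheap proxy is to count how many parses are in flight at once. The probe below is a hypothetical test helper, with asyncio.sleep standing in for the actual parse; it checks the concurrency ceiling, not memory itself.

```python
import asyncio


class InFlightProbe:
    """Context manager that records the peak number of concurrent entries."""

    def __init__(self) -> None:
        self.current = 0
        self.peak = 0

    def __enter__(self) -> "InFlightProbe":
        self.current += 1
        self.peak = max(self.peak, self.current)
        return self

    def __exit__(self, *exc_info) -> None:
        self.current -= 1


async def fake_parse(gate: asyncio.Semaphore, probe: InFlightProbe) -> None:
    async with gate:
        with probe:
            await asyncio.sleep(0.001)  # stand-in for the CPU-bound parse


async def measure_peak(batch_size: int, limit: int) -> int:
    gate = asyncio.Semaphore(limit)
    probe = InFlightProbe()
    await asyncio.gather(*(fake_parse(gate, probe) for _ in range(batch_size)))
    return probe.peak


peak_small = asyncio.run(measure_peak(batch_size=20, limit=4))
peak_large = asyncio.run(measure_peak(batch_size=200, limit=4))
```

If the peak stays at the semaphore limit while the batch grows tenfold, the ceiling is holding; the memory probe described above then confirms the same thing in bytes.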

Where This Fits

Async batch processing is what lets the ingestion pipeline survive contact with real portfolio volume. By separating CPU-bound parsing from I/O-bound posting, bounding each with a concurrency budget derived from throughput and memory math, and converting every per-document failure into a quarantine or dead-letter outcome rather than a batch-wide abort, the runner turns month-end from a serial bottleneck into a predictable, auditable window. It depends on a correct single-document parser from pdfplumber extraction, enforces the contract set by schema validation for parsed expense data, and feeds validated records into GL code mapping and downstream pro rata allocation. The one architectural constraint worth memorizing: PDF parsing is CPU-bound and belongs in a ProcessPoolExecutor, while posting is I/O-bound and belongs on the event loop — conflate them under a single concurrency limit and the pipeline either starves or thrashes.

Frequently Asked Questions

Why run pdfplumber in a ProcessPoolExecutor instead of a thread pool? pdfplumber parses PDF drawing operators in pure Python, which holds the GIL, so multiple threads parsing at once run effectively serially. A process pool gives each parse its own interpreter and its own core, achieving real parallelism. Threads are the right tool only for the I/O-bound posting stage, where the GIL is released during network waits.

How do I stop one corrupted vendor PDF from killing the whole batch? Catch failures inside the per-document coroutine and return a QUARANTINED result instead of letting the exception propagate into asyncio.gather. A bare gather re-raises the first exception to its caller and discards every other result, so a single bad file would cost you the outcome of thousands of good ones even though their tasks keep running. Converting failure to a result value makes partial success the normal outcome and keeps the batch alive.

What concurrency limit should I set? Size the two tiers separately. Set parse concurrency to the number of physical cores, because that stage is CPU-bound and also memory-bounded. Set posting concurrency from Little’s Law — target throughput multiplied by average posting latency — so connections stay saturated without piling up. A single shared limit cannot serve both a CPU-bound and an I/O-bound workload.

How does the pipeline avoid double-posting an invoice during retries? Every document carries an idempotency key — a hash of its source bytes combined with vendor and invoice number — that is passed through to the ERP on every write. Message brokers deliver at least once, so a retry can fire after a write already succeeded; the key lets the ERP recognize the repeat and collapse it to a single GL posting rather than double-charging the recoverable pool.

Can this batch runner preserve ordering for corrections and credits? Yes, but only per property. Partition the queue by property_id so invoices, corrections, and credits for one asset apply in receipt order, while distinct properties still process fully in parallel. Global ordering across the whole portfolio is neither needed nor worth the throughput it would cost.

Related

PDF invoice extraction with Python and pdfplumber — the deterministic single-document parser the batch layer runs in parallel.
Schema validation for parsed expense data — the contract every document must satisfy before the runner will post it.
GL code mapping for CAM expenses — the downstream stage that codes and splits the records this batch emits.
Parsing complex CAM invoices with Tabula and pandas — an alternate CPU-bound parse path for table-heavy vendor statements.
Automated Invoice Parsing & Data Ingestion — the parent pipeline this concurrency layer belongs to.
