Document Processing Pipeline
Build a document processing pipeline that extracts data from PDFs, images, and scanned documents, then structures it for analysis and storage.
Why document processing is a business bottleneck
Every business drowns in documents — invoices, contracts, receipts, reports, forms, and correspondence. Most of these arrive as PDFs, scanned images, or emails. Extracting useful data from them is tedious manual work: someone opens each document, reads it, types the relevant data into a spreadsheet or system, and files the original. This process is slow, error-prone, and does not scale.

A document processing pipeline automates the extraction. Documents go in, structured data comes out. An invoice becomes a row in your accounting system with vendor name, amount, date, and line items. A contract becomes searchable metadata with parties, terms, dates, and obligations.

Ask Claude Code: Create a Node.js project with TypeScript for a document processing pipeline. Define the core types at src/types.ts. Document (id, filename, mimeType as application/pdf or image/png or image/jpeg, uploadedAt, processedAt optional, status as uploaded or processing or processed or failed, extractedData as JSON, confidence as number 0 to 1, originalPath, textContent). InvoiceData (vendorName, invoiceNumber, date, dueDate, lineItems as array of LineItem with description and quantity and unitPrice and total, subtotal, taxAmount, totalAmount, currency). ContractData (parties as array of strings, effectiveDate, expirationDate, value optional, keyTerms as array of strings, governingLaw, signatureRequired boolean). Set up the project structure: src/processors/ for document type processors, src/extractors/ for data extraction, and src/storage/ for saving results.

Ask Claude Code: Create a document upload endpoint at src/app/api/documents/route.ts that accepts file uploads via FormData, validates the file type and size (max 10MB), saves the file to a processing directory, creates a Document record with status uploaded, and returns the document ID.
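One way the core types in src/types.ts could look. The field names follow the prompt above; which fields are optional, and the string-versus-Date choices, are assumptions until your processing rules are fixed.

```typescript
// Core types for the pipeline, following the fields listed in the prompt.
// Optionality and string/Date choices are assumptions, not requirements.

export type DocumentStatus = "uploaded" | "processing" | "processed" | "failed";
export type SupportedMime = "application/pdf" | "image/png" | "image/jpeg";

export interface Document {
  id: string;
  filename: string;
  mimeType: SupportedMime;
  uploadedAt: Date;
  processedAt?: Date;      // set once processing finishes
  status: DocumentStatus;
  extractedData: unknown;  // narrowed to InvoiceData/ContractData after analysis
  confidence: number;      // 0 to 1
  originalPath: string;
  textContent: string;
}

export interface LineItem {
  description: string;
  quantity: number;
  unitPrice: number;
  total: number;
}

export interface InvoiceData {
  vendorName: string;
  invoiceNumber: string;
  date: string;
  dueDate: string;
  lineItems: LineItem[];
  subtotal: number;
  taxAmount: number;
  totalAmount: number;
  currency: string;
}

export interface ContractData {
  parties: string[];
  effectiveDate: string;
  expirationDate: string;
  value?: number;
  keyTerms: string[];
  governingLaw: string;
  signatureRequired: boolean;
}
```

Keeping extractedData as unknown on the Document record forces each extractor to narrow it explicitly, which catches type mismatches at the boundary between stages.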
PDF parsing and text extraction
PDFs are the most common document format in business. There are two types: text-based PDFs (created from digital documents — text is embedded and extractable) and image-based PDFs (scanned documents — the PDF contains an image, not text). Your pipeline must handle both.

Ask Claude Code: Install PDF processing libraries: npm install pdf-parse. Create a PDF processor at src/processors/pdf.ts. For text-based PDFs: use pdf-parse to extract the raw text content. The extracted text preserves the reading order but loses formatting (tables become jumbled text, columns merge).

For structured extraction from text PDFs, parse the raw text using patterns. Ask Claude Code: Create a text parser at src/extractors/text-parser.ts. Given raw text from a PDF, extract structured data using regex patterns and heuristics. For invoices: find the invoice number (usually near the top, after Invoice or Invoice Number or Inv), find dates (multiple date formats: DD/MM/YYYY, MM/DD/YYYY, YYYY-MM-DD, 15 March 2024), find monetary amounts (with currency symbols or codes), and find line items (usually in a table-like structure with description, quantity, price, and total columns).

Ask Claude Code: Build a table detector for PDF text. Tables in extracted text appear as lines with consistent spacing or delimiter patterns. Detect table boundaries, identify columns by analysing whitespace patterns, and parse each row into structured data. This is imperfect but handles 70 to 80 percent of common invoice and report formats.

For image-based PDFs, you need OCR (Optical Character Recognition). Ask Claude Code: Install Tesseract.js for OCR: npm install tesseract.js. Create an OCR processor at src/processors/ocr.ts. Convert the PDF page to an image, run Tesseract OCR on the image, and return the extracted text. Set the language to English (or configure for your needs). Tesseract returns a confidence score for each word — log the overall confidence and flag documents below 80 percent for manual review.
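The pattern-based text parser described above could start like this. These regexes cover only the formats named in the text; real invoices will need more variants, and the function names are illustrative.

```typescript
// Pattern-based extraction from raw PDF text: a sketch, not production-grade.

// Invoice number: "Invoice", "Invoice Number", or "Inv" followed by an id.
const INVOICE_NO = /\b(?:invoice(?:\s+number)?|inv)[:.\s#]*([A-Z0-9][A-Z0-9\/-]*)/i;

// Dates: DD/MM/YYYY or MM/DD/YYYY, YYYY-MM-DD, or "15 March 2024".
const DATE =
  /\b(\d{1,2}\/\d{1,2}\/\d{4}|\d{4}-\d{2}-\d{2}|\d{1,2}\s+(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{4})\b/gi;

// Monetary amounts with a currency symbol or ISO code.
const AMOUNT = /(?:[£$€]|GBP|USD|EUR)\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?/g;

export function extractInvoiceNumber(text: string): string | null {
  const m = text.match(INVOICE_NO);
  return m ? m[1] : null;
}

export function extractDates(text: string): string[] {
  return text.match(DATE) ?? [];
}

export function extractAmounts(text: string): string[] {
  return text.match(AMOUNT) ?? [];
}
```

Note the DD/MM/YYYY and MM/DD/YYYY patterns are textually identical, so the parser can find the date but not disambiguate the format; that decision belongs to the context-aware layer discussed later.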
Common error: OCR quality depends heavily on scan quality. Rotated pages, poor contrast, handwriting, and unusual fonts all reduce accuracy. Pre-process images before OCR: deskew (straighten rotated pages), increase contrast, and convert to grayscale.
Intelligent data extraction with patterns
Raw text from a PDF is a starting point. The real value is in structured extraction — turning unstructured text into typed data objects.

Ask Claude Code: Create an invoice extractor at src/extractors/invoice.ts. The extractor takes raw text and returns an InvoiceData object. Build it in layers.

Layer 1 — Header extraction: find the vendor name (usually the largest text at the top, or text near a logo), find the invoice number (pattern: Invoice followed by a number, common formats include INV-0001, 2024/001, and plain numbers), and find dates (the invoice date and due date, distinguished by context words like Due, Payment Due, and Due Date).

Layer 2 — Line item extraction: detect the table region (text between header keywords like Description and footer keywords like Subtotal), parse each line into components (description text, numeric quantity, unit price, and line total), and validate that quantity times unit price equals the line total (within rounding tolerance).

Layer 3 — Summary extraction: find the subtotal (sum before tax), tax amount (with tax rate if shown), and total amount (final amount due). Cross-validate: the sum of line item totals should equal the subtotal, and subtotal plus tax should equal the total.

Create a contract extractor. Ask Claude Code: Build src/extractors/contract.ts. Extract: party names (usually in the first paragraph, after phrases like between and and, or Party A and Party B), effective date and expiration date, contract value (monetary amounts with context indicating the total contract value, not a payment schedule amount), key terms and clauses (identify sections by heading patterns — numbered sections, bold text, all-caps headings), and termination conditions.

Add a confidence scoring system. Ask Claude Code: For each extracted field, calculate a confidence score. Exact regex matches get 0.9 to 1.0. Heuristic matches (likely the right field based on position and context) get 0.6 to 0.8. Ambiguous extractions (multiple possible matches) get 0.3 to 0.5. Missing fields get 0.0. The overall document confidence is the average of all field confidences. Flag documents below 0.7 for human review.

Common error: date format ambiguity. Is 01/02/2024 January 2nd or February 1st? Use context clues (other dates in the document, the vendor's country, the language) to determine the format. When ambiguous, flag for review.
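The confidence scheme above can be sketched in a few lines. The ranges and the 0.7 review threshold come from the text; mapping each match kind to the midpoint of its range, and using a flat average, are assumptions you may want to refine (for example, weighting totalAmount more heavily than a line item description).

```typescript
// Confidence scoring sketch. Thresholds follow the text; the midpoint
// mapping and flat average are one reasonable choice, not the only one.

type MatchKind = "exact" | "heuristic" | "ambiguous" | "missing";

interface FieldExtraction {
  name: string;
  value: string | null;
  kind: MatchKind;
}

// Map a match kind to the midpoint of the ranges described in the text.
function fieldConfidence(kind: MatchKind): number {
  switch (kind) {
    case "exact": return 0.95;     // exact regex match: 0.9 to 1.0
    case "heuristic": return 0.7;  // position/context match: 0.6 to 0.8
    case "ambiguous": return 0.4;  // multiple candidates: 0.3 to 0.5
    case "missing": return 0.0;
  }
}

// Overall document confidence: average of all field confidences.
export function documentConfidence(fields: FieldExtraction[]): number {
  if (fields.length === 0) return 0;
  const sum = fields.reduce((acc, f) => acc + fieldConfidence(f.kind), 0);
  return sum / fields.length;
}

export const needsReview = (fields: FieldExtraction[]): boolean =>
  documentConfidence(fields) < 0.7;
```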
Processing pipeline and queue management
Documents arrive continuously and processing takes time. A queue system ensures documents are processed reliably without overwhelming your resources.

Ask Claude Code: Create a processing pipeline at src/pipeline.ts. The pipeline has four stages: intake (validate the document, detect the type, and add to the processing queue), extraction (pull text from the document using the appropriate processor — PDF parser or OCR), analysis (run the appropriate extractor — invoice, contract, or generic — based on the document type), and storage (save the extracted data to the database and index for search).

Build the queue. Ask Claude Code: Create a simple queue system at src/lib/queue.ts using a database table. Each queue item has: documentId, stage (intake, extraction, analysis, storage), status (pending, processing, completed, failed), attempts (retry count), createdAt, startedAt, and completedAt. A worker function polls the queue every 5 seconds, picks the oldest pending item, processes it, and updates the status. If processing fails, increment the attempt count and retry up to 3 times with exponential backoff. After 3 failures, mark as failed and alert an operator.

Add document type detection. Ask Claude Code: Create a classifier at src/extractors/classifier.ts. Given the extracted text from a document, determine its type: invoice (contains keywords like invoice, amount due, payment terms, and line item patterns), contract (contains keywords like agreement, parties, hereby, terms and conditions, and signature blocks), receipt (contains keywords like receipt, paid, thank you for your purchase, and a single total amount), and report (structured sections with headings, chart references, and analysis language). Use a scoring system: count keyword matches for each type and select the type with the highest score. Add a manual override — if the classifier is wrong, a user can correct the type and the pipeline re-runs with the correct extractor.

Add batch processing. Ask Claude Code: Create a batch upload feature that accepts a ZIP file containing multiple documents. Extract the ZIP, add each document to the queue, and process them sequentially. Show a progress dashboard: X of Y documents processed, estimated time remaining, and any documents that failed.

Common error: processing PDFs is memory-intensive. A 100-page PDF can consume hundreds of megabytes during OCR. Process pages sequentially (not all at once), free memory between pages, and set a maximum page limit (50 pages) for automatic processing. Larger documents should be split or processed in chunks.
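The keyword-vote classifier described above is simple enough to sketch directly. The keyword lists for invoice, contract, and receipt follow the text; the report keywords and the tie-breaking rule (earlier type wins on equal score) are assumptions.

```typescript
// Keyword-vote classifier sketch: count keyword hits per type, pick the
// highest score. Report keywords are paraphrased from the text's
// description ("analysis language"), not quoted from it.

type DocType = "invoice" | "contract" | "receipt" | "report";

const KEYWORDS: Record<DocType, string[]> = {
  invoice: ["invoice", "amount due", "payment terms"],
  contract: ["agreement", "parties", "hereby", "terms and conditions"],
  receipt: ["receipt", "paid", "thank you for your purchase"],
  report: ["report", "analysis", "findings"],
};

export function classify(text: string): DocType {
  const lower = text.toLowerCase();
  let best: DocType = "invoice"; // earlier type wins ties (assumption)
  let bestScore = -1;
  for (const type of Object.keys(KEYWORDS) as DocType[]) {
    const score = KEYWORDS[type].filter((k) => lower.includes(k)).length;
    if (score > bestScore) {
      best = type;
      bestScore = score;
    }
  }
  return best;
}
```

A real classifier would also weight keywords (a signature block is stronger evidence of a contract than the word "parties") and expose the per-type scores so the manual override UI can show why a type was chosen.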
Search, review, and correction interface
Extracted data needs human review, especially for low-confidence extractions. A good review interface makes this fast and painless.

Ask Claude Code: Create a document review dashboard at src/app/dashboard/documents/page.tsx. Show a table of all processed documents with: filename, type (invoice, contract, receipt), processing date, confidence score (colour-coded: green above 0.8, yellow 0.5 to 0.8, red below 0.5), status, and a Review button. Sort by confidence ascending so the documents needing the most attention appear first.

Build the review interface. Ask Claude Code: When a reviewer clicks a document, show a split view. Left side: the original document rendered as an image (for PDFs, render each page as an image using pdf-to-img or a similar library). Right side: the extracted data in an editable form. For invoices: editable fields for vendor, invoice number, dates, and an editable table for line items. For contracts: editable fields for parties, dates, and key terms. Highlight extracted values on the original document — when the reviewer hovers over a field in the form, highlight the corresponding region in the document image where the value was found. This makes verification fast — glance at the highlight to confirm the extraction is correct.

Add a correction workflow. Ask Claude Code: When a reviewer corrects an extracted value, record both the original extraction and the correction. Store corrections as training data: over time, analyse which fields are most often corrected and for which document types. This feedback loop identifies systematic extraction weaknesses. If invoices from Vendor X always have the date extracted wrong, you can add a vendor-specific extraction rule.

Add full-text search across all processed documents. Ask Claude Code: Index the extracted text and structured data in a search system. Users can search for: a vendor name (find all invoices from Acme Corp), an amount (find the invoice for 4,500 pounds), a date range (all contracts signed in Q1 2024), or a keyword in the document text. Show search results as document cards with highlighted matching text and the key extracted fields.

Common error: rendering PDFs as images in the browser is resource-intensive. Generate the images server-side during processing and cache them. Serve pre-rendered page images rather than converting on every view.
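The shape of the search-with-highlighting feature can be sketched with a naive in-memory scan. A real deployment would use the database's full-text index (for example, Postgres tsvector); everything here, including the bracket-based highlight convention, is illustrative.

```typescript
// Naive full-text search sketch with match highlighting. Production
// search belongs in the database; this only shows the feature's shape.

interface IndexedDoc {
  id: string;
  text: string;
}

interface SearchHit {
  id: string;
  snippet: string; // matched region with the match wrapped in [ ]
}

export function search(docs: IndexedDoc[], query: string): SearchHit[] {
  const q = query.toLowerCase();
  const hits: SearchHit[] = [];
  for (const doc of docs) {
    const at = doc.text.toLowerCase().indexOf(q);
    if (at === -1) continue;
    // Build a short snippet around the match and bracket the matched text,
    // preserving the document's original casing.
    const start = Math.max(0, at - 30);
    const end = Math.min(doc.text.length, at + q.length + 30);
    const before = doc.text.slice(start, at);
    const match = doc.text.slice(at, at + q.length);
    const after = doc.text.slice(at + q.length, end);
    hits.push({ id: doc.id, snippet: `${before}[${match}]${after}` });
  }
  return hits;
}
```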
Integration, export, and deployment
The document processing pipeline becomes truly valuable when it connects to your other business systems.

Ask Claude Code: Build integration endpoints for common destinations. Accounting integration: when an invoice is processed and reviewed, push the data to your accounting system. Create an export function that formats invoice data as a CSV compatible with Xero, QuickBooks, or FreeAgent import formats (each has specific column requirements). Spreadsheet export: for any document type, export the extracted data as an Excel file (.xlsx) using a library like exceljs. Each document type gets a tailored spreadsheet format with proper column headers, formatting, and formulas (invoice totals should have a SUM formula). API endpoint: create a REST API at src/app/api/documents/[id]/data/route.ts that returns the extracted data as JSON. External systems can poll this endpoint or receive webhook notifications when processing completes.

Ask Claude Code: Create a webhook notification system. When a document finishes processing, send a POST request to a configured webhook URL with the document ID, type, confidence score, and extracted data summary. This enables real-time integration with tools like Zapier, Make, or custom systems.

Add email integration. Ask Claude Code: Create an email ingestion feature. Monitor a designated email inbox (invoices@yourdomain.com) using an email API. When an email arrives with attachments, download the attachments, add them to the processing queue, and link the extracted data to the email (sender, subject, date). This automates the most common document intake path — forward an invoice email and it is automatically processed.

Deploy the pipeline. Ask Claude Code: Configure the project for Vercel deployment with a PostgreSQL database. Set up file storage on Cloudflare R2 or AWS S3 for uploaded documents and rendered images. Configure the processing queue worker as a cron job that runs every minute. Set up monitoring: alert if the queue depth exceeds 50 (processing is falling behind), alert if the failure rate exceeds 10 percent, and send a daily summary of documents processed, average confidence, and any documents stuck in the queue.

Test with a batch of 20 real documents (mix of invoices, contracts, and receipts) and verify the extraction accuracy.
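The CSV export mentioned in the accounting integration can be sketched as below. The column names here are illustrative only; Xero, QuickBooks, and FreeAgent each define their own required headers, so map fields to the target system's import template.

```typescript
// CSV export sketch for reviewed invoices. Column names are placeholders;
// the target accounting system defines the real required headers.

interface InvoiceRow {
  vendorName: string;
  invoiceNumber: string;
  date: string;
  totalAmount: number;
  currency: string;
}

// Quote a CSV field when needed, escaping embedded quotes per RFC 4180.
function csvField(value: string | number): string {
  const s = String(value);
  return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

export function invoicesToCsv(rows: InvoiceRow[]): string {
  const header = ["Vendor", "InvoiceNumber", "Date", "Total", "Currency"];
  const lines = [header.join(",")];
  for (const r of rows) {
    lines.push(
      [r.vendorName, r.invoiceNumber, r.date, r.totalAmount, r.currency]
        .map(csvField)
        .join(",")
    );
  }
  return lines.join("\n");
}
```

Quoting matters more than it looks: vendor names routinely contain commas ("Acme, Ltd"), and an unquoted comma silently shifts every following column in the import.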