From Rules to Reasoning: Building a B2B Document Extractor with OCR and LLMs

Overview

Extracting structured data from B2B documents such as purchase orders, invoices, or delivery notes is a common challenge in enterprise automation. Traditionally, rule-based systems built on OCR (Optical Character Recognition) have been the go-to solution. With the rise of large language models (LLMs), a second approach has become practical: passing the OCR text to a local LLM served by Ollama and letting it parse the content with natural language understanding. This tutorial builds the same B2B document extractor twice, once with pytesseract and rule-based parsing and once with pytesseract plus an LLM (Llama 3) hosted on Ollama, using a realistic order scenario. You'll learn the strengths, weaknesses, and practical implementation of both methods.

Source: towardsdatascience.com

Prerequisites

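Before diving in, ensure your environment is ready: you need Python 3, the Tesseract OCR engine installed on your system (pytesseract is only a wrapper around it), and a running Ollama instance with the llama3 model available.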

Install dependencies with:

pip install pytesseract pillow opencv-python pandas requests

Step-by-Step Instructions

1. Sample Document Preparation

Create a sample document (or use a real scanned order) containing typical fields: Purchase Order Number, Vendor Name, Item Description, Quantity, Unit Price, and Total Amount. For demonstration, we'll use a clean image of an order form rather than a PDF, because pytesseract works on images. Save it as order.png.
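If you have no scanned order at hand, a synthetic one can be drawn with Pillow. The following is a minimal sketch, not the original author's test document: the function name make_sample_order, the field values, and the canvas size are all made up for illustration, and the labels simply follow the "PO#:", "Vendor:", and "Item: ... Qty: ... Price: ..." conventions assumed later in this tutorial.

```python
from PIL import Image, ImageDraw

# Hypothetical order content; the labels match the regexes used in section 2.2
SAMPLE_LINES = [
    "PURCHASE ORDER",
    "PO#: 4500012345",
    "Vendor: Acme Supplies Ltd",
    "Item: Widget Qty: 10 Price: 2.50",
    "Item: Gasket Qty: 4 Price: 1.25",
    "Total: 30.00",
]


def make_sample_order(scale=3):
    """Render the sample lines as a grayscale image and return it."""
    # Draw with Pillow's default font on a small white canvas
    base = Image.new('L', (400, 200), color=255)
    draw = ImageDraw.Draw(base)
    for row, line in enumerate(SAMPLE_LINES):
        draw.text((20, 20 + row * 25), line, fill=0)
    # Upscale so the glyphs are large enough for OCR to have a chance
    return base.resize((base.width * scale, base.height * scale))
```

Calling make_sample_order().save('order.png') writes the file used in the following steps. How well Tesseract reads the default font depends on your Pillow and Tesseract versions, so treat the result as a smoke test rather than a benchmark.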

2. Rule-Based Extraction with pytesseract

This approach relies on a fixed document template and regex patterns written against its labels. We'll break it into steps:

2.1 OCR the image

import pytesseract
from PIL import Image

image = Image.open('order.png')
text = pytesseract.image_to_string(image)
print(text)

2.2 Parse key fields using regular expressions

Assume the document has known labels like “PO#:”, “Vendor:”, etc.

import re

def extract_rule_based(text):
    """Extract PO number, vendor, and line items using fixed labels."""
    data = {}
    # PO number: allow hyphens so values like "PO-2024-001" are not truncated
    po_match = re.search(r'PO#:\s*([\w-]+)', text)
    if po_match:
        data['po_number'] = po_match.group(1)
    # Vendor: everything after the label up to the end of the line
    vendor_match = re.search(r'Vendor:\s*(.+)', text)
    if vendor_match:
        data['vendor'] = vendor_match.group(1).strip()
    # Items: assume one line per item like "Item: ... Qty: ... Price: ..."
    item_pattern = r'Item:\s*(.+?)\s*Qty:\s*(\d+)\s*Price:\s*(\d+(?:\.\d+)?)'
    items = re.findall(item_pattern, text)
    data['items'] = [
        {'desc': desc, 'qty': int(qty), 'price': float(price)}
        for desc, qty, price in items
    ]
    return data

result_rules = extract_rule_based(text)
print(result_rules)
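Label-based regexes only work if the OCR output reproduces the labels exactly, and Tesseract often inserts extra spaces or confuses the letter O with the digit 0. A small normalization pass before parsing can absorb some of that noise. The sketch below is an assumption about which errors to expect, not part of the original extractor; normalize_ocr_text and the sample string are hypothetical.

```python
import re


def normalize_ocr_text(text):
    """Reduce OCR noise that commonly breaks label-based regexes."""
    # Collapse runs of spaces and tabs, but keep line breaks between fields
    text = re.sub(r'[ \t]+', ' ', text)
    # Treat "PO #", "PO#:" and "P0 #:" (zero instead of O) as the same label
    text = re.sub(r'\bP[O0][ \t]*#[ \t]*:?[ \t]*', 'PO#: ', text,
                  flags=re.IGNORECASE)
    # Drop leading and trailing whitespace on every line
    return '\n'.join(line.strip() for line in text.splitlines())


# Hypothetical noisy OCR output for a small order
sample = ("P0 #  4500012345\n"
          "Vendor:   Acme Supplies Ltd  \n"
          "Item: Widget   Qty: 10 Price: 2.50")
clean = normalize_ocr_text(sample)
```

After this step the first line of clean reads "PO#: 4500012345", so the PO regex from section 2.2 matches again. Each normalization rule is one more piece of template knowledge to maintain, which is exactly the cost the LLM approach tries to avoid.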

2.3 Handle alignment issues

Use OpenCV to deskew the image before OCR if needed.

3. LLM-Based Extraction with Ollama & LLaMA 3

Instead of rigid patterns, we let the LLM interpret the raw text. The local LLM runs via Ollama’s API.

3.1 Send OCR text to Ollama

import requests
import json

def extract_with_llm(raw_text):
    prompt = f"""Extract the following fields from this order document and return them in JSON format with keys: po_number, vendor, items (list of objects with desc, qty, price). Document text:
{raw_text}
"""
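    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            'model': 'llama3',
            'prompt': prompt,
            'stream': False,
            'format': 'json',  # ask Ollama to constrain the reply to valid JSON
        },
        timeout=120,  # local generation can be slow; fail instead of hanging
    )
    response.raise_for_status()
    result = response.json()['response']
    # Take the outermost JSON object in case the model wraps it in prose
    try:
        start = result.index('{')
        end = result.rindex('}') + 1
        return json.loads(result[start:end])
    except ValueError:  # no braces found, or invalid JSON (JSONDecodeError)
        return {'error': 'Failed to parse LLM output', 'raw': result}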

3.2 Process the image

# Same OCR step as in 2.1; only the parsing stage changes
raw_text = pytesseract.image_to_string(image)
result_llm = extract_with_llm(raw_text)
print(result_llm)
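Unlike the regex parser, the LLM gives no guarantee about the shape of what it returns: keys can be missing, and numbers can come back as strings. It is worth checking the result before passing it downstream. The following is a minimal sketch of such a check, assuming the key names requested in the prompt above; validate_extraction is a hypothetical helper, and the rules it enforces (positive integer quantity, non-negative price) are assumptions about what a valid order looks like.

```python
def validate_extraction(data):
    """Return a list of problems found in an extraction result (empty if OK)."""
    # extract_with_llm signals a parse failure with an 'error' key
    if 'error' in data:
        return [str(data['error'])]
    problems = []
    for key in ('po_number', 'vendor'):
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f'missing or empty field: {key}')
    items = data.get('items')
    if not isinstance(items, list) or not items:
        problems.append('items must be a non-empty list')
        return problems
    for idx, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f'items[{idx}] is not an object')
            continue
        if not isinstance(item.get('desc'), str):
            problems.append(f'items[{idx}].desc must be a string')
        qty = item.get('qty')
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(qty, bool) or not isinstance(qty, int) or qty <= 0:
            problems.append(f'items[{idx}].qty must be a positive integer')
        price = item.get('price')
        if isinstance(price, bool) or not isinstance(price, (int, float)) or price < 0:
            problems.append(f'items[{idx}].price must be a non-negative number')
    return problems
```

An empty list means the result has the expected structure; it says nothing about whether the values are correct, only that they are usable.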

4. Testing Both on a Realistic B2B Order

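Test both extractors on documents that deviate from the clean template, for example with relocated labels, handwritten numbers, or rotated text. The rule-based method struggles as soon as labels move or change wording; the LLM is more resilient to layout changes but may hallucinate values that are not in the document.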
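A cheap guard against hallucinated values is to check that every extracted string actually occurs in the OCR text the model was given. The sketch below is one possible heuristic, not part of the original tutorial; find_ungrounded_values is a hypothetical helper, and it deliberately ignores numeric fields because the model may legitimately reformat them (for example 2.50 as 2.5).

```python
import re


def _squash(value):
    """Lowercase and remove all whitespace so spacing differences don't matter."""
    return re.sub(r'\s+', '', str(value)).lower()


def find_ungrounded_values(data, raw_text):
    """Return the names of string fields whose value is absent from raw_text."""
    haystack = _squash(raw_text)
    suspicious = []
    for key in ('po_number', 'vendor'):
        value = data.get(key)
        if value and _squash(value) not in haystack:
            suspicious.append(key)
    for idx, item in enumerate(data.get('items') or []):
        desc = item.get('desc') if isinstance(item, dict) else None
        if desc and _squash(desc) not in haystack:
            suspicious.append(f'items[{idx}].desc')
    return suspicious
```

Fields flagged this way are candidates for manual review. The check produces false alarms when the model fixes a genuine OCR typo, so it works better as a warning than as a hard rejection.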


Common Mistakes & Pitfalls

Summary

Both rule-based and LLM-based document extractors have their place. The rule-based approach is deterministic, fast, and cheap, but brittle: any change in labels or layout means editing regexes. The LLM approach offers flexibility and tolerance for layout changes, but requires careful prompt design, a locally hosted model, more computing power, and validation of its output. For B2B documents with stable layouts, stick with rules. For variable or messy documents, an LLM can save countless hours of manual rule maintenance. This tutorial gave you a working foundation for both, so choose based on your use case.
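The two approaches can also be combined: run the cheap rule-based parser first and call the LLM only when it fails to fill the required fields. The following sketch illustrates that routing under assumptions; extract_hybrid, is_complete, and the choice of required fields are hypothetical, and the two extractors are passed in as plain functions so the routing logic can be tried without Tesseract or Ollama.

```python
REQUIRED_FIELDS = ('po_number', 'vendor', 'items')


def is_complete(data):
    """True if every required field is present and non-empty."""
    return all(data.get(field) for field in REQUIRED_FIELDS)


def extract_hybrid(raw_text, rule_extractor, llm_extractor):
    """Try the rule-based extractor first; fall back to the LLM if it comes up short."""
    result = rule_extractor(raw_text)
    if is_complete(result):
        return {'method': 'rules', 'data': result}
    # Rules missed something (e.g. labels moved), so pay for the slower LLM call
    return {'method': 'llm', 'data': llm_extractor(raw_text)}
```

With the functions from this tutorial, the call would be extract_hybrid(raw_text, extract_rule_based, extract_with_llm). Recording which method produced each result also tells you how often your templates are drifting.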
