From Rules to Reasoning: Building a B2B Document Extractor with OCR and LLMs

Overview

Extracting structured data from B2B documents like purchase orders, invoices, or delivery notes is a common challenge in enterprise automation. Traditionally, rule-based systems using OCR (Optical Character Recognition) have been the go‑to solution. However, with the rise of large language models (LLMs), a new approach emerges: using a local LLM via Ollama to parse document content with natural language understanding. This tutorial rebuilds the same B2B document extractor twice—once with pytesseract and rule-based parsing, and once with an LLM (LLaMA 3) hosted on Ollama—using a realistic order scenario. You’ll learn the strengths, weaknesses, and practical implementation of both methods.

Source: towardsdatascience.com

Prerequisites

Before diving in, ensure your environment is ready:

Python 3.9+ installed
Tesseract OCR engine, installed and available on your PATH
Ollama, installed and running, with the LLaMA 3 model pulled (ollama pull llama3)
Python packages: pytesseract, Pillow, opencv-python, pandas, requests (for Ollama API)

Install dependencies with:

pip install pytesseract pillow opencv-python pandas requests

Step-by-Step Instructions

1. Sample Document Preparation

Create a sample order document (or scan a real one) containing typical fields: Purchase Order Number, Vendor Name, Item Description, Quantity, Unit Price, and Total Amount. The code below reads an image rather than a PDF, so export or scan the document as an image. For demonstration, we’ll use a clean image of an order form saved as order.png.

2. Rule-Based Extraction with pytesseract

This approach relies on a known document template and regex patterns anchored to its labels. We’ll break it into steps:

2.1 OCR the image

import pytesseract
from PIL import Image

image = Image.open('order.png')
text = pytesseract.image_to_string(image)
print(text)

2.2 Parse key fields using regular expressions

Assume the document has known labels like “PO#:”, “Vendor:”, etc.

import re

def extract_rule_based(text):
    data = {}
    # PO number
    po_match = re.search(r'PO#:\s*([\w-]+)', text)  # allow hyphenated numbers such as PO-2024-001
    if po_match:
        data['po_number'] = po_match.group(1)
    # Vendor
    vendor_match = re.search(r'Vendor:\s*(.+)', text)
    if vendor_match:
        data['vendor'] = vendor_match.group(1).strip()
    # Items: assume a table with lines like "Item: ... Qty: ... Price: ..."
    item_pattern = r'Item:\s*(.+?)\s*Qty:\s*(\d+)\s*Price:\s*(\d+\.?\d*)'
    items = re.findall(item_pattern, text)
    data['items'] = [{'desc': i[0], 'qty': int(i[1]), 'price': float(i[2])} for i in items]
    return data

result_rules = extract_rule_based(text)
print(result_rules)
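
To see how tightly these patterns are tied to the template, the sketch below runs the item pattern from step 2.2 against two hand-written strings. Both strings are hypothetical stand-ins for OCR output, not text from a real order; the second differs only in its labels ("Quantity" and "Unit Price" instead of "Qty" and "Price").

```python
import re

# Same item pattern as in extract_rule_based above.
ITEM_PATTERN = r'Item:\s*(.+?)\s*Qty:\s*(\d+)\s*Price:\s*(\d+\.?\d*)'

# Hypothetical OCR output that follows the expected template.
sample_ok = (
    "Item: Steel bolts M8 Qty: 200 Price: 0.15\n"
    "Item: Washer 8mm Qty: 500 Price: 0.02"
)

# Hypothetical variant where the vendor spells the labels out.
sample_relabelled = "Item: Steel bolts M8 Quantity: 200 Unit Price: 0.15"

matches_ok = re.findall(ITEM_PATTERN, sample_ok)
matches_relabelled = re.findall(ITEM_PATTERN, sample_relabelled)

print(matches_ok)          # [('Steel bolts M8', '200', '0.15'), ('Washer 8mm', '500', '0.02')]
print(matches_relabelled)  # []
```

The second call returns an empty list without raising an error, so a relabelled document is silently parsed as an order with no items unless the caller checks for that.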

2.3 Handle alignment issues

Use OpenCV to deskew the image before OCR if needed.
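
One possible way to do this is a projection-profile search: rotate the binarized image through a range of small angles and keep the angle at which the row sums vary the most, since text lines aligned with the image rows produce a sharply peaked profile. This is a minimal sketch and not the only deskewing method; the function names, the ±10 degree search range, and the 0.5 degree step are assumptions chosen for illustration.

```python
def candidate_angles(max_angle=10.0, step=0.5):
    """Angles in degrees from -max_angle to +max_angle, inclusive."""
    count = int(round(2 * max_angle / step)) + 1
    return [-max_angle + i * step for i in range(count)]


def deskew(image_path, max_angle=10.0, step=0.5):
    """Return a deskewed grayscale copy (NumPy array) of the image at image_path."""
    # Imported here so candidate_angles() stays usable without OpenCV installed.
    import cv2

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)

    # Otsu threshold, inverted so that ink is non-zero on a black background.
    _, ink = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    height, width = gray.shape
    center = (width / 2, height / 2)

    def rotated(img, angle, border):
        matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
        return cv2.warpAffine(img, matrix, (width, height),
                              flags=cv2.INTER_LINEAR, borderValue=border)

    def score(angle):
        # Variance of the row sums: highest when text lines run along the rows.
        rows = rotated(ink, angle, 0).sum(axis=1, dtype=float)
        return float(rows.var())

    best = max(candidate_angles(max_angle, step), key=score)
    # Apply the winning rotation to the grayscale image, padding with white.
    return rotated(gray, best, 255)
```

The returned array can be passed straight to pytesseract.image_to_string, which accepts NumPy arrays as well as PIL images, for example pytesseract.image_to_string(deskew('order.png')). The search is brute force, so a coarser step or a downscaled copy may be preferable for large scans.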

3. LLM-Based Extraction with Ollama & LLaMA 3

Instead of rigid patterns, we let the LLM interpret the raw text. The local LLM runs via Ollama’s API.

3.1 Send OCR text to Ollama

import requests
import json

def extract_with_llm(raw_text):
    prompt = f"""Extract the following fields from this order document and return only a JSON object with keys: po_number, vendor, items (list of objects with desc, qty, price). Use null for any field that is missing; do not guess. Document text:
{raw_text}
"""
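    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            'model': 'llama3',
            'prompt': prompt,
            'stream': False
        },
        timeout=120  # local inference can be slow; avoid hanging forever
    )
    response.raise_for_status()  # surface HTTP errors (e.g. model not pulled)
    result = response.json()['response']
    # The model may wrap the JSON in prose, so take the outermost {...} span
    try:
        start = result.index('{')
        end = result.rindex('}') + 1
        return json.loads(result[start:end])
    except ValueError:
        # str.index raises ValueError; json.JSONDecodeError is a subclass of it
        return {'error': 'Failed to parse LLM output', 'raw': result}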
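
Searching the reply for braces works, but Ollama's generate endpoint also accepts a format field that constrains the reply to valid JSON, and an options field for sampling settings. The sketch below is a variant of the function above under those assumptions; build_payload and extract_with_llm_json are hypothetical names, and temperature 0 is a choice made here to reduce run-to-run variation, not something the original code sets.

```python
import json

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(raw_text, model="llama3"):
    """Build the request body for Ollama's /api/generate endpoint."""
    prompt = (
        "Extract po_number, vendor, and items (a list of objects with "
        "desc, qty, price) from this order document. Respond with a single "
        "JSON object and use null for any field that is not present.\n\n"
        "Document text:\n" + raw_text
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "format": "json",                # ask Ollama to emit valid JSON only
        "options": {"temperature": 0},   # assumption: favour repeatable output
    }


def extract_with_llm_json(raw_text, timeout=120):
    """Call the local Ollama server and parse its JSON reply."""
    # Imported here so build_payload() can be inspected without requests installed.
    import requests

    response = requests.post(OLLAMA_URL, json=build_payload(raw_text), timeout=timeout)
    response.raise_for_status()
    # With format="json" the reply text should already be a JSON document.
    return json.loads(response.json()["response"])
```

Valid JSON is not the same as correct JSON: the model can still omit keys or invent values, so the result needs the same checks as before.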

3.2 Process the image

# Same OCR step as in 2.1
raw_text = pytesseract.image_to_string(image)
result_llm = extract_with_llm(raw_text)
print(result_llm)

4. Testing Both on a Realistic B2B Order

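Test both extractors on documents that vary in layout and quality (e.g., moved or reworded labels, handwritten numbers, rotated text). The rule-based method fails when labels move or change wording; the LLM is more resilient to such changes but may hallucinate values that do not appear in the document.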
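
A cheap way to catch some hallucinations is to check that each extracted string actually occurs in the OCR text, and to compare the two extractors' outputs field by field. The helpers below are a minimal sketch of that idea; the function names are hypothetical, and the substring check is an assumption that only suits values copied verbatim from the document (it will also flag values the LLM has legitimately normalized).

```python
def _norm(value):
    """Lowercase and collapse whitespace so OCR line breaks do not matter."""
    return " ".join(str(value).split()).lower()


def ungrounded_values(data, raw_text):
    """List (field, value) pairs whose value never occurs in the OCR text."""
    haystack = _norm(raw_text)
    suspects = []
    for key in ("po_number", "vendor"):
        value = data.get(key)
        if value is not None and _norm(value) not in haystack:
            suspects.append((key, value))
    for index, item in enumerate(data.get("items") or []):
        desc = item.get("desc") if isinstance(item, dict) else None
        if desc is not None and _norm(desc) not in haystack:
            suspects.append((f"items[{index}].desc", desc))
    return suspects


def disagreements(rules_result, llm_result, keys=("po_number", "vendor")):
    """Map each key to (rules value, LLM value) where the two extractors differ."""
    return {
        key: (rules_result.get(key), llm_result.get(key))
        for key in keys
        if rules_result.get(key) != llm_result.get(key)
    }
```

For example, ungrounded_values(result_llm, raw_text) returns an empty list when every checked value can be found in the OCR text, and disagreements(result_rules, result_llm) highlights the fields worth a manual look.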

Common Mistakes & Pitfalls

Ignoring OCR quality: Garbage in, garbage out. Both methods require clean OCR. Preprocess images (binarization, contrast adjustment).
Over‑fitting regex patterns: Rule-based parsers break with tiny layout changes. Always test on multiple document variants.
Prompt engineering failures: LLMs need explicit instructions. Forgetting to restrict output format can yield verbose or malformed responses.
Not handling edge cases: Missing fields, multiple tables, or non‑English text. Rule systems need fallbacks; LLMs may need retry logic.
Performance overhead: LLM inference is slower and more resource‑intensive than regex. For high‑volume processing, rules are faster.
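
The retry logic mentioned above can be sketched as a small wrapper around any extractor function. This is an illustration under assumptions: extract_with_retry and has_required_fields are hypothetical names, the validity check is deliberately minimal, and three attempts with a one-second pause are arbitrary defaults.

```python
import time


def has_required_fields(result):
    """Minimal structural check: no error marker, a PO number, and an items list."""
    return (
        isinstance(result, dict)
        and "error" not in result
        and bool(result.get("po_number"))
        and isinstance(result.get("items"), list)
    )


def extract_with_retry(extract, raw_text, is_valid=has_required_fields,
                       attempts=3, delay=1.0):
    """Call extract(raw_text) until is_valid accepts the result or attempts run out."""
    last = None
    for attempt in range(1, attempts + 1):
        last = extract(raw_text)
        if is_valid(last):
            return last
        if attempt < attempts:
            time.sleep(delay)  # brief pause before asking the model again
    return {"error": f"No valid result after {attempts} attempts", "last": last}
```

A call such as extract_with_retry(extract_with_llm, raw_text) reuses the function from step 3.1. Retrying only helps when the model's output varies between calls; with fully deterministic settings, a failed attempt will fail the same way again, and a changed prompt or a fallback to manual review is the better option.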

Summary

Both rule‑based and LLM‑based document extractors have their place. The rule‑based approach is deterministic, fast, and cheap, but brittle. The LLM approach tolerates layout and wording changes, but requires careful prompt design, a locally hosted model, and more computing power, and its output must be checked for hallucinated values. For B2B documents with stable layouts, stick with rules. For variable or messy documents, an LLM can save hours of manual rule maintenance. This tutorial gave you a working foundation for both; choose based on your use case.
