Testing LLM-Generated Code: A Practical Guide to Overcoming Non-Determinism


Overview

Traditional software testing assumes you have full control and visibility over the code. But what happens when the code is generated by a large language model (LLM) or runs inside an agentic system like an MCP (Model Context Protocol) server? The code itself becomes a black box—you don't know its exact logic because it's probabilistically produced on the fly. This guide shows how to shift your testing paradigm: move away from deterministic unit tests and toward behavior-driven, property-based approaches that embrace non-determinism. You'll learn about data locality, data construction, and practical techniques for validating LLM-driven agents.

Source: stackoverflow.blog

Prerequisites

  • Basic knowledge of Python (examples use pytest and hypothesis)
  • Familiarity with LLM APIs (e.g., OpenAI, Anthropic)
  • Understanding of MCP servers (optional but helpful)
  • Installed: pytest, hypothesis, pytest-benchmark (for performance)
  • An LLM API key for execution (consider mocking in CI)

Step-by-Step Testing Strategy

1. Accept Non-Determinism as a Feature

LLMs produce different output for the same input because of sampling: any nonzero temperature introduces randomness, and in practice even temperature 0 is not perfectly deterministic. Stop trying to assert exact equality. Instead, define invariants: properties that must hold regardless of the exact output. For example:

# Don't do this:
def test_greeting():
    result = llm_chat("Say hello")
    assert result == "Hello!"  # Will fail randomly

# Do this:
from hypothesis import given, strategies as st

def is_greeting(text):
    return any(word in text.lower() for word in ["hello", "hi", "hey"])

# Sample from prompts that all request a greeting, so the property is
# exercised on every example rather than only when random text happens
# to mention "hello":
@given(st.sampled_from(["Say hello", "Greet me", "Please say hi to the user"]))
def test_greeting_property(prompt):
    result = llm_chat(prompt)
    # Property: a greeting request always yields a non-empty greeting
    assert len(result) > 0 and is_greeting(result)

2. Use Property-Based Testing with Hypothesis

Property-based testing generates many random inputs, searching for cases that violate your invariants. This is a natural fit when the code under test is non-deterministic. Example for an MCP server that returns a JSON response:

from hypothesis import given, strategies as st
import json

@given(st.text(min_size=1, max_size=200))
def test_mcp_server_response(prompt):
    response = call_mcp_server(prompt)
    data = json.loads(response)
    # Property 1: always a valid JSON object
    assert isinstance(data, dict)
    # Property 2: contains key 'result' or 'error'
    assert 'result' in data or 'error' in data
    # Property 3: if no error, 'result' is a string
    if 'result' in data:
        assert isinstance(data['result'], str)
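
One practical wrinkle: each hypothesis example above triggers a live server call, which is slow and can trip hypothesis's default 200 ms per-example deadline. A minimal sketch of capping the run with hypothesis's settings decorator (same invariant as above, fewer examples):

from hypothesis import given, settings, strategies as st
import json

@settings(max_examples=20, deadline=None)  # fewer examples, no per-example deadline
@given(st.text(min_size=1, max_size=200))
def test_mcp_server_response_capped(prompt):
    data = json.loads(call_mcp_server(prompt))
    assert 'result' in data or 'error' in data  # same invariant as above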

3. Leverage Data Locality for Reproducibility

Since LLM outputs vary, store the data used for testing rather than the expected output. Data locality means keeping test inputs and seed values close to the test code. For example, use a fixed random seed for LLM calls during tests:

import os
os.environ['LLM_SEED'] = '42'  # Only effective if your LLM library reads a seed

# Or use a mock that returns predetermined data based on input features:
from unittest.mock import patch
from zlib import crc32
import my_mcp

def test_with_mocked_llm():
    with patch('my_mcp.llm_call') as mock_llm:
        # crc32 is stable across runs, unlike the salted built-in hash()
        mock_llm.side_effect = lambda prompt: f"Mocked response for {crc32(prompt.encode()) % 100}"
        result = my_mcp.handler("test input")
        assert "Mocked" in result

4. Construct Test Data That Covers Edge Cases

When code is cheap to generate, well-constructed test data becomes your main lever. Instead of writing test cases by hand, generate them from a schema. This is especially useful for MCP servers that operate on structured data. Example:

from faker import Faker

def generate_test_data(n):
    fake = Faker()
    for _ in range(n):
        yield {
            "user_id": fake.uuid4(),
            "question": fake.sentence(),
            "context": fake.text(max_nb_chars=500),
        }

def test_mcp_on_generated_data():
    for data in generate_test_data(100):
        response = call_mcp_server(data["question"] + "\n" + data["context"])
        # Crude relevance proxy: the response echoes a word from the
        # question or explicitly declines; tune this check to your domain
        first_word = data["question"].split()[0].lower()
        assert first_word in response.lower() or "i don't know" in response.lower()
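
Faker produces realistic-looking records; hypothesis can instead derive the same record shape from a declared schema and will also probe edge cases such as empty strings, unicode, and very long text. A minimal sketch using hypothesis's fixed_dictionaries strategy, assuming the same call_mcp_server entry point:

from hypothesis import given, settings, strategies as st

record = st.fixed_dictionaries({
    "user_id": st.uuids().map(str),
    "question": st.text(min_size=1, max_size=120),
    "context": st.text(max_size=500),
})

@settings(max_examples=25, deadline=None)  # live calls are slow
@given(record)
def test_mcp_on_schema_data(data):
    response = call_mcp_server(data["question"] + "\n" + data["context"])
    # Weaker but universal property: the server always answers with some text
    assert isinstance(response, str) and response.strip()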

5. Assert on Behavior, Not Implementation

For agentic systems that can take different paths, test the observable effects. If your MCP server creates a file, check the file's properties, not the code that wrote it. Example:

import datetime

def test_mcp_file_creation(tmp_path):
    output_path = tmp_path / "output.txt"
    # Tell the agent where and what to write, then assert on the effect
    call_mcp_server(f"Create a file at {output_path} containing today's date in YYYY-MM-DD format")
    assert output_path.exists()
    content = output_path.read_text()
    today = datetime.date.today().isoformat()
    assert today in content  # Behavioral assertion: what it did, not how
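
The same idea extends to effects other than files. A minimal sketch, assuming a hypothetical send_email tool boundary inside my_mcp: spy on the boundary and assert on what the agent did, not which code path it took:

from unittest.mock import patch
import my_mcp

def test_mcp_sends_email():
    with patch('my_mcp.tools.send_email') as send_email:  # hypothetical tool boundary
        my_mcp.handler("Email the weekly report to ops@example.com")
        # Behavioral assertions: the effect happened, addressed correctly
        send_email.assert_called_once()
        assert "ops@example.com" in str(send_email.call_args)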

Common Mistakes

  • Over-specifying outputs: Expecting exact string matches from LLMs. Instead, use fuzzy matching or set membership.
  • Ignoring stochastic failures: A test that passes 99% of the time may hide a bug. Repeat runs (e.g., pytest --count=10 with the pytest-repeat plugin) or apply a statistical pass-rate check, as sketched after this list.
  • Not seeding the random generator: Without a fixed seed, tests are not reproducible. Always set a seed when possible.
  • Mocking everything: Over-mocking defeats the purpose of testing non-deterministic systems. Prefer integration tests with controlled randomness.
  • Forgetting data locality: Store test data as close to the test as possible (e.g., tests/data/ or inline fixtures) to avoid reliance on external databases that may change.
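
A minimal sketch of that statistical check, reusing llm_chat and is_greeting from step 1: run the stochastic behavior several times and require a minimum pass rate rather than perfection:

def test_greeting_pass_rate():
    runs = 20
    passes = sum(is_greeting(llm_chat("Say hello")) for _ in range(runs))
    # Tolerate rare sampling outliers, but fail if quality broadly regresses
    assert passes / runs >= 0.9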

Summary

Testing code you didn't write, especially LLM-generated or agentic code, requires letting go of deterministic assertions and embracing property-based testing, data construction, and behavioral validation. By defining invariants, generating test data from schemas, and focusing on observable outputs rather than internal logic, you can build robust test suites that catch regressions without breaking on every run. Remember: data locality and data construction are your superpowers when source code is cheap and uncertain.