July 4, 2026

Property-Based Testing: Write Fewer Tests, Find More Bugs

What if you could replace hundreds of hand-written test cases with a few lines of code and find more bugs? Property-based testing does exactly that — and it's the perfect testing paradigm for the age of AI agents.

Testing
Property-Based Testing
Python
TypeScript
AI Agents
Software Quality

In 2000, Koen Claessen and John Hughes published a paper that would quietly change how a subset of the programming world thinks about testing. Their solution to a simple problem (how to test functions thoroughly without writing hundreds of test cases) was elegant enough to endure for over two decades.

It was called QuickCheck. And it introduced the world to property-based testing.

Today, property-based testing is no longer a niche Haskell curiosity. It’s a mature practice available in virtually every language, and it’s about to become even more important as AI agents flood our codebases with generated code.

The Problem with Example-Based Testing

Most developers write tests like this:

def test_sort():
    assert sort([3, 1, 2]) == [1, 2, 3]
    assert sort([]) == []
    assert sort([5]) == [5]

def test_parse_date():
    assert parse_date("2024-01-15") == date(2024, 1, 15)
    assert parse_date("1999-12-31") == date(1999, 12, 31)

This is example-based testing: you write specific inputs and their expected outputs. It’s intuitive, readable, and perfectly fine for many cases. But it has a fundamental limitation: you can only test what you think of.

Consider this real bug that was discovered through property-based testing in NumPy’s wald function (which samples from a Wald/inverse Gaussian distribution):

Property: samples should always be positive
Bug: catastrophic cancellation caused negative samples
Status: Patch merged in PR #29609

Would you have thought to test for negative values from a distribution that’s supposed to be positive? Probably not — you’d test the obvious cases. But property-based testing doesn’t rely on your intuition. It tests the property itself: “this function should never produce output X.”

The Philosophy: Test What Should Be True

Property-based testing flips the approach. Instead of specifying inputs and expected outputs, you specify properties — invariants that should hold for all valid inputs. The testing framework then generates hundreds (or thousands) of random test cases and verifies that your properties hold.

The key insight is this: properties are simpler and more general than test cases.

Consider testing a sorting function:

Example-based: "sort([3,1,2]) should return [1,2,3]"
Property-based: "the output should always be sorted and should contain the same elements"

The property captures an infinite number of test cases in a single statement. It doesn’t care about the specific values — it cares about the relationship between input and output.
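Both halves of that property can be checked with nothing but the standard library. The sketch below is a hand-rolled loop, not a real PBT framework; `check_sort_properties`, `dedup_sort`, and their parameters are names invented here for illustration.

```python
import random
from collections import Counter


def check_sort_properties(sort_fn, runs=200, seed=0):
    """Run sort_fn on random integer lists; return the first counterexample or None."""
    rng = random.Random(seed)  # seeded so a failure can be reproduced
    for _ in range(runs):
        size = rng.randint(0, 20)
        data = [rng.randint(-50, 50) for _ in range(size)]
        out = sort_fn(list(data))  # pass a copy so the input stays intact

        # Property 1: every element is <= its successor.
        is_ordered = all(a <= b for a, b in zip(out, out[1:]))
        # Property 2: the output is a permutation of the input (same multiset).
        same_elements = Counter(out) == Counter(data)

        if not (is_ordered and same_elements):
            return data
    return None


def dedup_sort(xs):
    """Deliberately buggy sort used for illustration: it drops duplicates."""
    return sorted(set(xs))


print(check_sort_properties(sorted))      # None: the built-in satisfies both properties
print(check_sort_properties(dedup_sort))  # a list that contains a repeated value
```

Note that the "is sorted" check alone would let `dedup_sort` through; it is the "same elements" half that catches it.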

This is where the philosophy diverges from traditional testing. Example-based testing asks: “Does this function produce the right answer for these cases?” Property-based testing asks: “What must always be true about this function’s behavior?”

A Quick History

QuickCheck was inspired by a simple observation: many functions have properties that are easier to state than to compute the expected output for.

For example, consider a prime factorization function factorize(n). To write an example-based test, you need to know the correct answer:

-- Example-based: requires knowing the answer
assert factorize 12 == [2, 2, 3]
assert factorize 15 == [3, 5]

But the property can be stated without computing any expected output:

-- Property-based: no expected outputs needed (Positive comes from Test.QuickCheck)
prop_factorize :: Positive Integer -> Bool
prop_factorize (Positive n) = product (factorize n) == n

The product of the factors must equal the original number. That’s the property. You don’t need to know the factors in advance — you just need to know what’s true about them.
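Here is the same idea as a small Python sketch. It checks a range of inputs exhaustively instead of sampling at random, and every name in it (`factorize`, `lazy_factorize`, and the two property helpers) is made up for illustration. It also shows a caveat: the product property on its own is satisfied by a function that just returns `[n]`, so in practice you would pair it with a second property such as "every factor is prime".

```python
import math


def factorize(n):
    """Trial-division prime factorization; returns factors in ascending order."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors


def is_prime(k):
    return k >= 2 and all(k % d for d in range(2, math.isqrt(k) + 1))


def lazy_factorize(n):
    """Not a real factorization, yet it satisfies the product property."""
    return [n]


def product_holds(fn, n):
    return math.prod(fn(n)) == n


def factors_are_prime(fn, n):
    return all(is_prime(f) for f in fn(n))


inputs = range(2, 2000)
print(all(product_holds(factorize, n) for n in inputs))           # True
print(all(factors_are_prime(factorize, n) for n in inputs))       # True
print(all(product_holds(lazy_factorize, n) for n in inputs))      # True
print(all(factors_are_prime(lazy_factorize, n) for n in inputs))  # False (first fails at n = 4)
```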

QuickCheck was written in Haskell, but the idea spread. It influenced:

QuickCheck for C (2003): testing C code with property-based approaches
Hypothesis (2014): the Python library created by David R. MacIver and later co-maintained by Zac Hatfield-Dodds, which became the gold standard for PBT in Python
Proptest (2016): Rust’s property-based testing framework
fast-check (2017): Nicolas Dubien’s TypeScript library, now the dominant PBT tool in the JS/TS ecosystem with 30M+ weekly npm downloads
And dozens more across every major language

The core idea has remained remarkably stable for 26 years.

The Core Mechanics

Property-based testing frameworks share a common set of concepts:

Generators (Arbitraries)

A generator describes a distribution of values. Instead of writing individual test inputs, you describe what inputs are valid:

# "Give me lists of integers"
st.lists(st.integers())

# "Give me non-empty lists of unique positive integers under 100"
st.lists(st.integers(min_value=1, max_value=100), min_size=1, unique=True)

// "Give me arrays of natural numbers"
fc.array(fc.nat());

// "Give me strings with a maximum length of 50"
fc.string({ maxLength: 50 });
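Conceptually, a generator is little more than a function from a source of randomness to a value, and bigger generators are built by composing smaller ones. A toy version in plain Python makes that concrete. This is a sketch of the idea only, not how Hypothesis or fast-check are implemented, and `integers`, `lists`, and `pairs` are names invented for it.

```python
import random


def integers(lo, hi):
    """Generator of integers in [lo, hi]."""
    return lambda rng: rng.randint(lo, hi)


def lists(element, max_size=10):
    """Generator of lists whose items come from another generator."""
    def gen(rng):
        return [element(rng) for _ in range(rng.randint(0, max_size))]
    return gen


def pairs(first, second):
    """Generator of 2-tuples built from two generators."""
    return lambda rng: (first(rng), second(rng))


rng = random.Random(42)
small_ints = integers(0, 9)
points = lists(pairs(small_ints, small_ints), max_size=3)

for _ in range(3):
    print(points(rng))  # a list of up to three (x, y) tuples
```

Real libraries add a lot on top of this (shrinking, size control, biasing toward edge cases), but the compositional shape is the same.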

Shrinking

When a property fails, the framework doesn’t just show you the failing input — it finds the simplest failing input. This is called shrinking, and it’s one of the most powerful features of property-based testing.

If a test fails on [1048576, 2147483647, -524288], shrinking will typically reduce it to something like [0] or [1]: the smallest input that still breaks your property. This makes debugging dramatically easier.
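To get a feel for what shrinking does, here is a deliberately naive greedy shrinker for lists of integers. Real frameworks use more sophisticated strategies; this is only a sketch, and `shrink_candidates`, `shrink`, and the "every element is below 1000" property are assumptions made up for the example.

```python
def shrink_candidates(xs):
    """Yield inputs simpler than xs: shorter lists first, then values closer to zero."""
    for i in range(len(xs)):
        yield xs[:i] + xs[i + 1:]                 # drop one element
    for i, x in enumerate(xs):
        if x == 0:
            continue
        half = x // 2 if x > 0 else -((-x) // 2)  # halve toward zero
        step = x - 1 if x > 0 else x + 1          # one step toward zero
        for smaller in (0, half, step):
            yield xs[:i] + [smaller] + xs[i + 1:]


def shrink(xs, fails):
    """Greedily move to a simpler candidate for as long as one still fails."""
    if not fails(xs):
        raise ValueError("shrinking must start from a failing input")
    improved = True
    while improved:
        improved = False
        for candidate in shrink_candidates(xs):
            if fails(candidate):
                xs = candidate
                improved = True
                break
    return xs


def fails(xs):
    """Hypothetical property under test: every element is below 1000."""
    return any(x >= 1000 for x in xs)


print(shrink([1048576, 2147483647, -524288], fails))  # [1000]
```

The noisy three-element input collapses to a single value sitting exactly on the boundary, which tells you far more about the bug than the original input did.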

Properties

A property is a predicate that should be true for all generated inputs:

@given(st.lists(st.integers()))
def test_sort_is_sorted(lst):
    result = my_sort(lst.copy())
    assert result == sorted(result)  # The output is always sorted

fc.assert(
  fc.property(fc.array(fc.nat()), (arr) => {
    const sorted = bubbleSort(arr);
    return sorted.every((n, i) => i === 0 || sorted[i - 1] <= n);
  }),
);

Advantages Over Example-Based Testing

1. Finding Edge Cases You Wouldn’t Think Of

Property-based testing generates inputs that a human might never consider: empty collections, extremely large numbers, boundary values, and combinations of conditions.

For example, the agent from Anthropic’s Agentic PBT project discovered a bug in AWS Lambda Powertools’ slice_dictionary function:

Property: slicing and reconstructing a dictionary should return the original
Bug: the function returned the first chunk repeatedly instead of all chunks
Status: Patch merged in PR #7246

2. Breaking the “Cycle of Self-Deception”

When both code and tests are generated by the same LLM, they can share the same logical errors. A classic example from the PGS (Property-Generated Solver) paper:

Code generator: "factorize(12) = [2, 3]" (missing multiplicity)
Test generator (same LLM): "assert factorize(12) == [2, 3]" (same error)

The property “product of output factors must equal original input” is simpler to state than an exact oracle, and it breaks this cycle. PBT properties are abstract enough that they’re less likely to share the same bias as the code.
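A small sketch shows the cycle and how the property breaks it. `factorize_buggy` is a hypothetical implementation written for this example to reproduce the "missing multiplicity" mistake; it is not taken from the PGS paper.

```python
import math


def factorize_buggy(n):
    """Returns the distinct prime factors only, so multiplicity is lost."""
    factors = []
    d = 2
    while d * d <= n:
        if n % d == 0:
            factors.append(d)
            while n % d == 0:
                n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors


# An example-based test that repeats the same misunderstanding passes happily.
assert factorize_buggy(12) == [2, 3]


def first_counterexample(fn, limit=1000):
    """Check the product property on 2..limit-1; return the first n that violates it."""
    for n in range(2, limit):
        if math.prod(fn(n)) != n:
            return n
    return None


print(first_counterexample(factorize_buggy))  # 4, because factorize_buggy(4) == [2]
```

The property never needed to know that the right answer for 12 is `[2, 2, 3]`; it only needed to know that the factors multiply back to the input.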

3. More Coverage with Less Code

A single property can replace dozens of example-based test cases. The Hypothesis library runs 100 examples by default (configurable to thousands), each with different random inputs. You get coverage that would require hundreds of hand-written tests.

4. Better Regression Resistance

When a property fails, shrinking gives you a minimal counterexample, which usually points at the root cause of the violation rather than at one arbitrary input that happened to break. Frameworks such as Hypothesis also store failing examples and replay them on later runs, so a bug that was found once keeps being checked.

5. Works Well with AI-Generated Code

As AI agents write more code, the testing challenge changes. Agents are good at writing code but can share blind spots with the code they generate. PBT properties are higher-level and more abstract, making them harder for an agent to get wrong in the same way the code is wrong.

The Limitations

Property-based testing isn’t a silver bullet. It has real limitations:

It doesn’t replace example-based testing. The two approaches complement each other. Use example-based tests for specific known behaviors and edge cases. Use property-based tests for general invariants and to discover unknown edge cases.

Some properties are hard to identify. For complex business logic, it’s not always clear what properties should hold. Example-based testing can be easier when you have very specific requirements.

Tests can be slower. Running 1000 random examples takes longer than running 5 specific cases. This is usually acceptable (most PBT tests still run in milliseconds), but it’s worth being aware of.

The “oracle problem” for AI agents. Testing non-deterministic systems (like AI agents) requires thinking about properties differently. You test what the agent must never do (no destructive actions without confirmation, no hallucinated URLs) rather than what it should output exactly.

Property-Based Testing with AI Agents

This is where things get genuinely exciting. Property-based testing and AI agents have a natural synergy that’s only beginning to be explored.

LLMs Are Good at Properties, Bad at Oracles

LLMs excel at reading code and documentation to infer what should be true about a function. They struggle with predicting exact outputs for complex inputs. PBT is perfect for this: you verify invariants (properties) rather than exact values (oracles).

Anthropic demonstrated this with their Agentic PBT project. An agent built on Claude Code autonomously:

Crawled through entire codebases, reading type annotations, docstrings, and comments
Inferred function-specific properties
Wrote Hypothesis property tests and executed them
Reflected on test outputs to confirm real bugs vs. false alarms

The results across 100 Python packages (933 modules): 984 bug reports were generated, 56% were valid bugs, and 32% were valid bugs worth reporting to maintainers. The top-scoring bugs had an 86% validity rate. Real bugs were found in NumPy, SciPy, Pandas, HuggingFace Tokenizers, and AWS Lambda Powertools.

Testing AI Agent Behavior

Property-based testing is becoming a first-class approach for testing AI agents themselves. Instead of testing exact outputs (which are non-deterministic), you test behavioral invariants:

Safety properties: No destructive actions without confirmation, no hallucinated URLs, no PII leakage
Budget constraints: Max cost per request, max reasoning steps, no infinite loops
Routing properties: Tax agent only answers tax questions, proper tool selection
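As a rough illustration of the first two categories, the sketch below throws random plans at a stand-in agent runtime and checks a safety invariant and a budget invariant on every executed trace. Everything here is hypothetical: `execute`, the action names, and `MAX_STEPS` are assumptions for the example, and a real agent would be a non-deterministic LLM-driven system, not a ten-line function.

```python
import random

ACTIONS = ["read", "write", "confirm", "delete"]
MAX_STEPS = 8  # hypothetical step budget


def execute(plan):
    """Stand-in for an agent runtime: runs a proposed plan under two guardrails."""
    executed = []
    for action in plan[:MAX_STEPS]:  # budget: consider at most MAX_STEPS actions
        if action == "delete" and (not executed or executed[-1] != "confirm"):
            continue  # safety: skip deletes that were not just confirmed
        executed.append(action)
    return executed


def deletes_are_confirmed(trace):
    """Safety property: every delete is immediately preceded by a confirm."""
    return all(i > 0 and trace[i - 1] == "confirm"
               for i, action in enumerate(trace) if action == "delete")


def within_budget(trace):
    """Budget property: the executed trace never exceeds the step limit."""
    return len(trace) <= MAX_STEPS


rng = random.Random(7)
for _ in range(1000):
    plan = [rng.choice(ACTIONS) for _ in range(rng.randint(0, 20))]
    trace = execute(plan)
    assert deletes_are_confirmed(trace), plan
    assert within_budget(trace), plan
print("both invariants held for 1000 random plans")
```

Neither check says anything about what the agent should output; both only constrain what it must never do.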

PostHog’s approach distinguishes between deterministic evaluators (specific tool calls, forbidden keywords) and non-deterministic evaluators (LLM-as-Judge for subjective criteria). This is essentially property-based testing adapted for non-deterministic systems.

The Shrinking Advantage for Agents

When a property fails, Hypothesis’s shrinking produces the minimal failing case. For LLM agents, this means the simplest scenario that causes a property violation is surfaced, making it much easier for both humans and agents to understand and fix the root cause.

PGS: Property-Generated Code Generation

The PGS (Property-Generated Solver) framework uses two collaborative LLM agents: a Generator that creates code from specifications, and a Tester that defines properties, generates PBT inputs, and validates the code. PGS achieved 23.1%-37.3% relative improvement in pass@1 over traditional TDD methods on HumanEval, MBPP, and LiveCodeBench benchmarks.

The key insight: PBT avoids the “cycle of self-deception” where both the code generator and test generator share the same misunderstanding. By decoupling Generator and Tester agents and using simple, abstract properties, PGS produces significantly better code.

Python Example: Hypothesis

Hypothesis is the gold standard for property-based testing in Python. Created by David R. MacIver and now co-maintained with Zac Hatfield-Dodds, it’s one of the most downloaded Python testing libraries, with over 38,000 projects depending on it.

The Basics

from hypothesis import given, strategies as st

def my_sort(lst):
    """A selection sort implementation."""
    result = []
    while lst:
        smallest = min(lst)
        result.append(smallest)
        lst.remove(smallest)
    return result

@given(st.lists(st.integers()))
def test_sort_is_always_sorted(lst):
    result = my_sort(lst.copy())
    assert result == sorted(result)

The @given decorator tells Hypothesis to generate random inputs using the specified strategies. By default, it runs the test 100 times with different inputs. If any run fails, Hypothesis reports the failure and shrinks the input to the minimal counterexample.

Testing a Data Structure

from hypothesis import given, strategies as st, settings
from collections import OrderedDict

class SimpleLRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        if key not in self.cache:
            return -1
        self.cache.move_to_end(key)
        return self.cache[key]

    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        elif len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)
        self.cache[key] = value

@given(
    capacity=st.integers(min_value=1, max_value=5),
    operations=st.lists(
        st.one_of(
            st.tuples(st.just("put"), st.integers(0, 9), st.integers()),
            st.tuples(st.just("get"), st.integers(0, 9)),
        ),
        min_size=1,
        max_size=30,
    ),
)
@settings(max_examples=200)
def test_lru_cache_eviction(capacity, operations):
    cache = SimpleLRUCache(capacity)
    expected = OrderedDict()

    for op in operations:
        if op[0] == "put":
            _, key, value = op
            cache.put(key, value)
            expected[key] = value
            expected.move_to_end(key)  # a put also counts as a use, as in SimpleLRUCache
            while len(expected) > capacity:
                expected.popitem(last=False)
        else:
            _, key = op
            if key in expected:
                assert cache.get(key) == expected[key]
                expected.move_to_end(key)
            else:
                assert cache.get(key) == -1

This test generates random sequences of put and get operations and verifies that the LRU cache behaves correctly — evicting the least recently used items when full, returning correct values for cached keys, and returning -1 for missing keys.

Key Hypothesis Features

Shrinking — When a test fails, Hypothesis finds the minimal failing input:

Falsifying example: test_sort_is_always_sorted(lst=[0, 0])

Example database — Hypothesis stores failing examples and replays them on subsequent test runs, ensuring that known failures are always caught.

Ghostwriter — Hypothesis can auto-generate property tests from function signatures:

hypothesis write --roundtrip json.dumps json.loads
hypothesis write --idempotent sorted
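To see what a roundtrip property actually checks, here is a hand-written version that uses only the standard library. It is not the ghostwriter’s output; `random_json` and `roundtrips` are names invented for this sketch, and floats are left out of the generator to keep it short.

```python
import json
import random
import string


def random_json(rng, depth=0):
    """Build a random JSON-compatible value: None, bool, int, str, list or dict."""
    kinds = ["none", "bool", "int", "str"]
    if depth < 3:
        kinds += ["list", "dict"]  # only nest a few levels deep
    kind = rng.choice(kinds)
    if kind == "none":
        return None
    if kind == "bool":
        return rng.random() < 0.5
    if kind == "int":
        return rng.randint(-10**6, 10**6)
    if kind == "str":
        return "".join(rng.choice(string.printable) for _ in range(rng.randint(0, 8)))
    if kind == "list":
        return [random_json(rng, depth + 1) for _ in range(rng.randint(0, 4))]
    return {f"k{i}": random_json(rng, depth + 1) for i in range(rng.randint(0, 4))}


def roundtrips(value):
    """Roundtrip property: decoding the encoded value gives back an equal value."""
    return json.loads(json.dumps(value)) == value


rng = random.Random(0)
print(all(roundtrips(random_json(rng)) for _ in range(500)))  # True

# The property only holds for JSON-compatible inputs: tuples come back as lists.
print(roundtrips((1, 2)))  # False
```

The last line is the interesting one: a roundtrip property is only as good as the generator feeding it, so the input domain has to match what the encoder can represent.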

Stateful testing — Hypothesis supports model-based testing via RuleBasedStateMachine for testing complex state machines.

TypeScript Example: fast-check

fast-check is the dominant property-based testing library in the TypeScript/JavaScript ecosystem, with over 30 million weekly npm downloads and 5,000+ GitHub stars.

The Basics

import fc from "fast-check";

function bubbleSort(arr: number[]): number[] {
  const result = [...arr];
  for (let i = 0; i < result.length; i++) {
    for (let j = 0; j < result.length - 1 - i; j++) {
      if (result[j] > result[j + 1]) {
        [result[j], result[j + 1]] = [result[j + 1], result[j]];
      }
    }
  }
  return result;
}

fc.assert(
  fc.property(fc.array(fc.nat()), (arr) => {
    const sorted = bubbleSort(arr);
    return sorted.every((n, i) => i === 0 || sorted[i - 1] <= n);
  }),
  { numRuns: 500 },
);

fc.property declares what to test (arbitraries + predicate), and fc.assert runs it. The property asserts that a sorted array is always in ascending order.

Testing String Properties

function isPalindrome(s: string): boolean {
  const cleaned = s.toLowerCase().replace(/[^a-z0-9]/g, "");
  return cleaned === cleaned.split("").reverse().join("");
}

// Property: concatenating a string with its reverse always produces a palindrome
fc.assert(
  fc.property(fc.string({ maxLength: 50 }), (s) => {
    const combined = s + s.split("").reverse().join("");
    return isPalindrome(combined);
  }),
);

// Property: s + reverse(s) is symmetric, checked character by character without isPalindrome
fc.assert(
  fc.property(fc.string({ minLength: 1, maxLength: 50 }), (s) => {
    const combined = s + s.split("").reverse().join("");
    for (let i = 0; i < combined.length; i++) {
      if (combined[i] !== combined[combined.length - 1 - i]) {
        return false;
      }
    }
    return true;
  }),
);

Testing JSON Roundtrip

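// Property: for JSON-compatible values, stringify -> parse -> stringify is stable.
// fc.anything() does not work here: it also generates undefined, NaN, and other
// values JSON cannot represent, so the roundtrip is false for them.
fc.assert(
  fc.property(fc.jsonValue(), (value) => {
    const json = JSON.stringify(value);
    const parsed = JSON.parse(json);
    // Compare serialized forms: JSON.stringify(-0) is "0", so deep equality could fail on -0.
    return JSON.stringify(parsed) === json;
  }),
  { numRuns: 200 },
);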

Key fast-check Features

Shrinking — When a test fails, fast-check automatically shrinks the input to the smallest counterexample.

Preconditions — Use fc.pre() to filter out invalid inputs:

fc.assert(
  fc.property(fc.nat(), fc.string(), (maxLength, label) => {
    fc.pre(label.length <= maxLength);
    return crop(label, maxLength) === label;
  }),
);

Async properties — For testing async functions:

// With an async property, fc.assert returns a promise that must be awaited (or returned)
await fc.assert(
  fc.asyncProperty(fc.string(), fc.string(), async (a, b) => {
    const result = await concatAsync(a, b);
    expect(result).toBe(a + b);
  }),
);

Reproducibility — Use seed to reproduce specific test runs:

fc.assert(property, { seed: 1234 });

How to Get Started

If you’re new to property-based testing, here’s a practical path:

Start with one function. Pick a pure function with clear invariants — a sorting function, a parser, a string transformer. Write one property for it.
Learn how shrinking works. When a test fails, pay attention to how the framework shrinks the input. This is where the real debugging power lives.
Combine with example-based tests. Use @example (Hypothesis) or specific test cases alongside properties. The best test suites use both approaches.
Add properties for data structures. LRU caches, queues, and other data structures have rich invariants that are perfect for PBT.
Try the ghostwriter. Hypothesis’s hypothesis write command can generate property tests from function signatures automatically — a great way to get started.

The Future: PBT in the Age of AI

Property-based testing is uniquely positioned for the coming era of AI-assisted development. Here’s why:

AI agents can discover properties. Anthropic’s research shows that LLMs are particularly good at identifying properties from code context — reading type annotations, docstrings, and function names to infer what should be true. This means agents can auto-generate property tests from existing codebases.

Properties are simpler for agents to verify. Predicting exact outputs for complex functions is hard for LLMs. But stating “this function should never return negative values” or “this output should always be a valid JSON” is much easier. PBT aligns with what LLMs do well.

PBT catches agent-generated bugs. When AI agents write code, they can introduce subtle bugs that example-based tests miss. Property-based tests catch these by testing invariants rather than specific cases. The PGS framework’s 23-37% improvement over TDD shows this empirically.

Non-deterministic systems need properties. AI agents produce non-deterministic outputs. Traditional equality assertions don’t work. But properties — “the agent should never delete files without confirmation” — work perfectly. PBT is the natural testing paradigm for AI systems.

The trajectory is clear: as AI agents write more code and make more decisions, property-based testing will become essential for maintaining confidence in software quality. The frameworks are mature, the techniques are proven, and the synergy with AI is just beginning to be explored.

The next time you write a test, ask yourself: “What must always be true about this function’s behavior?” That’s the property. And once you state it, you may find you need far fewer hand-written examples to trust your code.