In 2000, Koen Claessen and John Hughes published a paper that would quietly change how a subset of the programming world thinks about testing. Their solution to a simple problem, how to test functions thoroughly without writing hundreds of test cases, was elegant enough to endure for over two decades.
It was called QuickCheck. And it introduced the world to property-based testing.
Today, property-based testing is no longer a niche Haskell curiosity. It’s a mature practice available in virtually every language, and it’s about to become even more important as AI agents flood our codebases with generated code.
The Problem with Example-Based Testing
Most developers write tests like this:
def test_sort():
    assert sort([3, 1, 2]) == [1, 2, 3]
    assert sort([]) == []
    assert sort([5]) == [5]

def test_parse_date():
    assert parse_date("2024-01-15") == date(2024, 1, 15)
    assert parse_date("1999-12-31") == date(1999, 12, 31)
This is example-based testing: you write specific inputs and their expected outputs. It’s intuitive, readable, and perfectly fine for many cases. But it has a fundamental limitation: you can only test what you think of.
Consider this real bug that was discovered through property-based testing in NumPy’s wald function (which samples from a Wald/inverse Gaussian distribution):
Property: samples should always be positive
Bug: catastrophic cancellation caused negative samples
Status: Patch merged in PR #29609
Would you have thought to test for negative values from a distribution that’s supposed to be positive? Probably not — you’d test the obvious cases. But property-based testing doesn’t rely on your intuition. It tests the property itself: “this function should never produce output X.”
The Philosophy: Test What Should Be True
Property-based testing flips the approach. Instead of specifying inputs and expected outputs, you specify properties — invariants that should hold for all valid inputs. The testing framework then generates hundreds (or thousands) of random test cases and verifies that your properties hold.
The key insight is this: properties are simpler and more general than test cases.
Consider testing a sorting function:
Example-based: "sort([3,1,2]) should return [1,2,3]"
Property-based: "the output should always be sorted and should contain the same elements"
The property captures an infinite number of test cases in a single statement. It doesn’t care about the specific values — it cares about the relationship between input and output.
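To make the two sorting properties concrete, here is a minimal sketch in plain Python. It uses only the standard library's random module rather than a PBT framework, and the helper names (random_int_list, is_sorted, check_sort_properties) are made up for this illustration.

```python
import random
from collections import Counter


def random_int_list(rng: random.Random, max_size: int = 20) -> list[int]:
    """Hypothetical helper: a list of random length filled with random integers."""
    size = rng.randint(0, max_size)
    return [rng.randint(-1000, 1000) for _ in range(size)]


def is_sorted(xs: list[int]) -> bool:
    """True when every element is <= its successor."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def check_sort_properties(sort_fn, runs: int = 200, seed: int = 0) -> None:
    """Check both properties from the text against `runs` random inputs."""
    rng = random.Random(seed)
    for _ in range(runs):
        xs = random_int_list(rng)
        out = sort_fn(list(xs))  # pass a copy so in-place sorts cannot alter xs
        assert is_sorted(out), f"not sorted for input {xs}"
        assert Counter(out) == Counter(xs), f"elements changed for input {xs}"


# The built-in sorted() satisfies both properties on every generated input.
check_sort_properties(sorted)
```

Neither assertion mentions a specific expected list. Both describe a relationship between the input and the output, which is why one function can check any number of inputs.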
This is where the philosophy diverges from traditional testing. Example-based testing asks: “Does this function produce the right answer for these cases?” Property-based testing asks: “What must always be true about this function’s behavior?”
A Quick History
QuickCheck was inspired by a simple observation: many functions have properties that are easier to state than to compute the expected output for.
For example, consider a prime factorization function factorize(n). To write an example-based test, you need to know the correct answer:
-- Example-based: requires knowing the answer
assert factorize 12 == [2, 2, 3]
assert factorize 15 == [3, 5]
But the property is simpler to state than to verify:
-- Property-based: no need to know the factors in advance
prop_factorize :: Positive Integer -> Bool
prop_factorize (Positive n) = product (factorize n) == n
The product of the factors must equal the original number. That’s the property. You don’t need to know the factors in advance — you just need to know what’s true about them.
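The same property can be sketched in Python with only the standard library. The trial-division factorize below is an assumed implementation written for this illustration, and the check adds a second property (every factor is prime) so that a trivial answer such as [n] would not pass.

```python
import math
import random


def factorize(n: int) -> list[int]:
    """Prime factors of n >= 1 in ascending order, found by trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors


def is_prime(n: int) -> bool:
    """Simple primality check, adequate for the small inputs used here."""
    return n >= 2 and all(n % d for d in range(2, math.isqrt(n) + 1))


rng = random.Random(0)
for _ in range(500):
    n = rng.randint(1, 10_000)
    factors = factorize(n)
    assert math.prod(factors) == n            # the factors multiply back to n
    assert all(is_prime(f) for f in factors)  # and each one is prime
```

No expected factor list appears anywhere in the loop. The check only needs multiplication and a primality test, both of which are easier to trust than a hand-computed answer.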
QuickCheck was written in Haskell, but the idea spread. It influenced:
- QuickCheck for C (2003) — testing C code with property-based approaches
- Hypothesis (2014) — David R. MacIver and Zac Hatfield-Dodds created the Python library that would become the gold standard for PBT in Python
- fast-check (2017) — Nicolas Dubien’s TypeScript library, now the dominant PBT tool in the JS/TS ecosystem with 30M+ weekly npm downloads
- Proptest (2016) — Rust’s property-based testing framework
- And dozens more across every major language
The core idea has remained remarkably stable for 26 years.
The Core Mechanics
Property-based testing frameworks share a common set of concepts:
Generators (Arbitraries)
A generator describes a distribution of values. Instead of writing individual test inputs, you describe what inputs are valid:
# "Give me lists of integers"
st.lists(st.integers())
# "Give me non-empty lists of unique positive integers under 100"
st.lists(st.integers(min_value=1, max_value=100), min_size=1, unique=True)
// "Give me arrays of natural numbers"
fc.array(fc.nat());
// "Give me strings with a maximum length of 50"
fc.string({ maxLength: 50 });
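One way to picture what a generator is: a function from a source of randomness to a value, which can be composed into larger generators. The sketch below is a simplified stdlib-only model, not how Hypothesis or fast-check are implemented. The names integers and lists only mirror the strategies above for readability, and real libraries typically also tie shrinking behavior to their generators.

```python
import random
from typing import Any, Callable

# In this sketch a generator is just a function from a random source to a value.
Gen = Callable[[random.Random], Any]


def integers(min_value: int, max_value: int) -> Gen:
    """Generator of integers in the inclusive range [min_value, max_value]."""
    return lambda rng: rng.randint(min_value, max_value)


def lists(elements: Gen, min_size: int = 0, max_size: int = 10) -> Gen:
    """Generator of lists whose items come from the `elements` generator."""
    def generate(rng: random.Random) -> list:
        size = rng.randint(min_size, max_size)
        return [elements(rng) for _ in range(size)]
    return generate


# "Give me non-empty lists of at most 5 integers between 1 and 100"
small_lists = lists(integers(1, 100), min_size=1, max_size=5)
rng = random.Random(42)
sample = small_lists(rng)
assert 1 <= len(sample) <= 5
assert all(1 <= x <= 100 for x in sample)
```

Because lists takes another generator as an argument, nested shapes such as lists(lists(integers(0, 9))) come for free.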
Shrinking
When a property fails, the framework doesn’t just show you the failing input — it finds the simplest failing input. This is called shrinking, and it’s one of the most powerful features of property-based testing.
If a test fails on [1048576, 2147483647, -524288], shrinking will try to reduce it to something like [0] or [1]: a much smaller input that still breaks your property. This makes debugging dramatically easier.
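A greedy shrinker can be sketched in a few lines of plain Python. This is a simplified illustration, not the algorithm Hypothesis or fast-check actually use; the function names and the has_negative failure condition are invented for the example.

```python
from typing import Callable, Iterator


def shrink_candidates(xs: list[int]) -> Iterator[list[int]]:
    """Yield inputs simpler than xs: shorter lists first, then smaller numbers."""
    for i in range(len(xs)):
        yield xs[:i] + xs[i + 1:]                      # drop one element
    for i, x in enumerate(xs):
        if x != 0:
            sign = 1 if x > 0 else -1
            yield xs[:i] + [0] + xs[i + 1:]            # jump straight to zero
            yield xs[:i] + [abs(x) // 2 * sign] + xs[i + 1:]  # halve toward zero
            yield xs[:i] + [x - sign] + xs[i + 1:]     # step one closer to zero


def shrink(xs: list[int], fails: Callable[[list[int]], bool]) -> list[int]:
    """Greedily move to any simpler input that still fails, until none does."""
    assert fails(xs), "shrinking starts from a failing input"
    improved = True
    while improved:
        improved = False
        for candidate in shrink_candidates(xs):
            if fails(candidate):
                xs = candidate
                improved = True
                break
    return xs


def has_negative(xs: list[int]) -> bool:
    """Hypothetical failure condition: the code under test breaks on negatives."""
    return any(x < 0 for x in xs)


minimal = shrink([1048576, 2147483647, -524288], has_negative)
print(minimal)  # [-1]
```

Every candidate is either shorter than the current input or has a smaller total magnitude, so the loop always terminates. Here the three-element input collapses to the single value that matters.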
Properties
A property is a predicate that should be true for all generated inputs:
@given(st.lists(st.integers()))
def test_sort_is_sorted(lst):
    result = my_sort(lst.copy())
    assert result == sorted(result)  # The output is always sorted

fc.assert(
  fc.property(fc.array(fc.nat()), (arr) => {
    const sorted = bubbleSort(arr);
    return sorted.every((n, i) => i === 0 || sorted[i - 1] <= n);
  }),
);
Advantages Over Example-Based Testing
1. Finding Edge Cases You Wouldn’t Think Of
Property-based testing generates inputs that a human might never consider: empty collections, extremely large numbers, boundary values, and combinations of conditions.
For example, Anthropic’s agentic PBT agent discovered a bug in AWS Lambda Powertools’ slice_dictionary function:
Property: slicing and reconstructing a dictionary should return the original
Bug: the function returned the first chunk repeatedly instead of all chunks
Status: Patch merged in PR #7246
2. Breaking the “Cycle of Self-Deception”
When both code and tests are generated by the same LLM, they can share the same logical errors. A classic example from the PGS (Property-Generated Solver) paper:
Code generator: "factorize(12) = [2, 3]" (missing multiplicity)
Test generator (same LLM): "assert factorize(12) == [2, 3]" (same error)
The property “product of output factors must equal original input” is simpler to define than predicting exact oracles, and it breaks this cycle. PBT properties are abstract enough that they’re less likely to share the same bias as the code.
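The scenario can be reproduced in a small stdlib-only sketch. The function factorize_buggy is a deliberately wrong implementation written for this illustration (it is not taken from the PGS paper), and the check scans small inputs in order instead of sampling randomly so that the result is deterministic.

```python
import math


def factorize_buggy(n: int) -> list[int]:
    """Deliberately wrong: returns each prime factor once, dropping multiplicity."""
    factors = []
    d = 2
    while d * d <= n:
        if n % d == 0:
            factors.append(d)
            while n % d == 0:
                n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors


def example_test_passes() -> bool:
    """Example-based test written with the same misunderstanding as the code."""
    return factorize_buggy(12) == [2, 3]


def first_property_violation(limit: int = 100):
    """Smallest n in [1, limit] where the factors do not multiply back to n."""
    for n in range(1, limit + 1):
        if math.prod(factorize_buggy(n)) != n:
            return n
    return None


print(example_test_passes())       # True: the shared error goes unnoticed
print(first_property_violation())  # 4: factorize_buggy(4) == [2], and 2 != 4
```

The example-based test encodes the wrong oracle and passes. The product property never states an expected factor list, so it has no way to repeat the mistake.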
3. More Coverage with Less Code
A single property can replace dozens of example-based test cases. The Hypothesis library runs 100 examples by default (configurable to thousands), each with different random inputs. You get coverage that would require hundreds of hand-written tests.
4. Better Regression Resistance
When a property fails, shrinking gives you a minimal counterexample. This is often more informative than a specific failing case because it reveals the root cause of the violation, not just a particular input that broke.
5. Works Well with AI-Generated Code
As AI agents write more code, the testing challenge changes. Agents are good at writing code but can share blind spots with the code they generate. PBT properties are higher-level and more abstract, making them harder for an agent to get wrong in the same way the code is wrong.
The Limitations
Property-based testing isn’t a silver bullet. It has real limitations:
It doesn’t replace example-based testing. The two approaches complement each other. Use example-based tests for specific known behaviors and edge cases. Use property-based tests for general invariants and to discover unknown edge cases.
Some properties are hard to identify. For complex business logic, it’s not always clear what properties should hold. Example-based testing can be easier when you have very specific requirements.
Tests can be slower. Running 1000 random examples takes longer than running 5 specific cases. This is usually acceptable (most PBT tests still run in milliseconds), but it’s worth being aware of.
The “oracle problem” for AI agents. Testing non-deterministic systems (like AI agents) requires thinking about properties differently. You test what the agent must never do (no destructive actions without confirmation, no hallucinated URLs) rather than what it should output exactly.
Property-Based Testing with AI Agents
This is where things get genuinely exciting. Property-based testing and AI agents have a natural synergy that’s only beginning to be explored.
LLMs Are Good at Properties, Bad at Oracles
LLMs excel at reading code and documentation to infer what should be true about a function. They struggle with predicting exact outputs for complex inputs. PBT is perfect for this: you verify invariants (properties) rather than exact values (oracles).
Anthropic demonstrated this with their Agentic PBT project. An agent built on Claude Code autonomously:
- Crawled through entire codebases, reading type annotations, docstrings, and comments
- Inferred function-specific properties
- Wrote Hypothesis property tests and executed them
- Reflected on test outputs to confirm real bugs vs. false alarms
The results across 100 Python packages (933 modules): 984 bug reports were generated, 56% were valid bugs, and 32% were valid bugs worth reporting to maintainers. The top-scoring bugs had an 86% validity rate. Real bugs were found in NumPy, SciPy, Pandas, HuggingFace Tokenizers, and AWS Lambda Powertools.
Testing AI Agent Behavior
Property-based testing is becoming a first-class approach for testing AI agents themselves. Instead of testing exact outputs (which are non-deterministic), you test behavioral invariants:
- Safety properties: No destructive actions without confirmation, no hallucinated URLs, no PII leakage
- Budget constraints: Max cost per request, max reasoning steps, no infinite loops
- Routing properties: Tax agent only answers tax questions, proper tool selection
PostHog’s approach distinguishes between deterministic evaluators (specific tool calls, forbidden keywords) and non-deterministic evaluators (LLM-as-Judge for subjective criteria). This is essentially property-based testing adapted for non-deterministic systems.
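As a rough sketch of what deterministic behavioral invariants might look like in code, the example below checks a safety property and a budget property over randomly generated action traces. Everything here is hypothetical: the Step record, the tool names, and the guard wrapper are assumptions made for illustration and do not come from PostHog's tooling or any agent framework.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One hypothetical agent action: a tool name and its cost."""
    tool: str
    cost: float = 0.0


DESTRUCTIVE_TOOLS = {"delete_file", "drop_table"}  # assumed names, for illustration


def confirms_before_destroying(trace: list[Step]) -> bool:
    """Safety: every destructive step is immediately preceded by a confirmation."""
    for i, step in enumerate(trace):
        if step.tool in DESTRUCTIVE_TOOLS:
            if i == 0 or trace[i - 1].tool != "ask_confirmation":
                return False
    return True


def within_budget(trace: list[Step], max_steps: int, max_cost: float) -> bool:
    """Budget: bounded number of steps and bounded total cost."""
    return len(trace) <= max_steps and sum(s.cost for s in trace) <= max_cost


def guard(requested: list[Step]) -> list[Step]:
    """Hypothetical wrapper that inserts a confirmation before each destructive step."""
    out = []
    for step in requested:
        if step.tool in DESTRUCTIVE_TOOLS:
            out.append(Step("ask_confirmation"))
        out.append(step)
    return out


rng = random.Random(0)
tools = ["read_file", "search", "delete_file", "drop_table"]
for _ in range(300):
    requested = [Step(rng.choice(tools), cost=0.01) for _ in range(rng.randint(0, 8))]
    trace = guard(requested)
    assert confirms_before_destroying(trace)
    assert within_budget(trace, max_steps=16, max_cost=1.0)
```

The assertions never compare against an exact expected trace. They only state what must hold for every trace, which is what makes this style workable when outputs vary from run to run.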
The Shrinking Advantage for Agents
When a property fails, Hypothesis’s shrinking produces the minimal failing case. For LLM agents, this means the simplest scenario that causes a property violation is surfaced, making it much easier for both humans and agents to understand and fix the root cause.
PGS: Property-Generated Code Generation
The PGS (Property-Generated Solver) framework uses two collaborative LLM agents: a Generator that creates code from specifications, and a Tester that defines properties, generates PBT inputs, and validates the code. PGS achieved 23.1%-37.3% relative improvement in pass@1 over traditional TDD methods on HumanEval, MBPP, and LiveCodeBench benchmarks.
The key insight: PBT avoids the “cycle of self-deception” where both the code generator and test generator share the same misunderstanding. By decoupling Generator and Tester agents and using simple, abstract properties, PGS produces significantly better code.
Python Example: Hypothesis
Hypothesis is the gold standard for property-based testing in Python. Created by David R. MacIver and Zac Hatfield-Dodds, it’s one of the most downloaded Python testing libraries with over 38,000 projects depending on it.
The Basics
from hypothesis import given, strategies as st

def my_sort(lst):
    """A selection sort implementation."""
    result = []
    while lst:
        smallest = min(lst)
        result.append(smallest)
        lst.remove(smallest)
    return result

@given(st.lists(st.integers()))
def test_sort_is_always_sorted(lst):
    result = my_sort(lst.copy())
    assert result == sorted(result)
The @given decorator tells Hypothesis to generate random inputs using the specified strategies. By default, it runs the test 100 times with different inputs. If any run fails, Hypothesis reports the failure and shrinks the input to the minimal counterexample.
Testing a Data Structure
from hypothesis import given, strategies as st, settings
from collections import OrderedDict

class SimpleLRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        if key not in self.cache:
            return -1
        self.cache.move_to_end(key)
        return self.cache[key]

    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        elif len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)
        self.cache[key] = value

@given(
    capacity=st.integers(min_value=1, max_value=5),
    operations=st.lists(
        st.one_of(
            st.tuples(st.just("put"), st.integers(0, 9), st.integers()),
            st.tuples(st.just("get"), st.integers(0, 9)),
        ),
        min_size=1,
        max_size=30,
    ),
)
@settings(max_examples=200)
def test_lru_cache_eviction(capacity, operations):
    cache = SimpleLRUCache(capacity)
    expected = OrderedDict()  # reference model of what the cache should hold
    for op in operations:
        if op[0] == "put":
            _, key, value = op
            cache.put(key, value)
            expected[key] = value
            # Updating an existing key also counts as a use, so the model must
            # move it to the end just as the cache does.
            expected.move_to_end(key)
            while len(expected) > capacity:
                expected.popitem(last=False)
        else:
            _, key = op
            if key in expected:
                assert cache.get(key) == expected[key]
                expected.move_to_end(key)
            else:
                assert cache.get(key) == -1
This test generates random sequences of put and get operations and verifies that the LRU cache behaves correctly — evicting the least recently used items when full, returning correct values for cached keys, and returning -1 for missing keys.
Key Hypothesis Features
Shrinking — When a test fails, Hypothesis finds the minimal failing input:
Falsifying example: test_sort_is_always_sorted(lst=[1, 0])
Example database — Hypothesis stores failing examples and replays them on subsequent test runs, ensuring that known failures are always caught.
Ghostwriter — Hypothesis can auto-generate property tests from function signatures:
hypothesis write --roundtrip json.dumps json.loads
hypothesis write --idempotent sorted
Stateful testing — Hypothesis supports model-based testing via RuleBasedStateMachine for testing complex state machines.
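The round-trip property targeted by the first Ghostwriter command above (json.dumps followed by json.loads) can also be written out by hand. The sketch below uses only the standard library in place of Hypothesis strategies; random_json_value is a hypothetical generator that deliberately avoids values JSON cannot round-trip, such as NaN and tuples.

```python
import json
import random
import string


def random_json_value(rng: random.Random, depth: int = 0):
    """Hypothetical generator for JSON-compatible Python values (no NaN, no tuples)."""
    leaves = [
        lambda: None,
        lambda: rng.choice([True, False]),
        lambda: rng.randint(-10**6, 10**6),
        lambda: rng.uniform(-1e6, 1e6),
        lambda: "".join(rng.choice(string.printable) for _ in range(rng.randint(0, 8))),
    ]
    # Occasionally build a container, with nesting limited to three levels.
    if depth < 3 and rng.random() < 0.4:
        size = rng.randint(0, 4)
        if rng.random() < 0.5:
            return [random_json_value(rng, depth + 1) for _ in range(size)]
        return {f"k{i}": random_json_value(rng, depth + 1) for i in range(size)}
    return rng.choice(leaves)()


rng = random.Random(0)
for _ in range(500):
    value = random_json_value(rng)
    # Round-trip property: decoding the encoded value gives back an equal value.
    assert json.loads(json.dumps(value)) == value
```

Restricting the generator matters as much as the assertion: with tuples or non-string dictionary keys in the mix, the same property would fail for reasons unrelated to any bug.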
TypeScript Example: fast-check
fast-check is the dominant property-based testing library in the TypeScript/JavaScript ecosystem, with over 30 million weekly npm downloads and 5,000+ GitHub stars.
The Basics
import fc from "fast-check";

function bubbleSort(arr: number[]): number[] {
  const result = [...arr];
  for (let i = 0; i < result.length; i++) {
    for (let j = 0; j < result.length - 1 - i; j++) {
      if (result[j] > result[j + 1]) {
        [result[j], result[j + 1]] = [result[j + 1], result[j]];
      }
    }
  }
  return result;
}

fc.assert(
  fc.property(fc.array(fc.nat()), (arr) => {
    const sorted = bubbleSort(arr);
    return sorted.every((n, i) => i === 0 || sorted[i - 1] <= n);
  }),
  { numRuns: 500 },
);
fc.property declares what to test (arbitraries + predicate), and fc.assert runs it. The property asserts that a sorted array is always in ascending order.
Testing String Properties
function isPalindrome(s: string): boolean {
  const cleaned = s.toLowerCase().replace(/[^a-z0-9]/g, "");
  return cleaned === cleaned.split("").reverse().join("");
}

// Property: concatenating a string with its reverse always produces a palindrome
fc.assert(
  fc.property(fc.string({ maxLength: 50 }), (s) => {
    const combined = s + s.split("").reverse().join("");
    return isPalindrome(combined);
  }),
);

// Property: a palindrome reads the same forwards and backwards
fc.assert(
  fc.property(fc.string({ minLength: 1, maxLength: 50 }), (s) => {
    const combined = s + s.split("").reverse().join("");
    for (let i = 0; i < combined.length; i++) {
      if (combined[i] !== combined[combined.length - 1 - i]) {
        return false;
      }
    }
    return true;
  }),
);
Testing JSON Roundtrip
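// Property: parsing a stringified JSON value gives back an equivalent value.
// fc.anything() also produces undefined, NaN, Map, and other values that JSON
// cannot represent, so the generator is restricted to fc.jsonValue().
fc.assert(
  fc.property(fc.jsonValue(), (value) => {
    const json = JSON.stringify(value);
    const parsed = JSON.parse(json);
    // Compare serialized forms: -0 serializes as "0", so a strict deep-equality
    // check against the original value could fail on it.
    return JSON.stringify(parsed) === json;
  }),
  { numRuns: 200 },
);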
Key fast-check Features
Shrinking — When a test fails, fast-check automatically shrinks the input to the smallest counterexample.
Preconditions — Use fc.pre() to filter out invalid inputs:
fc.assert(
  fc.property(fc.nat(), fc.string(), (maxLength, label) => {
    fc.pre(label.length <= maxLength);
    return crop(label, maxLength) === label;
  }),
);
Async properties — For testing async functions:
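// With an async property, fc.assert returns a promise, so it must be awaited
// (or returned from the test); otherwise failures can go unnoticed.
await fc.assert(
  fc.asyncProperty(fc.string(), fc.string(), async (a, b) => {
    const result = await concatAsync(a, b);
    expect(result).toBe(a + b);
  }),
);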
Reproducibility — Use seed to reproduce specific test runs:
fc.assert(property, { seed: 1234 });
How to Get Started
If you’re new to property-based testing, here’s a practical path:
- Start with one function. Pick a pure function with clear invariants: a sorting function, a parser, a string transformer. Write one property for it.
- Learn the shrinking. When a test fails, pay attention to how the framework shrinks the input. This is where the real debugging power lives.
- Combine with example-based tests. Use @example (Hypothesis) or specific test cases alongside properties. The best test suites use both approaches.
- Add properties for data structures. LRU caches, queues, and other data structures have rich invariants that are perfect for PBT.
- Try the ghostwriter. Hypothesis’s hypothesis write command can generate property tests from function signatures automatically, which is a great way to get started.
The Future: PBT in the Age of AI
Property-based testing is uniquely positioned for the coming era of AI-assisted development. Here’s why:
AI agents can discover properties. Anthropic’s research shows that LLMs are particularly good at identifying properties from code context — reading type annotations, docstrings, and function names to infer what should be true. This means agents can auto-generate property tests from existing codebases.
Properties are simpler for agents to verify. Predicting exact outputs for complex functions is hard for LLMs. But stating “this function should never return negative values” or “this output should always be a valid JSON” is much easier. PBT aligns with what LLMs do well.
PBT catches agent-generated bugs. When AI agents write code, they can introduce subtle bugs that example-based tests miss. Property-based tests catch these by testing invariants rather than specific cases. The PGS framework’s 23-37% improvement over TDD shows this empirically.
Non-deterministic systems need properties. AI agents produce non-deterministic outputs. Traditional equality assertions don’t work. But properties — “the agent should never delete files without confirmation” — work perfectly. PBT is the natural testing paradigm for AI systems.
The trajectory is clear: as AI agents write more code and make more decisions, property-based testing will become essential for maintaining confidence in software quality. The frameworks are mature, the techniques are proven, and the synergy with AI is just beginning to be explored.
The next time you write a test, ask yourself: “What must always be true about this function’s behavior?” That’s the property. And once you can state it, a handful of hand-picked examples will rarely feel like enough on their own.