Property based testing for unit testers with PropEr - Part 1

This tutorial is brought to you by ErlangCamp 2011 (click here) - Boston, August 12th and 13th - It’s gonna be totally sweet!

Main contributors: Torben Hoffmann, Raghav Karol, Eric Merritt

The purpose of the short document is to help people who are familiar
with unit testing understand how property based testing (PBT) differs,
but also where the thinking is the same.

This document focusses on the PBT tool
PropEr for Erlang since that is
what I am familiar with, but the general principles applies to all PBT
tools regardless of which language they are written in.

The approach taken here is that we hear from people who are used to
working with unit testing regarding how they think when designing
their tests and how a concrete test might look.

These descriptions are then “converted” into the way it works with
PBT, with a clear focus on what stays the same and what is different.

Testing philosophies

A quote from Martin Logan (@martinjlogan):

For me unit testing is about contracts. I think about the same things

I think about when I write statements like {ok, Resp} =

Mod:Func(Args). Unit testing and writing specs are very close for me.

Hypothetically speaking lets say a function should return return {ok,

string()} | {error, term()} for all given input parameters then my

unit tests should be able to show that for a representative set of

input parameters that those contracts are honored. The art comes in

thinking about what that set is.

The trap in writing all your own tests can often be that we think
about the set in terms of what we coded for and not what may indeed be
asked of our function. As the code is tried in further exploratory
testing and in production new input parameter sets for which the given
function does not meet the stated contract are discovered and added to
the test case once a fix has been put into place.

This is a very good description of what the ground rules for unit
testing are:

  • Checking that contracts are obeyed.
  • Creating a representative set of input parameters.

The former is very much part of PBT - each property you write will
check a contract, so that thinking is the same.

xUnit vs PBT

Unit testing has become popular for software testing with the advent
of xUnit tools like jUnit for Java. xUnit like tools typically
provide a testing framework with the following functionality

  • test fixture setup
  • test case execution
  • test fixture teardown
  • test suite management
  • test status reporting and management

While xUnit tools provide a lot of functionality to execute and manage
test cases and suites, reporting results there is no focus on test
case execution step, while this is the main focus area of
property-based testing (PBT).

Consider the following function specification

sort(list::integer()) ---> list::integer() | error  

A verbal specification of this function is,

For all input lists of integers, the sort function returns a sorted

list of integers.

For any other kind of argument the function returns the atom error.

The specification above may be a requirement of how the function
should behave or even how the function does behave. This distinction
is important; the former is the requirement for the function, the
latter is the actual API. Both should be the same and that is what our
testing should confirm. Test cases for this function might look like

assertEqual(sort([5,4,3,2,1]), [1,2,3,4,5])  
assertEqual(sort([1,2,3,4,5]), [1,2,3,4,5])  
assertEqual(sort([]         ), []         )  
assertEqual(sort([-1,0, 1]  ), [-1, 0, 1] )  

How many tests cases should we write to be convinced that the actual
behaviour of the function is the same as its specification? Clearly,
it is impossible to write tests cases for all possible input values,
here all lists of integers, the art of testing is finding individual
input values that are representative of a large part of the input
space. We hope that the test cases are exhaustive to cover the
specification. xUnit tools offer no support for this and this is where
PBT and PBT Tools like PropEr and QuickCheck come in.

PBT introduces testing with a large set of random input values and
verifying that the specification holds for each input value
selected. Functions used to generate input values, generators, are
specified using rules and can be simply composed together to construct
complicated values. So, a property based test for the function above
may look like:

FOREACH({I, J, InputList},  {nat(), nat(), integer_list()},  
    SUCHTHAT(I < J andalso J < length(InputList),  
    SortedList = sort(InputList)  
    length(SortedList) == length(InputList)  
    lists:get(SortedList, I) =< lists:get(SortedList, J))  

The property above works as follows

  • Generate a random list of integers InputList and two natural numbers
    I, J, such that I < J < size of InputList
  • Check that size of sorted and input lists is the same.
  • Check that element with smaller index I is less than or equal to
    element with larger index J in SortedList.

Notice in the property above, we specify property. Verification of
the property based on random input values will be done by the property
based tool, therefore we can generated a large number of tests cases
with random input values and have a higher level of confidence that
the function when using unit tests alone.

But it does not stop at generation of input parameters. If you have
more complex tests where you have to generate a series of events and
keep track of some state then your PBT tool will generate random
sequences of events which corresponds to legal sequences of events and
test that your system behaves correctly for all sequences.

So when you have written a property with associated generators you
have in fact created something that can create numerous test cases -
you just have to tell your PBT tool how many test cases you want to
check the property on.

Shrinking the bar

At this point you might still have the feeling that introducing the
notion of some sort of generators to your unit testing tool of choice
would bring you on par with PBT tools, but wait there is more to

When a PBT tool creates a test case that fails there is real chance
that it has created a long test case or some big input parameters -
trying to debug that is very much like receiving a humongous log from
a system in the field and try to figure out what cause the system to

Enter shrinking…

When a test case fails the PBT tool will try to shrink the failing
test case down to the essentials by stripping out input elements or
events that does not cause the failure. In most cases this results in
a very short counterexample that clearly states which events and
inputs are required to break a property.

As we go through some concrete examples later the effects of shrinking
will be shown.

Shrinking makes it a lot easier to debug problems and is as key to the
strength of PBT as the generators.

Converting a unit test

We will now take a look at one possible way of translating a unit
test into a PBT setting.

The example comes from Eric Merritt and is about the add/2 function in
the ec_dictionary instance ec_gb_trees.

The add function has the following spec:

-spec add(ec_dictionary:key(), ec_dictionary:value(), Object::dictionary()) ->  

and it is supposed to do the obvious: add the key and value pair to
the dictionary and return a new dictionary.

Eric states his basic expectations as follows:

  1. I can put arbitrary terms into the dictionary as keys
  2. I can put arbitrary terms into the dictionary as values
  3. When I put a value in the dictionary by a key, I can retrieve that same value
  4. When I put a different value in the dictionary by key it does not change other key value pairs.
  5. When I update a value the new value in available by the new key
  6. When a value does not exist a not found exception is created

The first two expectations regarding being able to use arbritrary
terms as keys and values is a job for generators.

The latter four are prime candidates for properties and we will create
one for each of them.


key() -> any().  
value() -> any().  

For PropEr this approach has the drawback that creation and shrinking
becomes rather time consuming, so it might be better to narrow to
something like this:

key() -> union([integer(),atom()]).  
value() -> union([integer(),atom(),binary(),boolean(),string()]).  

What is best depends on the situation and intended usage.

Now, being able to generate keys and values is not enough. You also
have to tell PropEr how to create a dictionary and in this case we
will use a symbolic generator (detail to be explained later).

sym_dict() ->  
sym_dict(0) ->  
sym_dict(N) ->  
                  {1, {'$call',ec_dictionary,remove,[key(),sym_dict(N-1)]}},  
                  {2, {'$call',ec_dictionary,add,[value(),value(),sym_dict(N-1)]}}  

sym_dict/0 uses the ?SIZED macro to control the size of the
generated dictionary. PropEr will start out with small numbers and
gradually raise it.

sym_dict/1 is building a dictionary by randomly adding key/value
pairs and removing keys. Eventually the base case is reached which
will create an empty dictionary.

The ?LAZY macro is used to defer the calculation of the
sym_dict(N-1) until they are needed and frequency/1 is used
to ensure that twice as many adds compared to removes are done. This
should give rather more interesting dictionaries in the long run, if
not one can alter the frequencies accondingly.

But does it really work?

That is a good question and one that should always be asked when
looking at genetors. Fortunately there is a way to see what a
generator produces provided that the generator functions are exported.

Hint: in most cases it will not hurt to throw in a
-compile(export_all). in the module used to specify the
properties. And here we actually have a sub-hint: specify the
properties in a separate file to avoid peeking inside the
implementation! Base the test on the published API as this is what the
users of the code will be restricted to.

When the test module has been loaded you can test the generators by
starting up an Erlang shell (this example uses the erlware_commons
code so get yourself a clone to play with):

$ erl -pz ebin -pz test  
1> proper_gen:pick(ec_dictionary_proper:key()).  
2> proper_gen:pick(ec_dictionary_proper:key()).  
3> proper_gen:pick(ec_dictionary_proper:key()).  
4> proper_gen:pick(ec_dictionary_proper:key()).  
5> proper_gen:pick(ec_dictionary_proper:key()).  
{ok,'36\207_là ´?\nc'}  
6> proper_gen:pick(ec_dictionary_proper:value()).  
7> proper_gen:pick(ec_dictionary_proper:value()).  
8> proper_gen:pick(ec_dictionary_proper:value()).  
9> proper_gen:pick(ec_dictionary_proper:value()).  
10> proper_gen:pick(ec_dictionary_proper:value()).  
11> proper_gen:pick(ec_dictionary_proper:value()).  
12> proper_gen:pick(ec_dictionary_proper:value()).  
13> proper_gen:pick(ec_dictionary_proper:value()).  
14> proper_gen:pick(ec_dictionary_proper:value()).  
15> proper_gen:pick(ec_dictionary_proper:sym_dict()).  
16> proper_gen:pick(ec_dictionary_proper:sym_dict()).  

That does not look too bad, so we will continue with that for now.

Properties of add/2

The first expectation Eric had about how the dictionary works was that
if a key had been stored it could be retrieved.

One way of expressing this could be with this property:

prop_get_after_add_returns_correct_value() ->  
    ?FORALL({Dict,K,V}, {sym_dict(),key(),value()},  
             try ec_dictionary:get(K,ec_dictionary:add(K,V,Dict)) of  
                    V ->  
                    _ ->  
                   _:_ ->  

This property reads that for all dictionaries get/2 using a key
from a key/value pair just inserted using the add/3 function
will return that value. If that is not the case the property will
evaluate to false.

Running the property is done using proper:quickcheck/1:

OK: Passed 100 test(s).  

This was as expected, but at this point we will take a little detour
and introduce a mistake in the ec_gb_trees implementation and see
how that works.