Verification (testing)

How confident are you that your code is doing what you think it’s doing?

When I began learning computational neuroscience and writing code for models and simulations, my programming experience amounted to no more than one course and one programming assignment using FORTRAN 77, taken as part of my undergraduate Physics degree, plus a short course on image processing and a little ad hoc data processing using Matlab during my MSc in Medical Physics.

When I wrote code (using Hoc and NMODL, the languages of the NEURON simulator) I tested it as I wrote it by running it and comparing the output to what I expected to see, to the results of previous simulations (one of my first projects was porting a model from GENESIS to NEURON, so I could quantitatively compare the output of the two simulators), and, later on, to experimental data.

If I might generalize from my own experience, from talking to colleagues, and from supervising student projects, this kind of informal, the-results-look-about-right testing is very widespread, especially for the physics- and biology-trained among us without any formal computer-science education.

Automated testing

Perhaps the biggest flaw of my informal testing regime was that none of the tests were automated. So, for example, I would test that the height of an excitatory post-synaptic potential was what I calculated it should be by plotting the membrane potential and reading the value off the graph. Each time I changed any code that could have affected this, I had to repeat this manual procedure, which of course discouraged any thought of making large-scale reorganisations of the code.

The advantages of creating automated tests are therefore:

  • it gives you confidence that your code is doing what you think it is doing;
  • it frees you to make wide-ranging changes to the code (for the purposes of optimization or making the code more robust, for example) without worrying that you will break something: if you do break something, your tests will tell you immediately and you can undo the change.

Of course, writing tests requires an initial time investment, but if you already perform manual, informal testing then this time will be paid back the first time you run the automated suite of tests. Even if you did no testing at all previously, the loss of fear of changing code will lead to more rapid progress.

There is one gotcha to be aware of with automated tests: a risk of false confidence, which can lead to a lack of critical thinking about your results (“if the tests pass, it must be alright”). It is unlikely that your test suite will test every possible path through your code with all possible inputs, so you should always be aware of the fallibility of your test procedures, and should expect to add more tests as the project develops.

Terminology

Professional software engineering, where automated testing has been in wide use for a long time, has developed a rich vocabulary surrounding testing. For scientists, it is useful to understand at least the following three ideas:

unit test
a test of a single element of a program (e.g. a single function) in isolation.
system test
a test of an entire program, or an entire sub-system.
regression test
a test for a specific bug that was found in a previous version of the program; the test is to ensure that the bug does not reappear once fixed. Regression tests may be unit tests or system tests.

Generally, you should write unit tests for every function and class in your program; think of these as being like the simple controls in an experiment. There will usually be multiple unit tests per function, to cover the full range of expected inputs. For each argument, you should test at least the following (a sketch of such tests is given after the list):

  • one or more typical values;
  • an invalid value, to check that the function raises an exception or returns an error code;
  • an empty value (where the argument is a list, array, vector or other container datatype).
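
For example, unit tests along these lines for a hypothetical function mean_firing_rate() (the function and its behaviour are invented here for illustration, and are not part of any particular library) might look like:

import unittest

def mean_firing_rate(spike_times, t_stop):
    """Hypothetical function under test: mean firing rate in spikes per second."""
    if t_stop <= 0:
        raise ValueError("t_stop must be positive")
    return len(spike_times) / float(t_stop)

class MeanFiringRateTests(unittest.TestCase):

    def test_typical_values(self):
        # a typical input: four spikes in two seconds gives 2 Hz
        self.assertEqual(mean_firing_rate([0.5, 1.0, 1.5, 2.0], t_stop=2.0), 2.0)

    def test_invalid_value(self):
        # an invalid argument should raise an exception
        self.assertRaises(ValueError, mean_firing_rate, [0.5, 1.0], 0.0)

    def test_empty_container(self):
        # an empty list of spike times should be handled sensibly
        self.assertEqual(mean_firing_rate([], t_stop=2.0), 0.0)

if __name__ == "__main__":
    unittest.main()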

It is not always easy to isolate an individual function or class. One option is to create “mock” or “fake” objects or functions for the function under test to interact with. For example, if testing a function that uses numbers from a random number generator, you can create a fake RNG that always produces a known sequence of values, and pass that as the argument instead of the real RNG.
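
A minimal sketch of this idea (the class and the function under test are invented for illustration):

class FakeRNG(object):
    """A stand-in for a real random number generator: always returns
    values from a fixed, known sequence."""

    def __init__(self, values):
        self.values = values
        self.index = 0

    def next(self, n=1):
        # return the next n values from the fixed sequence, cycling round if necessary
        result = [self.values[(self.index + i) % len(self.values)] for i in range(n)]
        self.index = (self.index + n) % len(self.values)
        return result[0] if n == 1 else result

def jittered_spike_times(base_times, rng):
    """Hypothetical function under test: adds 'random' jitter to spike times."""
    return [t + rng.next() for t in base_times]

# with the fake RNG, the expected output is known exactly
assert jittered_spike_times([10.0, 20.0], FakeRNG([0.5])) == [10.5, 20.5]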

Even if all the unit tests pass, it may be that the units do not work properly together, and therefore you should also write a number of system tests to exercise the entire program, or an entire sub-system.

On finding a bug in your program, don’t leap immediately to try to fix it. Rather:

  • find a simple example which demonstrates the bug;
  • turn that example into a regression test (unit or system, as appropriate);
  • check that the test fails with the current version of the code;
  • now fix the bug;
  • check that the regression test passes;
  • check that all the other tests still pass.

Test frameworks

For a typical computational neuroscience project, you will probably end up with several hundred tests. You should run these, and check they all pass, before every commit to your version control system. This means that running all the tests should be a one-line command.

If you are familiar with the make utility, you could write a Makefile, so that:

$ make test

runs all your tests, and tells you at the end which ones have failed.
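
A minimal Makefile for this might look like the following (assuming the tests live in a test/ directory and are run with nosetests; note that the recipe line must be indented with a tab character):

# run the whole test suite
test:
	nosetests test/

.PHONY: test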

Most programming languages provide frameworks to make writing and running tests easier, for example:

Python
unittest, nose, doctest
Matlab
xUnit, mlUnit, MUnit, doctest
C++
CppUnit, and many more
Java
JUnit, and many more

Also see http://en.wikipedia.org/wiki/List_of_unit_testing_frameworks
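
To give a flavour of how lightweight these frameworks can be, here is a sketch using Python’s doctest module, which turns the examples embedded in a docstring into tests (the function is invented for illustration):

def spike_count(spike_times, t_start, t_stop):
    """Return the number of spikes with t_start <= t <= t_stop.

    >>> spike_count([1.0, 2.0, 3.0, 4.0], 1.5, 3.5)
    2
    >>> spike_count([], 0.0, 10.0)
    0
    """
    return len([t for t in spike_times if t_start <= t <= t_stop])

if __name__ == "__main__":
    import doctest
    doctest.testmod()

Running the file directly (or with python -m doctest) re-executes the examples in the docstring and reports any whose output has changed.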

Examples

Here is an example of some unit tests for PyNN, a Python API for neuronal network simulation. PyNN includes a module, random, which wraps a variety of random number generators, giving them all the same interface so that they can be used more or less interchangeably.

import pyNN.random as random
import numpy
import unittest

class SimpleTests(unittest.TestCase):
    """Simple tests on a single RNG function."""

    def setUp(self):
        # pretend we are running a single (non-MPI) process
        random.mpi_rank = 0
        random.num_processes = 1
        self.rnglist = [random.NumpyRNG(seed=987)]
        if random.have_gsl:
            self.rnglist.append(random.GSLRNG(seed=654))

    def testNextOne(self):
        """Calling next() with no arguments or with n=1 should return a float."""
        for rng in self.rnglist:
            assert isinstance(rng.next(), float)
            assert isinstance(rng.next(1), float)
            assert isinstance(rng.next(n=1), float)

    def testNextTwoPlus(self):
        """Calling next(n=m) where m > 1 should return an array."""
        for rng in self.rnglist:
            self.assertEqual(len(rng.next(5)), 5)
            self.assertEqual(len(rng.next(n=5)), 5)

    def testNonPositiveN(self):
        """Calling next(m) where m < 0 should raise a ValueError."""
        for rng in self.rnglist:
            self.assertRaises(ValueError, rng.next, -1)

    def testNZero(self):
        """Calling next(0) should return an empty array."""
        for rng in self.rnglist:
            self.assertEqual(len(rng.next(0)), 0)

We define a subclass of TestCase which contains several methods, each of which tests the next() method of a random number generator object. The setUp() method is called before each test method - it provides a place to put code that is common to all tests. Note that each test contains one or more assertions about the expected behaviour of next().

Now, here is an example of a regression test (since it tests a particular bug that was found and fixed, to ensure the bug doesn’t reappear later) that is also a system test (as it tests many interacting parts of the code, not a single code unit).

from nose.tools import assert_equal, assert_almost_equal
import pyNN.neuron

def test_ticket168():
    """
    Error setting firing rate of `SpikeSourcePoisson` after `reset()` in NEURON
    http://neuralensemble.org/trac/PyNN/ticket/168
    """
    pynn = pyNN.neuron
    pynn.setup()
    cell = pynn.Population(1, cellclass=pynn.SpikeSourcePoisson, label="cell")
    cell[0].rate = 12
    pynn.run(10.)
    pynn.reset()
    cell[0].rate = 12
    pynn.run(10.)
    assert_almost_equal(pynn.get_current_time(), 10.0, places=11)
    assert_equal(cell[0]._cell.interval, 1000.0/12.0)

For this test we used the nose framework rather than the unittest framework used in the previous example. This test runs a short simulation, and then, as with unittest, we make assertions about what values we expect certain variables to have.

Test coverage measurement

How do you know when you’ve written enough tests? Tools are available for many languages that will track which lines of code are executed when running the test suite (a list of tools is available at http://en.wikipedia.org/wiki/Code_coverage).

For example, the following command runs the test suite for the PyNN package and produces a report in HTML, highlighting which lines of the code have not been covered by the tests:

$ nosetests --with-coverage --cover-erase --cover-package=pyNN --cover-html

Test-driven development

As the name suggests, test-driven development (TDD) involves writing tests before writing the code to be tested.

It involves an iterative style of development, repeatedly following this sequence (a sketch of one cycle is given after the list):

  • write a test for a feature you plan to add
  • the test should fail, since you haven’t implemented the feature yet
  • now implement just that feature, and no more
  • the test should now pass
  • now run the entire test suite and check all the tests still pass
  • clean up code as necessary (use the test suite to check you don’t break anything)
  • repeat for the next feature
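
As a sketch of a single cycle (the function and test here are invented for illustration): suppose the next feature is a function isi() returning inter-spike intervals. The test is written first and fails, because isi() does not yet exist; then just enough code is written to make it pass.

# step 1: write the test first -- running it now fails, since isi() is not yet implemented
def test_isi():
    assert isi([1.0, 3.0, 6.0]) == [2.0, 3.0]
    assert isi([5.0]) == []

# step 2: implement just enough to make the test pass, and no more
def isi(spike_times):
    """Return the intervals between successive spike times."""
    return [t1 - t0 for t0, t1 in zip(spike_times[:-1], spike_times[1:])]

# step 3: re-run test_isi() and the rest of the test suite (e.g. with nosetests)
# before moving on to the next feature.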

The advantages of TDD are:

  • makes you think about requirements before writing code
  • makes your code easier to test (written to be testable)
  • ensures that tests for every feature will be written
  • reduces the temptation to over-generalize (“as soon as the test passes, stop coding”)

Of course, writing more tests takes time, but there is some evidence that the total development time is reduced by TDD due to its encouragement of better design and much less time spent debugging.