Best practices for reproducible research

Use version control

See Version control.

Test your code

See Testing.

Prioritize code robustness

By “robustness” here, I mean insensitivity to the precise details of the code and environment. If you try to make one part of the code run faster, does the rest of the code have to be changed as well? If you change to a different Linux distribution, or upgrade your operating system, does the code still run, and do you get the same results?

Strategies for more robust code are widely employed in professional software development and have been described in many places [e.g., S. McConnell, Code Complete, 2nd ed., Microsoft Press, 2004]. They include:

  • reducing the tightness of the coupling between different parts of the code through a modular design and well-defined interfaces;
  • building on established, widely used, well-tested and easy-to-install libraries; and
  • writing test suites.
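
As a minimal sketch of the first and last points together (all function names here are hypothetical, invented for illustration): the calling code below depends only on an interface, "a function that returns cumulative sums", so a faster implementation can be swapped in without touching the caller, and a small check confirms that the two implementations agree.

```python
from itertools import accumulate
from typing import Callable, List, Sequence

# The interface: any function mapping a sequence of floats to its cumulative sums.
CumulativeSum = Callable[[Sequence[float]], List[float]]


def cumulative_sum_loop(values: Sequence[float]) -> List[float]:
    """Straightforward reference implementation."""
    total = 0.0
    out = []
    for v in values:
        total += v
        out.append(total)
    return out


def cumulative_sum_fast(values: Sequence[float]) -> List[float]:
    """Drop-in replacement built on the standard library."""
    return list(accumulate(values))


def normalized_cumulative(
    values: Sequence[float],
    cumsum: CumulativeSum = cumulative_sum_loop,
) -> List[float]:
    """Caller code: relies on the interface, not on a particular implementation."""
    sums = cumsum(values)
    return [s / sums[-1] for s in sums]


data = [1.0, 2.0, 3.0, 4.0]
reference = normalized_cumulative(data, cumulative_sum_loop)
optimized = normalized_cumulative(data, cumulative_sum_fast)

# A tiny regression test: changing the implementation must not change the result.
assert reference == optimized
```

Because the caller never looks inside the cumulative-sum function, optimizing that one part does not force changes anywhere else, and the assertion catches any change in results.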

In particular, you should design your code to be easily understood by others (where “others” can also include yourself in six months’ time):

  • write comments explaining anything slightly complex, or why a particular choice was made;
  • write clear documentation;
  • don’t be too clever.

On the last point, Brian Kernighan said:

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.

How far should you go in trying to make your code better? Making code more robust costs time and effort, which might not be worth spending on scientific code with a limited number of users. At the same time, investing this time up front can save a lot of time later. I don’t have any good guidelines for finding the right balance, other than to step back from the project from time to time, think about how much effort you’ve expended, and decide whether you’re putting too much, or not enough, effort into making your code robust and reproducible given your goals (e.g., publication).

Maintain a consistent, repeatable computing environment

If you’re moving a computation to a new system, it should be simple and straightforward to set up an environment identical (or nearly so) to that of the original machine. This suggests using either a package-management system - for example, the Debian, Red Hat, or MacPorts systems - or a configuration-management tool (such as Puppet, Chef, or Fabric).

The former provide prebuilt, well-tested software packages in central repositories, which avoids the vagaries of downloading and compiling packages from diverse sources and makes deployment faster. The latter let you define recipes for machine setup, which is particularly important when the software you need isn’t available in a suitable form from a package-management system, whether because it hasn’t been packaged (for example, because it isn’t widely used or is developed locally for internal use) or because the packaged version is too old.
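
Whichever tool you use, it helps to be able to check that two machines really do match. The following is a minimal sketch of one way to do that, not a substitute for the tools above: it records the interpreter, operating system, and package versions, and compares two such records. The function names and the package list are assumptions made for illustration.

```python
import platform
from importlib import metadata


def environment_snapshot(packages):
    """Collect interpreter, OS, and package versions as a flat dictionary."""
    snapshot = {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "machine": platform.machine(),
    }
    for name in packages:
        try:
            snapshot["package:" + name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot["package:" + name] = None  # not installed here
    return snapshot


def mismatches(expected, actual):
    """Return {key: (expected value, actual value)} for every differing entry."""
    keys = sorted(set(expected) | set(actual))
    return {
        key: (expected.get(key), actual.get(key))
        for key in keys
        if expected.get(key) != actual.get(key)
    }


# Hypothetical package list; use whatever your computation depends on.
original = environment_snapshot(["numpy", "scipy"])

# On the new machine you would load the stored snapshot and compare it
# with a fresh one. Here the two are taken on the same machine.
current = environment_snapshot(["numpy", "scipy"])
print(mismatches(original, current))
```

In practice the original snapshot would be saved alongside the results (for example as JSON) and compared on the new system before rerunning the computation.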

Separate code from configuration

It’s good practice to cleanly separate code from the configuration and parameters. There are several reasons for this:

  • the configuration and parameters are changed more frequently than the code, so different recording tools are most appropriate for each - for example, using a VCS for the code and a database for the parameters;
  • the parameters are directly modified or controlled by the end user, but the code might not be - this means that parameters can be controlled through different interfaces (configuration files or graphical interfaces);
  • separating the parameters ensures that changes are made in a single place, rather than spread throughout a code base; and
  • the parameters are useful for searching, filtering, and comparing computations made with the same model or analysis method but with different parameters, and storing the parameters separately makes such efforts easier.
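
A minimal sketch of this separation, assuming a JSON configuration and a toy computation (the parameter names, the `Parameters` class, and the `run` function are all hypothetical): the code only ever sees a parameter object, and the exact parameters used are stored next to the result so that runs can be searched and compared later.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class Parameters:
    """Hypothetical parameters for an illustrative computation."""
    n_steps: int = 100
    step_size: float = 0.1
    seed: int = 42


def load_parameters(text: str) -> Parameters:
    """Parse parameters from JSON text, rejecting unknown keys."""
    raw = json.loads(text)
    known = {f.name for f in fields(Parameters)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return Parameters(**raw)


def run(params: Parameters) -> float:
    """The code side: no parameter values are hard-wired in here."""
    return params.n_steps * params.step_size


# In practice this text would be read from a file that the user edits.
config_text = '{"n_steps": 200, "step_size": 0.05}'

params = load_parameters(config_text)
result = run(params)

# Keep the full parameter set (including defaults) together with the result.
record = {"parameters": asdict(params), "result": result}
print(json.dumps(record, sort_keys=True))
```

Rejecting unknown keys is a design choice rather than a requirement: it catches misspelled parameter names that would otherwise silently fall back to the defaults.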

Separate model definition from simulation experiment description

Another example of a modular design approach, one that is specific to modelling and simulation-based science, is to completely separate the code defining the model from the code implementing the experiment you’re doing with the model (what variables to record, how long to simulate the model, what stimulation is used, etc.). The field of systems biology even has two separate languages for these two tasks: SBML for describing the model and SED-ML for describing the simulation experiment.
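
The same idea can be sketched in a general-purpose language. The example below is a toy illustration under assumed names (`DecayModel`, `Experiment`, and `simulate` are hypothetical, and this is not how SBML or SED-ML work internally): the model class knows only its equation, the experiment object knows only how long to run, what stimulus to apply, and how often to record, and a separate function combines the two.

```python
from dataclasses import dataclass


@dataclass
class DecayModel:
    """Model definition only: dx/dt = -rate * x + stimulus."""
    rate: float = 1.0

    def derivative(self, x: float, stimulus: float) -> float:
        return -self.rate * x + stimulus


@dataclass
class Experiment:
    """Experiment description only: what to run and what to record."""
    duration: float = 1.0
    dt: float = 0.01
    stimulus: float = 0.0
    initial_value: float = 1.0
    record_every: int = 10  # record one sample every this many steps


def simulate(model, experiment):
    """Forward Euler integration; returns a list of (time, value) samples."""
    n_steps = round(experiment.duration / experiment.dt)
    x = experiment.initial_value
    trace = [(0.0, x)]
    for step in range(1, n_steps + 1):
        x += experiment.dt * model.derivative(x, experiment.stimulus)
        if step % experiment.record_every == 0:
            trace.append((step * experiment.dt, x))
    return trace


# One model, two experiments: the model code is untouched between runs.
model = DecayModel(rate=2.0)
baseline = simulate(model, Experiment())
stimulated = simulate(model, Experiment(stimulus=1.0))
```

With this split, running a new experiment means writing a new `Experiment` description, and changing the model means editing only the model class.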

Share your code

(more on this in the next update to these notes)