Tutorial: Managing complex workflows in neural simulation and data analysis
Andrew Davison
Unité de Neurosciences, Information et Complexité (UNIC), Centre National de la Recherche Scientifique, 91198 Gif-sur-Yvette, France. http://andrewdavison.info
Michael Denker and Sonja Grün
Institute of Neuroscience and Medicine (INM-6), Computational and Systems Neuroscience & Institute for Advanced Simulation (IAS-6), Theoretical Neuroscience, Jülich Research Centre and JARA, 52425 Jülich, Germany. http://www.csn.fz-juelich.de
July 13, 2013
In the attempt to uncover the mechanisms that govern brain processing at the level of interacting neurons, neuroscientists have taken on the challenge of tackling the sheer complexity exhibited by neuronal networks. Neuronal simulations are nowadays performed with a high degree of detail, covering large, heterogeneous networks. Experimentally, electrophysiologists can record simultaneously from hundreds of neurons during complex behavioral paradigms. The data streams from both simulation and experiment are therefore highly complex, and their analysis is most revealing when it takes their intricate correlative structure into account.
These increases in data volume, parameter complexity, and analysis difficulty burden researchers in several respects. Experimenters, who have always had to cope with numerous sources of variability, need efficient ways to record the wealth of details about their experiments (“metadata”) in a concise, machine-readable form. Moreover, collaborations spanning simulation, experiment, and analysis call for common data interfaces, shared software tool chains, and clearly defined terminologies. Most importantly, however, neuroscientists find it increasingly difficult to reliably repeat previous work, one of the cornerstones of the scientific method. At first sight this ought to be easy in simulation or data analysis, since computers are deterministic and do not suffer from the problems of biological variability. In practice, however, the complexity of the subject matter and the long time scales of typical projects demand a level of disciplined book-keeping and detailed organization that is hard to sustain.
The failure to routinely achieve replicability in computational neuroscience (and probably in computational science in general; see Donoho et al., 2009) has important implications both for the credibility of the field and for its rate of progress, since reuse of existing code is fundamental to good software engineering. For individual researchers, as the example of ModelDB has shown, sharing reliable code enhances reputation and leads to increased impact.
In this tutorial we will identify the reasons for the difficulties often encountered in organizing and handling data, sharing work in a collaboration, and carrying out complex computational experiments and data analyses in a manageable, reproducible way. We will also discuss best practices for making our work more reliable and more easily reproducible by ourselves and others, without adding a huge burden to either our day-to-day research or the publication process.
We will cover a number of tools that can facilitate a reproducible workflow and make it possible to track the provenance of results from a published article back through intermediate analysis stages to the original data, models, and/or simulations. The tools covered include Git, Mercurial, Sumatra, VisTrails, odML, and Neo. Furthermore, we will highlight strategies for validating the correctness, reliability, and limits of novel concepts and codes when designing computational analysis approaches (e.g., Pazienti and Grün, 2006; Louis et al., 2010a, 2010b).
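To give a taste of what such tools provide, below is a minimal, hand-rolled sketch of the kind of record that a provenance-tracking tool such as Sumatra captures automatically for every run: the exact version of the code, the parameters used, and the computing environment. The function name, output file, and parameters here are all hypothetical, and the sketch assumes the analysis code lives in a Mercurial working copy (substitute the equivalent Git command otherwise).

```python
# A minimal sketch of manual provenance capture; tools such as Sumatra
# automate this bookkeeping. All names here are illustrative.
import json
import platform
import subprocess
from datetime import datetime, timezone

def capture_provenance(parameters, output_file="provenance.json"):
    """Store the code version, parameters and platform alongside a result."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "parameters": parameters,
        "platform": platform.platform(),
        "python_version": platform.python_version(),
        # Ask Mercurial for the revision of the current working copy;
        # with Git, use ["git", "rev-parse", "--short", "HEAD"] instead.
        "code_version": subprocess.check_output(
            ["hg", "identify", "--id"], text=True
        ).strip(),
    }
    with open(output_file, "w") as f:
        json.dump(record, f, indent=2)
    return record

if __name__ == "__main__":
    # Hypothetical parameters for a spike-train correlation analysis.
    capture_provenance({"n_neurons": 100, "bin_size_ms": 5.0})
```

Keeping such a record next to every result file makes it possible, months later, to identify exactly which code revision and parameter set produced a given figure.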
- The need for better workflows in data analysis of electrophysiological signals
- Reproducible research
- Best practices for managing complex workflows: code
- Version control
- Basic ideas
- Examples of version control systems
- The importance of tracking projects, not individual files
- Advantages of formal version control systems
- Installing Mercurial
- Creating a repository
- Adding files to the repository
- Committing changes
- Viewing the history of changes
- Seeing what’s changed
- Switching between versions
- Giving informative names to versions
- Recap #1
- Making backups
- Working on multiple computers
- Collaborating with others
- Recap #2
- A comparison of Git and Mercurial
- A comparison of Subversion and Mercurial
- Graphical tools
- Web-based tools
- Best practices for managing complex workflows: data
- Parallel data analysis
- Verification (testing)
- Validation: the importance of calibrating statistical correlation methods in data analysis
- Provenance tracking
- The importance of tracking metadata
- Conclusions and outlook
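To give a flavour of the data-oriented topics in the outline above (a common data model and metadata that travels with the data), here is a minimal sketch using the Neo object model. It assumes the neo and quantities Python packages are installed; all names and values are illustrative.

```python
# A minimal sketch of Neo's shared object model for electrophysiology
# data, with metadata attached as annotations. Values are illustrative.
import numpy as np
import quantities as pq
from neo.core import Block, Segment, SpikeTrain

# A Block groups the data of one recording session; a Segment groups
# data sharing a common time basis, e.g. one trial.
block = Block(name="session-01")
segment = Segment(name="trial-01")
block.segments.append(segment)

# A SpikeTrain holds spike times with explicit units and boundaries.
spike_times = np.sort(np.random.uniform(0.0, 10.0, size=25))
st = SpikeTrain(spike_times * pq.s, t_start=0.0 * pq.s, t_stop=10.0 * pq.s)

# Annotations keep essential metadata attached to the data they describe.
st.annotate(unit_id=1, electrode=3, sorting_quality="good")
segment.spiketrains.append(st)

print(block.segments[0].spiketrains[0].annotations)
```

Because every analysis tool that speaks Neo sees the same structure, data can move between acquisition, simulation, and analysis code without ad hoc format conversions.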
Donoho, D.L., Maleki, A., Rahman, I.U., Shahram, M. and Stodden, V. (2009) 15 Years of Reproducible Research in Computational Harmonic Analysis. Computing in Science and Engineering 11:8-18. doi:10.1109/MCSE.2009.15
Pazienti, A. and Grün, S. (2006) Robustness of the significance of spike correlation with respect to sorting errors. Journal of Computational Neuroscience 21:329-342.
Louis, S. et al. (2010a) Generation and selection of surrogate methods for correlation analysis. In: Grün, S. and Rotter, S. (eds.) Analysis of Parallel Spike Trains. Springer Series in Computational Neuroscience.
Louis, S. et al. (2010b) Surrogate spike train generation through dithering in operational time. Frontiers in Computational Neuroscience 4:127. doi:10.3389/fncom.2010.00127
This document is licensed under a Creative Commons Attribution 3.0 licence. You are free to copy, adapt or reuse these notes, provided you give attribution to the authors and include a link to this web page.
https://bitbucket.org/apdavison/reproducible_research_cns - feel free to fork the repository!