Data acquisition

Where possible, store data in non-proprietary software formats

The software needed to read proprietary file formats can become unavailable: companies change the format or go out of business, or your lab no longer has a licence. For these reasons you should store your data, where possible, in a widely used, documented, non-proprietary format for which multiple software tools are available. Examples include plain text or RTF instead of Microsoft Word, CSV instead of Excel, and HDF5 for time series data.
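For tabular data the switch is often trivial. Here is a minimal Python sketch (filenames and column names are hypothetical) that writes measurements as plain CSV, readable by any spreadsheet, programming language or text editor:

```python
import csv

# Hypothetical example: save sweep measurements as plain CSV
# instead of a spreadsheet's native binary format.
rows = [
    {"time_s": 0.000, "voltage_mV": -65.2},
    {"time_s": 0.001, "voltage_mV": -64.8},
    {"time_s": 0.002, "voltage_mV": -63.9},
]

with open("sweep01.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["time_s", "voltage_mV"])
    writer.writeheader()   # column names on the first line
    writer.writerows(rows)
```

The resulting file is self-describing (the header names the columns and their units), which is part of what makes the format durable.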

In neurophysiology this is easier said than done, since each recording equipment manufacturer tends to provide its own proprietary file format (Plexon, for example). Nevertheless, converting these proprietary formats to a more standard one will make subsequent data analysis easier (you will have a wider choice of tools), facilitate collaboration (your collaborators may not have access to the same software) and ensure you can still read the data in five or ten years' time. For neurophysiology data there are now a number of tools that can perform these conversions, including Neo, sigTool and Neuroshare. Do not, however, delete the original file after you have converted it! Always preserve the original files, so you can confirm that no data were lost or corrupted during the conversion process.
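One practical way to honour the last point is to record a checksum of the raw file at conversion time; if the checksum still matches years later, the original is intact. A minimal sketch using only the standard library (filenames are hypothetical):

```python
import hashlib

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical workflow: after converting the raw file to a standard
# format, record its checksum alongside it instead of deleting it.
raw_file = "session01.raw"           # the original, proprietary file
with open(raw_file, "wb") as f:      # stand-in for real acquired data
    f.write(b"\x00\x01binary recording data")

digest = sha256_of(raw_file)
with open(raw_file + ".sha256", "w") as f:
    f.write(digest + "  " + raw_file + "\n")

# Later: recompute and compare to detect corruption.
assert sha256_of(raw_file) == digest
```

The `.sha256` sidecar convention mirrors the output of the common `sha256sum` command-line tool, so the record can also be checked outside Python.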

Keep backups on stable media, and in geographically separated locations

  • Back up your data as soon as possible after acquiring it.
  • If possible, keep at least two backups: one in the lab and one at least several kilometres away, in case of fire, earthquake, etc.
  • At least one backup should be on a stable, long-term storage medium, e.g. CD-R or DVD-R at the time of writing.
  • Every two years, check whether your backup media are still readable, and consider moving them to a new medium (for example, most MacBooks no longer have internal CD/DVD drives).
  • Consider storing your data in an online repository (e.g. G-Node, INCF Dataspace, Dryad; see Databib for a full list).
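Checking that a backup is still readable can be partly automated: record a checksum for every file at backup time, then recompute and compare during the periodic check. A sketch assuming each backup directory carries a `checksums.txt` manifest (the layout and `.dat` extension are hypothetical):

```python
import hashlib
from pathlib import Path

def file_digest(path):
    """SHA-256 hex digest of a file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def write_manifest(backup_dir):
    """Record a checksum for every data file in the backup."""
    backup = Path(backup_dir)
    lines = [f"{file_digest(p)}  {p.name}"
             for p in sorted(backup.glob("*.dat"))]
    (backup / "checksums.txt").write_text("\n".join(lines) + "\n")

def verify_manifest(backup_dir):
    """Return the names of files whose checksum no longer matches."""
    backup = Path(backup_dir)
    bad = []
    for line in (backup / "checksums.txt").read_text().splitlines():
        digest, name = line.split("  ", 1)
        if file_digest(backup / name) != digest:
            bad.append(name)
    return bad
```

Run `verify_manifest` against each copy during the biennial check; an empty list means the files are still readable and unchanged.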

Have a clear organisation for your data files

There is no single best organisation, but it should ideally be planned in advance, and everyone working on a project needs to agree on it. A structure which often works well is PROJECT/YEAR/MONTH/DAY/SUBJECT, but many variations are possible.

  • Ensure everything has a time stamp, using a standardised date/time format, i.e. ISO 8601 (YYYY-MM-DDThh:mm:ss).
  • Descriptive filenames are often helpful, but don’t try to store too much information in the filename. It is better to use a “database” of metadata (a real database, or just a text file with the same name as the data file but a different extension) which links all the contextual information to the file.
  • Have a single canonical location for your files. If you need to make copies (e.g. to work on your laptop) ensure it is clear which is the “master” copy.
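The PROJECT/YEAR/MONTH/DAY/SUBJECT layout and the ISO 8601 time stamps are easy to generate programmatically, which keeps everyone on the project consistent. A small sketch (project and subject names are hypothetical):

```python
from datetime import datetime
from pathlib import Path

def session_dir(root, project, subject, when):
    """Build PROJECT/YEAR/MONTH/DAY/SUBJECT under root."""
    return (Path(root) / project /
            f"{when.year:04d}" / f"{when.month:02d}" /
            f"{when.day:02d}" / subject)

when = datetime(2014, 3, 7, 14, 30, 5)
d = session_dir("data", "whisker-map", "rat42", when)
print(d)                                   # e.g. data/whisker-map/2014/03/07/rat42
print(when.strftime("%Y-%m-%dT%H:%M:%S"))  # 2014-03-07T14:30:05
```

Zero-padding the month and day (03, not 3) makes directory listings sort chronologically, which is one reason this layout works well in practice.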

Always maintain effective metadata

Metadata are all the pieces of information needed to understand, interpret and analyse your data: at a minimum, everything that goes in the Materials and Methods section of your paper.

There are many ways to store metadata. The classical way is to keep a paper lab notebook. Although this is sometimes a legal requirement, there are many advantages to storing metadata digitally, either instead of, or as well as, a paper lab notebook. (If medical doctors can increasingly use electronic patient records in place of paper, why can’t scientists?)

In order of increasing complexity, electronic metadata can take one of the following forms:

  • a simple text file, “README” style
  • a spreadsheet
  • a more structured metadata file. odML is a good format for neurophysiology.
  • a home-made relational database. If you’re comfortable with SQL, this is a way to build a tool custom-made for your particular experiment
  • dedicated software, either open-source (Yogo, Helmholtz, ...) or commercial (e.g. Ovation).

Note that it is often useful to store metadata in two places: first, as part of, or next to, the data file, so that the data never become separated from their context; secondly, in a database of some kind, for ease of searching and meta-analysis. This redundancy carries the risk that the two sources of metadata get out of sync, although it can also serve as a safety check.
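The “next to the data file” option can be as simple as a JSON sidecar with the same basename, a lightweight stand-in for more structured formats such as odML. A sketch (the field names are hypothetical, not a prescribed schema):

```python
import json
from pathlib import Path

def write_sidecar(data_file, metadata):
    """Write metadata next to the data file: same name, .json extension."""
    sidecar = Path(data_file).with_suffix(".json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return sidecar

# Hypothetical metadata for one recording session.
meta = {
    "recorded": "2014-03-07T14:30:05",  # ISO 8601, as recommended above
    "subject": "rat42",
    "electrode": "tetrode-3",
    "experimenter": "A. N. Other",
}
write_sidecar("session01.dat", meta)
```

Because the sidecar shares the data file's name, the pairing survives copies and moves; the same dictionaries can later be loaded in bulk into a database for searching.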