The trouble with scientific software

Via Nautilus’ excellent Three Sentence Science, I was interested to read Nature’s list of “10 scientists who mattered this year”.

One of them, Sjors Scheres, has written software - RELION - that creates three-dimensional images of protein structures from cryo-electron microscopy images.

I was interested in finding out more about this software: how it had been created, and how the developer(s) had been able to make such a significant improvement in protein imaging.

The Scheres lab has a website. There’s no software section, but in the “Impact” tab is a section about RELION:

“The empirical Bayesian approach to single-particle analysis has been implemented in RELION (for REgularised LIkelihood OptimisatioN). RELION may be downloaded for free from the RELION Wiki. The Wiki also contains its documentation and a comprehensive tutorial.”

I was hoping for a link to GitHub, but at least the source code is available (though the “for free” is worrying, signifying that the default is “not for free”).

On the RELION Wiki, the introduction states that RELION “is developed in the group of Sjors Scheres” (slightly problematic, as this implies that outsiders are excluded, and that development of the software is not an open process).

There’s a link to “Download & install the 1.3 release”. On that page is a link to “Download RELION for free from here”, which leads to a form asking for name, organisation and email address. The fields aren’t validated, so they can contain anything. The aim is to allow the owners to email users if a critical bug is found, but registration shouldn’t really be a requirement for downloading the software.

Finally, you get the software: relion-1.3.tar.bz2, containing files that were last modified in February and June this year.

The file is downloaded over HTTP, with no hash provided that would allow verification of the authenticity or correctness of the downloaded file.
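As a rough sketch of what a published hash would allow, the Python below computes the SHA-256 digest of a downloaded file and compares it with a digest supplied by the maintainers. The function names are made up for illustration, and no such digest exists for RELION at present. A matching digest only demonstrates authenticity if the digest itself was obtained over a trusted channel (for example HTTPS); otherwise it only catches a corrupted download.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, published_hex):
    """Compare the file's digest with a digest published by the maintainers.

    `published_hex` is hypothetical here: it stands for a value the project
    would list next to the download link.
    """
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, published_hex.strip().lower())
```

With a published digest to hand, a user would call matches_published_digest("relion-1.3.tar.bz2", published_hex) before unpacking the archive.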

The COPYING file contains the GPLv2 license - good!

There’s an AUTHORS file, but it doesn’t really list the contributors in a way that would be useful for citation. Instead, it’s mostly about third-party code:

This program is developed in the group of Sjors H.W. Scheres at the MRC Laboratory of Molecular Biology.

However, it does also contain pieces of code from the following packages:
XMIPP: http://xmipp.cnb.csic.es
BSOFT: http://lsbr.niams.nih.gov/bsoft/
HEALPIX: http://healpix.jpl.nasa.gov/

Original disclaimers in the code of these external packages have been maintained as much as possible. Please contact Sjors Scheres (scheres@mrc-lmb.cam.ac.uk) if you feel this has not been done correctly. 

This is problematic, because the licenses of these pieces of software aren’t known. They are difficult to find: trying to download XMIPP hits another registration form, and BSOFT has no visible license. At least HEALPIX is hosted on SourceForge and has a clear GPLv2 license.

The CHANGELOG and NEWS files are empty. Apart from the folder name, the only way to find out which version of the code is present is to look in the configure script, which contains PACKAGE_VERSION='1.3'. There’s no way to know what has changed since the previous version, as previous versions are not available anywhere. This also means that it’s impossible to reproduce results generated using older versions of the software.
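For anyone who needs to record the version programmatically, here is a minimal sketch that pulls the value out of the text of a configure script. It assumes the version appears on a line of its own in the form PACKAGE_VERSION='1.3', as it does in this release; the function name and the pattern are illustrative and not part of RELION.

```python
import re

# Matches a shell assignment such as PACKAGE_VERSION='1.3' on its own line,
# with optional single or double quotes around the value.
VERSION_PATTERN = re.compile(
    r"""^PACKAGE_VERSION=['"]?([^'"\s]+)['"]?\s*$""",
    re.MULTILINE,
)


def package_version(configure_text):
    """Return the declared package version, or None if no line matches."""
    match = VERSION_PATTERN.search(configure_text)
    return match.group(1) if match else None
```

Reading the configure file into a string and passing it to package_version gives a value that could be stored alongside any results, which is a workaround rather than a substitute for proper release notes.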

The README file contains information about how to credit the authors of RELION if it is used: by citing the article Scheres, JMB (2011) (DOI: 10.1016/j.jmb.2011.11.010) which describes how the software works (the version of the software that was available in 2011, at least). This article is Open Access and published under CC-BY v3 (thanks MRC!).

Suggested Improvements

The source code for RELION should be in a public version control system such as GitHub, with tagged releases.

The CHANGELOG should be maintained, so that users can see what has changed between releases.

There should be a CITATION file that includes full details of the authors who contributed to (and should be credited for) development of the software, the name and current version of the software, and any other appropriate citation details.

Each public release of the software should be archived in a repository such as figshare, and assigned a DOI.

There should be a way for users to submit visible reports of any issues that are found with the software.

The parts of the software derived from third-party code should be clearly identified, and removed if their license is not compatible with the GPL.
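To make the CITATION suggestion above more concrete, here is a small sketch that renders such a file as plain text. The field layout is invented for illustration rather than taken from any standard. The only author listed is the one named in the AUTHORS file, so a real file would need the full list of contributors, and the release DOI is left out because none has been assigned.

```python
def render_citation(name, version, authors, article_doi, archive_doi=None):
    """Render a plain-text CITATION file; the field layout is illustrative only."""
    lines = [
        f"Software: {name}",
        f"Version: {version}",
        "Authors:",
    ]
    lines.extend(f"  - {author}" for author in authors)
    lines.append(f"Article DOI: {article_doi}")
    # The release DOI is optional: it exists only once a release is archived.
    if archive_doi is not None:
        lines.append(f"Release DOI: {archive_doi}")
    return "\n".join(lines) + "\n"


text = render_citation(
    name="RELION",
    version="1.3",
    # Only the name given in the AUTHORS file; other contributors are unknown.
    authors=["Sjors H.W. Scheres"],
    article_doi="10.1016/j.jmb.2011.11.010",
)
```

Generating the file from the same metadata used to tag a release would help keep the version number in CITATION from drifting out of date.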


For more discussion of what is needed to publish citable, re-usable scientific software, see the issues list of Mozilla Science Lab's "Code as a Research Object" project.