Raw Data in Organic Chemistry Papers/Open Science

Open science is a way of conducting science where anyone can participate and all ideas and data are freely available. It’s a sensational idea for speeding up research. We’re starting to see big projects in several fields around the world, showing the value of opening up the scientific process. We’re doing it, and are on the verge of starting up something in open source drug discovery. The process brings up an important question.

I’m an organic chemist. If I want people to get involved and share data in my field I have to think about how to best share those data. I’m on the board of more than one chemistry journal that is thinking about this right now, in terms of whether to allow/encourage authors to deposit data with their papers. Rather than my formulating recommendations for how we should share chemical data, I wanted to throw the issue open, since there are some excellent chemistry bloggers out there in my field who may already have well-founded opinions in this area. Yes, I’m talking about you.

The standard practice in many good organic chemistry journals is not to share raw data, but typically to ask for PDF versions of important spectra, usually for novel compounds. These naturally serve as a useful tool for the peer-review process, in that a reviewer can easily see whether a compound has been made, and say something of its purity. Such reproductions are not ironclad guarantees that a compound has actually been synthesised, nor that it was the reported process that actually gave rise to that sample. Nonetheless, they’re useful to the reviewer.

Are PDF reproductions useful to science? Well, not really. Peter Murray-Rust talks about PDFs as being “hamburgers”. I think I understand what he means: PDF data are dead – actually very dead, and the cow would be more interesting. You can’t DO anything with a PDF. Nobody can take the data and re-analyse the spectrum, or zoom in. The spectrum can’t be understood by a machine with any accuracy. Data are lost in conversion.

With raw data, you allow other people to check the data. You also allow them to re-analyze. You allow computers to take the data and do interesting things. If all data were raw, you could ask the interweb, for example, “Find me examples of compounds containing an AB quartet with a coupling constant above 18 Hz. And the molecule needs to contain nitrogen. And synthesized since 1987. And have a melting point.” Maybe that question’s important, maybe not. But with raw data you can at least ask questions of the data.
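As a rough illustration of that kind of question, here is a minimal Python sketch. It assumes the spectra have already been turned into structured records; the `Compound` record, its field names and the three sample entries are all made up for this example and don't correspond to any existing database.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Compound:
    # Hypothetical record; every field name here is an assumption.
    name: str
    formula: str                      # molecular formula, e.g. "C9H11NO2"
    year: int                         # year the synthesis was reported
    melting_point_c: Optional[float]  # None if no melting point was recorded
    ab_quartet_j_hz: List[float] = field(default_factory=list)  # J of any AB quartets


def contains_nitrogen(formula: str) -> bool:
    # "N" not followed by a lowercase letter, so Na, Ni, Nb etc. don't count.
    return re.search(r"N(?![a-z])", formula) is not None


def matches(c: Compound) -> bool:
    return (
        any(j > 18.0 for j in c.ab_quartet_j_hz)  # AB quartet with J above 18 Hz
        and contains_nitrogen(c.formula)          # contains nitrogen
        and c.year >= 1987                        # synthesized since 1987
        and c.melting_point_c is not None         # has a melting point
    )


# Invented sample data, purely for illustration.
compounds = [
    Compound("example A", "C9H11NO2", 1995, 102.0, [18.5]),
    Compound("example B", "C2H3NaO2", 2001, 324.0, [19.0]),  # sodium, no nitrogen
    Compound("example C", "C8H9NO", 1980, 114.0, [18.2]),    # reported before 1987
]

hits = [c.name for c in compounds if matches(c)]
print(hits)  # ['example A']
```

The query itself is trivial; the hard part is getting coupling constants and the like into a machine-readable form in the first place, which is exactly what a PDF prevents.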

What are the downsides of posting raw data in organic chemistry, either in papers or in lab book posts?

1) You have to save the data and then upload them. Well, this was a problem in 1995, but not now.

2) The data files are large. Not really. A 1H NMR spectrum is ca. 200 KB.

3) It’s a pain. Yes, a little. But we must suffer for things we love.

4) People might find mistakes in my spectra/assignments. Yes. You’re a scientist. This is a Good Thing.

An important fact: for many papers, supporting information is freely accessible, not behind a paywall along with the rest of the paper. If the ACS, for example, posted raw data as SI, that would allow the free exchange of raw spectroscopic data. That would be neat.

I wouldn’t advocate stopping PDF reproductions, necessarily, since these are still useful for review, and for the casual reader. We’re likely to keep using PDF for our electronic lab notebooks, but the data need to be there too. Like ORTEP and CIF – picture and data.

If we can establish that we should be posting raw data, then what kinds of data should we share, and how? This post is meant to outline an answer, and ask for feedback from anyone who’s already thought about this.

1) X-ray crystallography. This is the exception. Data are routinely deposited raw, and may be downloaded. Not always the case, but XRD blazes a trail here.

2) NMR spectroscopy. The big one. IUPAC recommends the JCAMP-DX file format. Jean-Claude Bradley has been a proponent of this format, and has demonstrated how it can be used in all kinds of applications. We’ve played with it, and in one of our recent papers we deposited all the NMR data in this format in the SI. We’ve been posting JCAMP-DX files in our online electronic lab notebooks, e.g. here.

My opinion of this file format (both generating it, and reading it) has not been great. There are two formats, I understand, and we found that if we saved the data in the wrong one, we couldn’t read the data with certain programs, but could with others, i.e. we had to get the generation of the file just right. That kind of trickiness, though small, inevitably means people won’t bother to generate or use the files on a mass scale (unless the journals decide to back it). PDF’s popularity is based on the ubiquity of the reader. JCAMP-DX works well with JSpecView, a free, open source NMR data reader. We’ve not enjoyed our experiences with this, either, though it’s a wonderful endeavour.

This led us to look at whether there was a need for saving the data in a particular format, or whether we could just save the raw data, and process those data with a free piece of software. After looking at this with our resident NMR guru, Ian Luck, we found that saving raw data is easy (it’s just a copy and paste of what’s produced by the machine) and that the raw data can be read by free software such as SpinWorks or ACD/Labs, obviously in addition to our in-house software. This seems ideal. Does anyone know why IUPAC prefers a derived data format over the raw data, other than that JCAMP-DX is a single file? Aren’t raw data likely to be the most generically useful long-term?
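For readers who haven’t looked inside a JCAMP-DX file: it is plain text made of labelled records of the form `##LABEL= value`, followed by the data table. The sketch below is a minimal Python reader under heavy assumptions. The `SAMPLE` text is a made-up toy fragment (not a complete, valid file), `parse_jcamp` is a name invented here, and only the plain, uncompressed `(X++(Y..Y))` table layout is handled; the compressed encodings the standard also allows are not.

```python
# Toy JCAMP-DX-style fragment, invented for illustration only.
SAMPLE = """\
##TITLE= example compound (made-up sample)
##JCAMP-DX= 5.01
##DATA TYPE= NMR SPECTRUM
##XUNITS= HZ
##YUNITS= ARBITRARY UNITS
##FIRSTX= 4000.0
##LASTX= 0.0
##NPOINTS= 5
##XYDATA= (X++(Y..Y))
4000.0 10 12 300 11 9
##END=
"""


def parse_jcamp(text):
    """Return (header dict, list of Y values) for an uncompressed XYDATA block."""
    header = {}
    y_values = []
    in_xydata = False
    for raw_line in text.splitlines():
        line = raw_line.split("$$", 1)[0].strip()  # "$$" starts a comment
        if not line:
            continue
        if line.startswith("##"):
            label, _, value = line[2:].partition("=")
            label = label.strip().upper()
            if label == "END":
                break
            header[label] = value.strip()
            in_xydata = label == "XYDATA"  # data lines follow this record
        elif in_xydata:
            numbers = [float(tok) for tok in line.split()]
            y_values.extend(numbers[1:])  # first number on each line is X
    return header, y_values


header, y_values = parse_jcamp(SAMPLE)
print(header["DATA TYPE"])                      # NMR SPECTRUM
print(len(y_values) == int(header["NPOINTS"]))  # True
```

Even this toy version hints at the trickiness described above: a reader that only understands one table encoding will fail on files written with another.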

I don’t know if people have experience of this. I was in touch with one of the ACS journals recently, who indicated that their view was that the journal is not a data repository, and that raw data (the posting of which was in their view to some extent desirable) should be posted elsewhere, e.g. to an institutional repository. This is an option, though I think it’s less convenient. PLoS seem happy to host the data.

3) IR data. Don’t know if there is a standard. If the file is small, saving raw data could be encouraged. Would allow easy comparisons of fingerprint regions.

4) Mass spectrometry. It’s not clear to me there is a huge advantage here to sharing raw data, for a typical low res experiment?

5) HPLC data. Again, the outputs are fairly simple, and I’m not clear about the advantage of raw data (which I’m assuming would be an absorbance vs. time table). Would (perhaps) permit verification that traces have not been cropped to remove pesky impurities.
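One way the cropping check might look, as a hedged Python sketch: given the time column of a raw absorbance vs. time table and the stated run length, flag stretches where data points are missing. The function name `find_gaps`, the `tolerance` parameter and the sample numbers are all assumptions made for this illustration, and a real trace would need more care (e.g. instruments that change sampling rate mid-run).

```python
def find_gaps(times_min, run_length_min, tolerance=1.5):
    """Return (start, end) time ranges, in minutes, with no data points.

    A stretch counts as a gap if it is longer than `tolerance` times the
    median spacing between consecutive time points.
    """
    steps = [b - a for a, b in zip(times_min, times_min[1:])]
    typical = sorted(steps)[len(steps) // 2]  # median sampling interval
    limit = tolerance * typical

    gaps = []
    if times_min[0] > limit:                      # trace starts late
        gaps.append((0.0, times_min[0]))
    for a, b in zip(times_min, times_min[1:]):    # holes in the middle
        if b - a > limit:
            gaps.append((a, b))
    if run_length_min - times_min[-1] > limit:    # trace stops early
        gaps.append((times_min[-1], run_length_min))
    return gaps


# Invented example: a 10 minute run sampled every 0.5 min, with points missing.
times = [0.0, 0.5, 1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
gaps = find_gaps(times, run_length_min=10.0)
print(gaps)  # [(2.0, 4.0), (5.0, 10.0)]
```

A check like this is only possible with the raw table; a picture of a trace gives no way of telling whether the axis was trimmed.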

6) Anything else?