Raw Data in Organic Chemistry Papers/Open Science
Open science is a way of conducting science where anyone can participate and all ideas and data are freely available. It’s a sensational idea for speeding up research. We’re starting to see big projects in several fields around the world, showing the value of opening up the scientific process. We’re doing it, and are on the verge of starting up something in open source drug discovery. The process brings up an important question.
I’m an organic chemist. If I want people to get involved and share data in my field I have to think about how to best share those data. I’m on the board of more than one chemistry journal that is thinking about this right now, in terms of whether to allow/encourage authors to deposit data with their papers. Rather than my formulating recommendations for how we should share chemical data, I wanted to throw the issue open, since there are some excellent chemistry bloggers out there in my field who may already have well-founded opinions in this area. Yes, I’m talking about you.
The standard practice in many good organic chemistry journals is not to share raw data, but typically to ask for PDF versions of important spectra, usually for novel compounds. These naturally serve as a useful tool for the peer-review process, in that a reviewer can easily see whether a compound has been made, and say something of its purity. Such reproductions are not ironclad guarantees that a compound has actually been synthesised, nor that it was the reported process that actually gave rise to that sample. Nonetheless, it’s useful to the reviewer.
Are PDF reproductions useful to science? Well, not really. Peter Murray-Rust talks about PDFs as being “hamburgers”. I think I understand what he means: PDF data are dead – actually very dead, and the cow would be more interesting. You can’t DO anything with a PDF. You can’t extract the underlying numbers, nobody can re-analyse the spectrum or zoom in, and a machine can’t read the spectrum with any accuracy. Data are lost in conversion.
With raw data, you allow other people to check the data and to re-analyse them. You allow computers to take the data and do interesting things. If all data were raw, you could ask the interweb, for example, “Find me examples of compounds containing an AB quartet with a coupling constant above 18 Hz. And the molecule needs to contain nitrogen. And synthesised since 1987. And have a melting point.” Maybe that question’s important, maybe not. But with raw data you can at least ask questions of the data.
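As a rough illustration of what such a question looks like once the data are machine-readable, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the record layout (`Compound`, `Multiplet`), the field names, and the five placeholder entries are invented, and the sketch presumes that multiplets have already been picked from the raw spectra by some earlier processing step.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Multiplet:
    """One multiplet picked from a 1H NMR spectrum (hypothetical layout)."""
    pattern: str                      # e.g. "ABq", "d", "t"
    shift_ppm: float
    couplings_hz: List[float] = field(default_factory=list)


@dataclass
class Compound:
    """One deposited compound record (hypothetical layout)."""
    name: str
    formula: Dict[str, int]           # element symbol -> atom count
    year_reported: int
    melting_point_c: Optional[float]  # None if no melting point was reported
    multiplets: List[Multiplet] = field(default_factory=list)


def matches(c: Compound) -> bool:
    """AB quartet with J > 18 Hz, contains N, reported 1987 or later, has a m.p."""
    has_wide_abq = any(
        m.pattern == "ABq" and any(j > 18.0 for j in m.couplings_hz)
        for m in c.multiplets
    )
    return (
        has_wide_abq
        and c.formula.get("N", 0) > 0
        and c.year_reported >= 1987
        and c.melting_point_c is not None
    )


# Invented placeholder records, not real compounds or real measurements.
records = [
    Compound("example-A", {"C": 10, "H": 13, "N": 1, "O": 2}, 1995, 112.0,
             [Multiplet("ABq", 3.45, [18.5])]),
    Compound("example-B", {"C": 8, "H": 10, "O": 1}, 2001, 64.0,
             [Multiplet("ABq", 3.10, [19.2])]),   # no nitrogen
    Compound("example-C", {"C": 9, "H": 11, "N": 1}, 1979, 88.0,
             [Multiplet("ABq", 3.60, [18.9])]),   # reported before 1987
    Compound("example-D", {"C": 7, "H": 9, "N": 1}, 2005, None,
             [Multiplet("ABq", 3.30, [20.1])]),   # no melting point
    Compound("example-E", {"C": 7, "H": 9, "N": 1}, 2005, 50.0,
             [Multiplet("ABq", 4.10, [12.0])]),   # coupling too small
]

hits = [c.name for c in records if matches(c)]
print(hits)
```

The filter itself is trivial. The hard part is getting spectra into a form where fields like `couplings_hz` can be derived at all, and that is the step a PDF picture of a spectrum makes impossible.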
What are the downsides of posting raw data in organic chemistry, either in papers or in lab book posts?
1) You have to save the data and then upload them. Well, this was a problem in 1995, but not now.
2) The data files are large. Not really. A 1H NMR spectrum is ca. 200 KB.
3) It’s a pain. Yes, a little. But we must suffer for things we love.
4) People might find mistakes in my spectra/assignments. Yes. You’re a scientist. This is a Good Thing.
An important fact: for many papers, supporting information is freely accessible, not behind a paywall along with the rest of the paper. If the ACS, for example, accepted raw data as SI, that would allow the free exchange of raw spectroscopic data. That would be neat.
I wouldn’t advocate stopping PDF reproductions, necessarily, since these are still useful for review, and for the casual reader. We’re likely to keep using PDF for our electronic lab notebooks, but the data need to be there too. Like an ORTEP plot and a CIF: picture and data.
If we can establish that we should be posting raw data, then what kinds of data should we share, and how? This post is meant to outline an answer, and ask for feedback from anyone who’s already thought about this.
1) X-ray crystallography. This is the exception. Data are routinely deposited raw, and may be downloaded. Not always the case, but XRD blazes a trail here.
2) NMR spectroscopy. The big one. IUPAC recommends the JCAMP-DX file format. Jean-Claude Bradley has been a proponent of this format, and has demonstrated how it can be used in all kinds of applications. We’ve played with it, and in one of our recent papers we deposited all the NMR data in this format in the SI. We’ve been posting JCAMP-DX files in our online electronic lab notebooks, e.g. here. My experience of this file format (both generating it and reading it) has not been great. There are two formats, I understand, and we found that if we saved the data in the wrong one, we couldn’t read the data with certain programs, but could with others, i.e. we had to get the generation of the file just right. That kind of trickiness, though small, inevitably means people won’t bother to generate or use the files on a mass scale (unless the journals decide to back it). PDF’s popularity is based on the ubiquity of the reader.
JCAMP-DX works well with JSpecView, a free, open source NMR data reader. We’ve not enjoyed our experiences with this, either, though it’s a wonderful endeavour. This led us to look at whether there was a need for saving the data in a particular format, or whether we could just save the raw data, and process those data with a free piece of software. After looking at this with our resident NMR guru, Ian Luck, we found that saving raw data is easy (it’s just a copy and paste of what’s produced by the machine) and that the raw data can be read by free software such as SpinWorks or ACD/Labs, obviously in addition to our in-house software.
This seems ideal? Does anyone know why IUPAC prefers a derived data format over the raw data, other than that JCAMP-DX is a single file? Aren’t raw data likely to be the most generically useful long-term?
I don’t know if people have experience of this. I was in touch with one of the ACS journals recently, who indicated that their view was that the journal is not a data repository, and that raw data (the posting of which was in their view to some extent desirable) should be posted elsewhere, e.g. to an institutional repository. This is an option. I think it’s less convenient. PLoS seem happy to host the data.
3) IR data. Don’t know if there is a standard. If the file is small, saving raw data could be encouraged. Would allow easy comparisons of fingerprint regions.
4) Mass spectrometry. It’s not clear to me there is a huge advantage here to sharing raw data, for a typical low res experiment?
5) HPLC data. Again, the outputs are fairly simple, and I’m not clear about the advantage of raw data (which I’m assuming would be an absorbance vs. time table). Would (perhaps) permit verification that traces have not been cropped to remove pesky impurities.
6) Anything else?
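On the JCAMP-DX point in item 2 above: the format is plain text, made of labelled records that start with `##`, so the simplest files can be read with very little code. The sketch below is a minimal, hedged illustration and not a full reader. It assumes a single data block whose `##XYDATA=(X++(Y..Y))` table is written as plain decimal numbers, and it rejects anything else, including the compressed table encodings the format also allows. Whether that plain/compressed split is the “two formats” issue described above is not something this sketch can confirm. The function name `parse_jcamp_affn` and the sample file contents are invented for illustration.

```python
import re
from typing import Dict, List, Tuple

# Matches a signed decimal number, with optional exponent.
NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")


def parse_jcamp_affn(text: str) -> Tuple[Dict[str, str], List[Tuple[float, float]]]:
    """Read header records and an uncompressed (X++(Y..Y)) table."""
    header: Dict[str, str] = {}
    y_raw: List[float] = []
    in_table = False
    for raw_line in text.splitlines():
        line = raw_line.split("$$", 1)[0].strip()   # "$$" starts a comment
        if not line:
            continue
        if line.startswith("##"):
            label, _, value = line[2:].partition("=")
            # Labels are compared ignoring spaces, "-", "/", "_" and case.
            label = re.sub(r"[\s\-/_]", "", label).upper()
            if label == "END":
                break
            header[label] = value.strip()
            in_table = label == "XYDATA"
            continue
        if in_table:
            # Heuristic: anything but digits, signs, dots and exponents is
            # treated as a compressed or unsupported table line.
            if re.search(r"[^0-9eE+\-.\s]", line):
                raise ValueError("data line is not plain numbers (compressed form?)")
            values = [float(tok) for tok in NUMBER.findall(line)]
            y_raw.extend(values[1:])   # first number on a line is its X value
    n = int(header["NPOINTS"])
    if n != len(y_raw):
        raise ValueError(f"expected {n} points, found {len(y_raw)}")
    first_x = float(header["FIRSTX"])
    last_x = float(header["LASTX"])
    y_factor = float(header.get("YFACTOR", "1"))
    step = (last_x - first_x) / (n - 1) if n > 1 else 0.0
    # X values are rebuilt from FIRSTX/LASTX/NPOINTS; Y values are scaled.
    points = [(first_x + i * step, y * y_factor) for i, y in enumerate(y_raw)]
    return header, points


# Invented placeholder file, not a real spectrum.
SAMPLE = """\
##TITLE=placeholder spectrum (invented values)
##JCAMP-DX=4.24
##DATA TYPE=NMR SPECTRUM
##XUNITS=HZ
##YUNITS=ARBITRARY UNITS
##FIRSTX=0.0
##LASTX=5.0
##XFACTOR=1.0
##YFACTOR=0.5
##NPOINTS=6
##XYDATA=(X++(Y..Y))
0.0 2 4 6
3.0 8 10 12
##END=
"""

header, points = parse_jcamp_affn(SAMPLE)
print(header["TITLE"], len(points))
```

A reader that fails loudly on table encodings it does not understand is arguably more useful than one that silently misreads them, which may be part of why the same file can open in one program and not another.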
Jean-Claude Bradley 11:52 pm on August 7, 2011
Mat – you can share JCAMP-DX spectra without asking people to download software. Just upload the file to any open server and append the url from service #4 here:
http://onswebservices.wikispaces.com/NMR
It uses the non-Java ChemDoodle components so should work on Mac, many smartphones, etc. In your case I believe the issue was spaces in the filename – if you remove those it should work fine – let me know. Click on this link to see what it should look like:
http://tinyurl.com/432tdbn
As for other forms of spectral data you can do pretty much all of them using JCAMP-DX, as shown in our SpectralGame options (C NMR, IR, UV)
http://spectralgame.com/
MS can be done too.
Another advantage of having the NMR in JCAMP-DX is that you can call web services to automatically integrate within a Google Spreadsheet, for calculating solubility for example: See link #3
http://onswebservices.wikispaces.com/NMR
Peter Murray-Rust 12:19 am on August 8, 2011
Mat, great post – answering various points:
>>>Open science is a way of conducting science where anyone can participate and all ideas and data are freely available. It’s a sensational idea for speeding up research. We’re starting to see big projects in several fields around the world, showing the value of opening up the scientific process. We’re doing it, and are on the verge of starting up something in open source drug discovery. The process brings up an important question.
I am excited about the OSDD effort(s) and think there is a lot of Open technology they can use.
>>>I’m an organic chemist. If I want people to get involved and share data in my field I have to think about how to best share those data. I’m on the board of more than one chemistry journal that is thinking about this right now, in terms of whether to allow/encourage authors to deposit data with their papers.
Many already do “require” PDFs. There is no agreed way of doing it, but if what you mean is depositing JCAMPs then YES. The OS community can hack any variants
>>>1) You have to save the data and then upload them. Well, this was a problem in 1995, but not now.
agreed – trivial in time and size of files
>>>2) The data files are large. Not really. A 1H NMR spectrum is ca. 200KB.
>>> 3) It’s a pain. Yes, a little. But we must suffer for things we love.
see below
>>>4) People might find mistakes in my spectra/assignments. Yes. You’re a scientist. This is a Good Thing.
Yes – and some bad chemistry has been detected and corrected
>>>An important fact: For many papers, supporting information is actually public domain, not behind a paywall along with the rest of the paper. The ACS, for example, would, by posting raw data as SI, allow the free exchange of raw spectroscopic data. That would be neat.
The ACS requires CIFs and I congratulate them. If they could just extend that to JCAMPs and computational logfiles that would almost solve everything
>>>1) X-ray crystallography. This is the exception. Data are routinely deposited raw, and may be downloaded. Not always the case, but XRD blazes a trail here.
True for all OA journals (but not much crystallography here except IUCr Acta E). RSC, IUCr and ACS require CIFs (applause). Wiley, Springer and Elsevier do not publish these supplemental data; they are only available from the CCDC, and then not in bulk without a subscription.
>>>2) NMR spectroscopy. The big one. IUPAC recommends the JCAMP-DX file format. Jean-Claude Bradley has been a proponent of this format, and has demonstrated how it can be used in all kinds of applications. We’ve played with it, and in one of our recent papers we deposited all the NMR data in this format in the SI. We’ve been posting JCAMP-DX files in our online electronic lab notebooks, e.g. here. My opinion of this file format (both generating it, and reading it) has not been great. There are two formats, I understand, and we found that if we saved the data in the wrong format, we couldn’t read the data with certain programs, but could with others. i.e. we had to get the generation of the file just right.
Don’t fully understand this. There are actually several formats but the open source software reads all of them. CML-Spect supports these and is readable by JSpecView. This need not be a problem if people have the will to solve it.
>>>I don’t know if people have experience of this. I was in touch with one of the ACS journals recently, who indicated that their view was that the journal is not a data repository, and that posting of raw data (which was in their view to some extent desirable) should be posted elsewhere, e.g. to an institutional repository. This is an option. I think it’s less convenient. PLoS seem happy to host the data.
I have an idea, which I think will fly.
>>>3) IR data. Don’t know if there is a standard. If the file is small, saving raw data could be encouraged. Would allow easy comparisons of fingerprint regions.
JCAMP will hack this
>>>4) Mass spectrometry. It’s not clear to me there is a huge advantage here to sharing raw data, for a typical low res experiment?
JCAMP will do this for “1-D” spectra (e.g. not involving GC or multiple steps).
>>>5) HPLC data. Again, the outputs are fairly simple, and I’m not clear about the advantage of raw data (which I’m assuming would be absorbance vs. time table). Would (perhaps) permit verification that traces have not been cropped to remove pesky impurities.
Again it wouldn’t take much to solve this
>>>6) Anything else?
I think we should use FigShare (see http://blogs.ch.cam.ac.uk/pmr/2011/08/03/figshare-how-to-publish-your-data-to-write-your-thesis-quicker-and-better/ ) and I’ll explain why in my blog in a day or so
Rifleman_82 2:27 am on August 8, 2011
I’ve recently encountered the problem you mentioned with .jdx files when I tried to upload some spectra to ChemSpider. It’s a shame that the journals are not interested in becoming repositories of experimental data. Perhaps not “Open Notebook”, but uploading spectra of known compounds to ChemSpider is helpful for other workers: a way to check whether whatever you made is authentic, for example. I’m not sure how hard Tony Williams looks at the data. For what it’s worth, he’s an NMR specialist. It would be nice if they had a front end which allowed it to act like an open source SciFinder/Reaxys/SDBS.
Rifleman_82 2:28 am on August 8, 2011
Some of the spectra I’ve uploaded:
http://www.chemspider.com/UpdateBox.aspx?id=599518&type=spectra
http://www.chemspider.com/Chemical-Structure.24636.html
Unilever Centre for Molecular Informatics, Cambridge - Figshare meets Open Drug Discovery « petermr's blog 8:02 am on August 8, 2011
[…] (likely to be an increasingly common theme here). He asked for my thoughts on his blog post https://intermolecular.wordpress.com/2011/08/07/raw-data-in-organic-chemistry-papersopen-science/ which starts: Open science is a way of conducting science where anyone can participate and all […]
Alex 6:47 pm on August 8, 2011
>It’s a pain. Yes, a little. But we must suffer for things we love.
Now I know what I will say to my girlfriend/workers/friends.
Unilever Centre for Molecular Informatics, Cambridge - Why we need data repositories: prevention of Scientific Fraud (ACS and others please respond) « petermr's blog 7:47 pm on August 9, 2011
[…] peaks in the wrong place, etc.). Mat Todd has asked the ACS if they will accept digital spectra (https://intermolecular.wordpress.com/2011/08/07/raw-data-in-organic-chemistry-papersopen-science/ ) and … I was in touch with one of the ACS journals recently, who indicated that their view was […]
Richard Kidd 9:19 pm on August 9, 2011
Hi Mat
The RSC are more than happy to get the raw data alongside papers and host with the (Open) ESI, with a couple of provisos –
1. We’d start having difficulties if the files got too big – which I think is where DataCite comes in – but no problem for JCAMPs, Excel files etc.
2. For peer review purposes we do need PDF versions of the tables/spectra – not necessarily ideal, and building in viewers for the data files isn’t impossible – but ease of review is important.
And also – following Rifleman_82’s post – anyone can load up their JCAMP spectra against a compound (or add a new compound and then attach the spectra) on the RSC’s ChemSpider, and mark it as Open Data.
Am happy to follow up with
ChemConnector Blog 10:15 pm on August 15, 2011
[…] @mattoddchem has posted “Raw Data in Organic Chemistry Papers/Open Science” regarding his wanting to “share data in my field I have to think about how to best […]
Antony Williams, ChemConnector 10:24 pm on August 15, 2011
Mat, Great post…similar questions are being asked by many people already. I have responded to your comments here http://tinyurl.com/3vngnwd. I think overall for the problem you are out to solve that RSC ChemSpider is already most of the way there, certainly in terms of the majority of the data you are discussing. We support spectral data and CIFs already. We could manage the raw data files directly (meaning binary file vendor formats as acquired…FIDs for example) but I don’t think most people would care. They would want the processed NMR spectra. But, of course, spectra are better than PDF files. I’d love to get your data collection from the PLoS article to host on ChemSpider. At present I have to download them one at a time, draw the structure and upload one at a time but we can do it in batch if you want to provide the batch of files to us. We’ve done it for hundreds of pairs of spectra and structures before now. Thanks
Putting organic chemistry data on the web | ylioja.net 2:58 pm on September 12, 2011
[…] There’s been quite a bit of discussion about this recently. My boss, Mat Todd, recently wrote about it and this was followed up by ChemConnector, Peter Murray-Rust and others in the comments […]
Figshare: a new way to publish scientific research data « Wellcome Trust Blog 10:54 pm on January 18, 2012
[…] This is where our venture comes in. Figshare is a free service allowing researchers to publish all of their research outputs to the web in seconds in an easily citable, sharable and discoverable manner. We aim to show researchers that they can get the credit for all of their research, whilst at the same time moving research forward in a more efficient manner. […]