The Broader Chemical Community’s View of Uploading Data

Opening up your research to the world means you a) benefit from the opinions and knowledge of The Many as you’re doing the research (rather than months afterwards), and b) have to get your research into shape because The Many can cast a critical eye on what you’re doing in a never-ending process of peer review. Science benefits from these things.

Sharing data is the central part of open science. A necessary, not a sufficient, condition, but central nonetheless. One cannot be selective about which data to share, because that would mean making a value judgement about what’s important. And what’s unimportant today may be important tomorrow. So let’s just share data.

Outside of open science (and our community of zealots) we should also be encouraging people to share data as part of traditional research publications. Many of us do, as PDFs of NMR spectra, for example. This common practice is very useful for the refereeing process, to determine whether the science is valid. Sharing PDFs is less useful for science because the data in a PDF are dead. Live data can be played with, PDFs can’t. Puppy vs. roadkill. Cow vs. hamburger. We should be submitting raw data to journals along with our traditional reviewer-friendly supporting information. And we should be asking journals to keep the data outside the paywall.

I recently asked a question about how we should share chemical data – i.e. what data formats would be best. There is an IUPAC standard which we’ve not particularly enjoyed, and we’ve been thinking about just sharing data in as raw a state as possible. Other people picked up on this and provided very useful comments and suggestions here, here and here, as well as in comments to the original post. Thanks guys.

There’s no consensus, though the IUPAC standard does have its fans, and (I didn’t realise) is a data format that can be used for other spectroscopic techniques, not just NMR. I won’t pretend to understand how that’s possible, but it’s interesting.

We’ll keep thinking about this. For our current ELNs we’ll continue to post data and see how we go over time.

However, I think we need to address a background question: how will any solution scale to the broader chemical community? I’m not talking about the technical issue of file format, or what to share. I’m talking about psychology.

My theory is: any solution to data sharing that relies on chemists uploading their data to a central point, or in a prescribed way, will not scale.

Solution: we need to be building solutions that can find chemical data on the web, extract the data and index them. i.e. a solution that involves as little electronic work as possible for the experimentalist.

I think that this is probably very hard, but can’t judge. I don’t know how you get a bot to understand that there is chemical content on a web page, and extract it automatically. I don’t know how you can trust the results. I don’t know what happens when the source web page dies. But I know science needs tools of this kind, and that this is what we’ll be doing in 20 years’ time.
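To make the idea a little more concrete, here is a minimal Python sketch of the "find, extract, index" step for a single page. It is only an illustration under stated assumptions, not a real crawler: the function and class names (`find_chemical_content`, `ChemLinkParser`, `build_index`) are hypothetical, the list of file extensions is an assumed and incomplete guess at what chemical data files look like, and it only recognises identifiers written out as InChI strings in the page text. It says nothing about the harder problems of trust or of dead source pages.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: an illustrative, incomplete list of extensions for chemical data files.
CHEM_EXTENSIONS = (".jdx", ".dx", ".mol", ".sdf", ".cif")

# InChI identifiers begin with "InChI=1/" or "InChI=1S/".
INCHI_PATTERN = re.compile(r"InChI=1S?/[^\s\"'<>]+")


class ChemLinkParser(HTMLParser):
    """Collects link targets whose file extension suggests chemical data."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.data_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(CHEM_EXTENSIONS):
                # Resolve relative links against the page they were found on.
                self.data_links.append(urljoin(self.base_url, value))


def find_chemical_content(html, base_url):
    """Return the data-file links and InChI strings found in one page."""
    parser = ChemLinkParser(base_url)
    parser.feed(html)
    inchis = sorted(set(INCHI_PATTERN.findall(html)))
    return {"url": base_url, "data_links": parser.data_links, "inchis": inchis}


def build_index(pages):
    """Map each InChI to the set of page URLs that mention it."""
    index = {}
    for page in pages:
        for inchi in page["inchis"]:
            index.setdefault(inchi, set()).add(page["url"])
    return index
```

Feeding it the HTML of a lab book page would give back the links to any raw data files on that page and the molecules named there, and `build_index` turns several such results into a lookup from molecule to pages. The experimentalist does nothing beyond posting the page; all the work happens on the indexing side, which is the point.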

Analogy: Google. Imagine if Google had said “Once you’ve created a web page, just send us the details and we’ll put it in our index.”

If we, for a moment, look outside our Band of Open Source Brothers, we see a vast community of talented researchers in chemistry who spend their time making molecules in the lab. To date this community either does not see or does not agree with the advantages of doing science openly, or has no need/wish to engage with the issues, or does not see the advantage of sharing data in traditional publications in a way other than PDFs. I see those advantages, and many people I talk to see the advantages, but the vast majority of chemists do not, yet, for whatever reason. Why, then, would a chemist, who is already busy with work, life, family, thesis writing and everything else, sit down and start uploading data to the web? Remember that our chemist, representing 95% of chemists out there, does not agree that doing so is worthwhile (or is not allowed to do it). There is no incentive. For the incentive to take hold, the world has to change, and that’s going to take some time. It’s also the case that the community is not used to it. We’re used to publishing papers, then having the data appear, as if by magic, in SciFinder or Beilstein or whatever. So we have no problem providing the paper and the data, but we expect others to make it searchable.

Now, I’m a serious fan of Chemspider, and I’ve just come across Figshare. Excellent services. They’re pioneering. They must succeed, and I think to succeed there needs to be a shift from “hoping for user upload” to “bloodthirsty, active data extraction from disparate sites”, however difficult that might be. Anthony, Mark – I’d like to know your thoughts and what I can do to help. I’m whining because I want your work to flourish.

People I speak to then say sentences that begin “But all you have to do is…” and “But it’s easy – you just…” – no. It’s no good. Expecting chemists to upload their data to a specific place will not scale. If there’s an activation energy barrier for me, there’s an orbital-forbidden transition state for most people.

Rather, data need to be posted openly somewhere online:

a) To a lab book if you’re an open scientist

b) To an institutional repository if you’ve just finished a thesis, or generally want to share

c) To supporting information files, if you’re the author of a paper in a journal

whatever is easiest and convenient locally. i.e. there can be a bunch of different solutions.

We can rely on this happening, because this is easy, and related to what chemists are doing right now. We can say to chemists: “Hey, do the research, post data. Wherever you want – either on your own webpage, or provide the data when you submit publications and ensure that the data are not behind a paywall. Here are some guidelines on file formats, but really just post the data. We’ll find the data. We’ll tag them so that other people can find them, and then you’ll see how great it is that you shared the data.”

If there is a way of doing this, of finding data people post wherever and automatically making sense of them, we’ll start seeing some big changes to how things are done. People will start to see the benefits of openness in itself, and we’ll start to move towards an astonishing change – chemists collaborating in real time by finding other people who are working on their molecule/reaction right now.