Many of us can now expect to live into our 10th decade. However, with almost half the population over 90 being diagnosed with dementia, the societal and economic costs of cognitive decline are expected to
How and why researchers share data (and why they don’t)
I am pleased to present here results from a survey Wiley conducted into researcher views of data sharing. Earlier this year, we contacted 90,000 researchers across a wide array of disciplines and received more than 2,250 responses from individuals engaged in active research programs. Leading up to the survey, we conducted a series of interviews…
Access to research results, immediately and without restriction, has always been at the heart of PLOS’ mission and the wider Open Access movement. However, without similar access to the data underlying the findings, the article can be of limited use. For this reason, PLOS has always required that authors make their data available to other academic researchers who wish to replicate, reanalyze, or build upon the findings published in our journals.
In an effort to increase access to this data, we are now revising our data-sharing policy for all PLOS journals: authors must make all data publicly available, without restriction, immediately upon publication of the article. Beginning March 3rd, 2014, all authors who submit to a PLOS journal will be asked to provide a Data Availability Statement, describing where and how others can access each dataset that underlies the findings. This Data Availability Statement will be published on the first page of each article.
What do we mean by data?
“Data are any and all of the digital materials that are collected and analyzed in the pursuit of scientific advances.” Examples could include spreadsheets of original measurements (of cells, of fluorescent intensity, of respiratory volume), large datasets such as next-generation sequence reads, verbatim responses from qualitative studies, software code, or even image files used to create figures. Data should be in the form in which it was originally collected, before summarizing, analyzing or reporting.
What do we mean by publicly available?
All data must be in one of three places:
- the body of the manuscript; this may be appropriate for studies where the dataset is small enough to be presented in a table
- in the supporting information; this may be appropriate for moderately-sized datasets that can be reported in large tables or as compressed files, which can then be downloaded
- in a stable, public repository that provides an accession number or digital object identifier (DOI) for each dataset; there are many repositories that specialize in specific data types, and these are particularly suitable for very large datasets
Do we allow any exceptions?
Yes, but only in specific cases. We are aware that it is not ethical to make some datasets fully public, such as private patient data or specific information relating to endangered species. Some authors also obtain data from third parties and therefore do not have the right to make that dataset publicly available. In such cases, authors must state that “Data is available upon request”, and identify the person, group or committee to whom requests should be submitted. The authors themselves should not be the only point of contact for requesting data.
Where can I go for more information?
The revised data sharing policy, along with more information about the issues associated with public availability of data, can be reviewed in full at:
Image: Open Data stickers by Jonathan Gray
The RECODE project is an EU-funded project designed to compile a set of generic guidelines for EU funders to use when forming research data sharing policies. The premise is that publicly funded data should be openly accessible to the public, because they have paid for it. The workshop signalled the end of the first work-package of the project. That work-package studied stakeholder values and ecosystems, that is, individuals’ and scientific groups’ concepts of open access to data, and examined current good practice in the area. Other topics, such as the ethical considerations and the technological solutions of sharing data, are to be tackled in other work-packages. This workshop was of particular interest to the CRC and Sherpa Services because we have recently conducted research into journal research data (the JoRD project; http://jordproject.wordpress.com) and because of the implications for funders’ policies in SHERPA/JULIET.
It was with some relief that we found that our findings about stakeholder perspectives were broadly the same as the RECODE findings; it shows that we were right! I gathered some extra insights from presentations by representatives of the participants in the RECODE case studies. For example, there is not a clear difference of opinion on opening up research data between scientific disciplines, but there are many opinions within each discipline. It reminded me of the adage “when you put two academics together you get three different opinions”. It seems to me that it would be easier to sort the factions across disciplinary lines into “pro data sharing”, “contra data sharing” and “no-one would want our data because it is boring”. Another major problem of sharing data that became apparent is that the person who can interpret the data best is the person who collected it, because data needs a context. In other words, the knowledge that the data reveals is stuck inside someone’s head, and it is very hard to make that openly accessible. This is the knowledge management problem of intellectual capital. One of the RECODE team expressed it as: “a lot of knowledge is lost when you lose another post-doc”.
Other issues were raised about technological infrastructure, data licensing, data citation, lack of standardisation of practice within the same fields, the simple practicality of opening huge data sets (the word petabytes was bandied about) and whether some sort of reward to an academic could be triggered for openly sharing their data. Overall, the workshop raised some interesting points, and I do not envy the RECODE project team in trying to reach a generic set of open research data guidelines for funders. This is a project that we will follow with great interest.
You can find more about the RECODE project on their website http://recodeproject.eu/
Last month PLOS ONE attended the ISMB/ECCB 2013 conference in Berlin on Intelligent Systems for Molecular Biology. More than 1,500 delegates attended what is the largest conference on computational biology in the world to discuss the latest developments in computational methods that address biological questions.
The opening keynote from PLOS ONE Academic Editor Gil Ast focused on alternative splicing, a mechanism by which several mRNA transcripts are generated from the same mRNA precursor, thus enhancing transcriptome and proteome diversity. He mentioned a paper his group published earlier this year in PLOS ONE, in which they showed that pre-mRNA splicing influences nucleosome organization, suggesting that there is a bi-directional interplay between chromatin organization and splicing. While it is widely accepted that chromatin organization and DNA modification regulate transcription, it is intriguing that splicing can in turn affect chromatin organization, and this may constitute an additional layer of regulation of gene expression. He also presented exciting recent findings showing how pre-mRNA splicing and the creation of new exons in the human genome may be linked to certain genetic disorders and types of cancers.
Understanding the biology of complex human disease is also one of the objectives of Goncalo Abecasis, winner of the ISCB 2013 Overton Prize. Specifically, he is interested in better understanding genetic variation and its connections to human diseases using computational methods and statistical tools. In his talk, he emphasized that the identification and characterization of the genetic variants that affect human traits may be achieved by examining the link between these traits and the complete genome sequences of thousands of individuals. To collect DNA from as many people as possible, he wondered whether we should make use of social media to call for volunteers to send their DNA samples. Are Facebook and Twitter the key to understanding human genetics?
One topic that generated much discussion at the meeting was data sharing. In her talk, Carole Goble called for all scientists to share their data widely so as to enable reproducibility, a principle underpinning the scientific method. Several journals, including PLOS ONE, require that all data (including all relevant raw data) described in the manuscript be made freely available to any scientist wishing to use them for the purpose of academic, non-commercial research. Well-established and widely supported public repositories already exist for certain types of data such as nucleic acid sequences, and in cases where an appropriate repository does not exist, there are also general data repositories such as Dryad. Assigned accession numbers or digital object identifiers (DOIs) facilitate data citation and ensure accountability. An increasing number of research funding agencies also now support data sharing in the life sciences. Whilst there is indeed increasing discussion about making primary data from published research publicly available, Goble mentioned a paper by Ioannidis and colleagues showing that a substantial proportion of articles published in high-impact journals do not comply (or only weakly comply) with data availability requirements. According to Goble, a lack of data sharing, and thus reproducibility, could lead to an increase in retracted scientific papers.
She also urged the computational biology community to release their “dark data”, i.e. data that is not published and remains hidden on various USB drives and computers. Her point was that, if these results are shared, more people will be able to use them, increasing visibility, accountability and reproducibility. As highlighted by a recent study, data sharing is not an end in itself, but rather a crucial form of scientific knowledge dissemination.
Keren-Shaul H, Lev-Maor G, Ast G (2013) Pre-mRNA Splicing Is a Determinant of Nucleosome Organization. PLoS ONE 8(1): e53506. doi:10.1371/journal.pone.0053506
Alsheikh-Ali AA, Qureshi W, Al-Mallah MH, Ioannidis JPA (2011) Public Availability of Published Research Data in High-Impact Journals. PLoS ONE 6(9): e24357. doi:10.1371/journal.pone.0024357
Wallis JC, Rolando E, Borgman CL (2013) If We Share Data, Will Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and Technology. PLoS ONE 8(7): e67332. doi:10.1371/journal.pone.0067332
Wikimedia by Angelineri
Modified from Schwartz S, Oren R, Ast G (2011) Detection and Removal of Biases in the Analysis of Next-Generation Sequencing Reads. PLoS ONE 6(1): e16685. doi:10.1371/journal.pone.0016685