A lack of transparency and reporting standards in the scientific community has led to increasing and widespread concerns relating to reproduction and integrity of results. As an omics science, which generates vast amounts of data and relies heavily on data science for deriving biological meaning, metabolomics is highly vulnerable to irreproducibility. The metabolomics community has made substantial efforts to align with FAIR data standards by promoting open data formats, data repositories, online spectral libraries, and metabolite databases. Open data analysis platforms also exist; however, they tend to be inflexible and rely on the user to adequately report their methods and results. To enable FAIR data science in metabolomics, methods and results need to be transparently disseminated in a manner that is rapid, reusable, and fully integrated with the published work. To ensure broad use within the community such a framework also needs to be inclusive and intuitive for both computational novices and experts alike.
Aim of Review
To encourage metabolomics researchers from all backgrounds to take control of their own data science, mould it to their personal requirements, and enthusiastically share resources through open science.
Key Scientific Concepts of Review
This tutorial introduces the concept of interactive web-based computational laboratory notebooks. The reader is guided through a set of experiential tutorials specifically targeted at metabolomics researchers, based around the Jupyter Notebook web application, GitHub data repository, and Binder cloud computing platform.
“As a researcher who is trying to understand the structure of the Milky Way, I often deal with very large astronomical datasets (terabytes of data, representing almost two billion unique stars). Every single dataset we use is publicly available to anyone, but the primary challenge in processing them is just how large they are. Most astronomical data hosting sites provide an option to remotely query sources through their web interface, but it is slow and inefficient for our science….
To circumvent this issue, we download all the catalogs locally to Harvard Odyssey, with each independent survey housed in a separate database. We use a special python-based tool (the “Large-Survey Database”) developed by a former post-doctoral scholar at Harvard, which allows us to perform fast queries of these databases simultaneously using the Odyssey computing cluster….
To extract information from each hdf5 file, we have developed a sophisticated Bayesian analysis pipeline that reads in our curated hdf5 files and outputs best fits for our model parameters (in our case, distances to local star-forming regions near the sun). Led by a graduate student and co-PI on the paper (Joshua Speagle), the python codebase is publicly available on GitHub with full API documentation. In the future, it will be archived with a permanent DOI on Zenodo. Also on GitHub users will find full working examples of the code, demonstrating how users can read in the publicly available data and output the same style of figures seen in the paper. Sample data are provided, and the demo is configured as a jupyter notebook, so interested users can walk through the methodology line-by-line….”
“In a recent article, I explained why open source isa vital part of open science. As I pointed out, alongside a massive failure on the part of funding bodies to make open source a key aspect of their strategies, there’s also a similar lack of open-source engagement with the needs and challenges of open science. There’s not much that the Free Software world can do to change the priorities of funders. But, a lot can be done on the other side of things by writing good open-source code that supports and enhances open science.
People working in science potentially can benefit from every piece of free software code—the operating systems and apps, and the tools and libraries—so the better those become, the more useful they are for scientists. But there’s one open-source project in particular that already has had a significant impact on how scientists work—Project Jupyter….”
“Perhaps the paper itself is to blame. Scientific methods evolve now at the speed of software; the skill most in demand among physicists, biologists, chemists, geologists, even anthropologists and research psychologists, is facility with programming languages and “data science” packages. And yet the basic means of communicating scientific results hasn’t changed for 400 years. Papers may be posted online, but they’re still text and pictures on a page.
What would you get if you designed the scientific paper from scratch today? …
Software is a dynamic medium; paper isn’t. When you think in those terms it does seem strange that research like Strogatz’s, the study of dynamical systems, is so often being shared on paper …
I spoke to Theodore Gray, who has since left Wolfram Research to become a full-time writer. He said that his work on the notebook was in part motivated by the feeling, well formed already by the early 1990s, “that obviously all scientific communication, all technical papers that involve any sort of data or mathematics or modeling or graphs or plots or anything like that, obviously don’t belong on paper. That was just completely obvious in, let’s say, 1990,” he said. …”