“There’s a vast trove of science out there locked inside the PDF format. From preprints to peer-reviewed literature and historical research, millions of scientific manuscripts today can only be found in a print-era format that is effectively inaccessible to the web of interconnected online services and APIs that are increasingly becoming the digital scaffold of today’s research infrastructure…. Extracting key information from PDF files isn’t trivial…. It would therefore certainly be useful to be able to extract all key data from manuscript PDFs and store it in a more accessible, more reusable format such as XML (of the publishing industry standard JATS variety or otherwise). This would allow for the flexible conversion of the original manuscript into different forms, from mobile-friendly layouts to enhanced views like eLife’s side-by-side view (through eLife Lens). It will also make the research mineable and API-accessible to any number of tools, services and applications. From advanced search tools to the contextual presentation of semantic tags based on users’ interests, and from cross-domain mash-ups showing correlations between different papers to novel applications like ScienceFair, a move away from PDF and toward a more open and flexible format like XML would unlock a multitude of use cases for the discovery and reuse of existing research…. We are embarking on a project to build on these existing open-source tools, and to improve the accuracy of the XML output. One aim of the project is to combine some of the existing tools in a modular PDF-to-XML conversion pipeline that achieves a better overall conversion result compared to using individual tools on their own.
In addition, we are experimenting with a different approach to the problem: using computer vision to identify key components of the scientific manuscript in PDF format…. To this end, we will be collaborating with other publishers to collate a broad corpus of valid PDF/XML pairs to help train and test our neural networks….”
“As a political scientist who regularly encounters so-called ‘open data’ in PDFs, this problem is particularly irritating. PDFs may have ‘portable’ in their name, making them display consistently on various platforms, but that portability means any information contained in a PDF is irritatingly difficult to extract computationally.”
“As with all good innovators, Peter [Krautzberger, project lead for MathJax] is frustrated. He feels, for example, that advocates of open science focus heavily on sharing of supposedly neutral data, but are still not able to see beyond the PDF. For him open science should be more about how the Web can facilitate communications….”
“I wonder why most publication venues don’t systematically make the LaTeX source for published papers available? (which implies systematically asking authors for the LaTeX source)
LaTeX source is more machine readable than PDFs, and makes it easier for humans to reuse parts of it (e.g. math equations or figures), amongst other advantages. I fail to see any downside….”
“Nature Publishing Group, part of Springer Nature, has announced the results of its ground-breaking 12-month content sharing initiative to support collaborative research. The trial has concluded with positive results and the initiative to offer on-platform sharing of the full text of nature.com articles using ReadCube’s enhanced PDF technology will continue indefinitely.
In December 2014, a 12-month content sharing trial was set up to enable subscribers to 49 journals on nature.com to legitimately and conveniently share the full text of articles of interest with colleagues without a subscription via a shareable web link on nature.com, enabled by publishing technology company, ReadCube. The trial was also extended to 100 media outlets and blogs around the world that report on the findings of articles published on nature.com, allowing them to provide their own readers with a link to a full text, read-only view of the original scientific paper….”