ScienceBeam – using computer vision to extract PDF data | Labs | eLife

“There’s a vast trove of science out there locked inside the PDF format. From preprints to peer-reviewed literature and historical research, millions of scientific manuscripts today can only be found in a print-era format that is effectively inaccessible to the web of interconnected online services and APIs that are increasingly becoming the digital scaffold of today’s research infrastructure….Extracting key information from PDF files isn’t trivial. …It would therefore certainly be useful to be able to extract all key data from manuscript PDFs and store it in a more accessible, more reusable format such as XML (of the publishing industry standard JATS variety or otherwise). This would allow for the flexible conversion of the original manuscript into different forms, from mobile-friendly layouts to enhanced views like eLife’s side-by-side view (through eLife Lens). It will also make the research mineable and API-accessible to any number of tools, services and applications. From advanced search tools to the contextual presentation of semantic tags based on users’ interests, and from cross-domain mash-ups showing correlations between different papers to novel applications like ScienceFair, a move away from PDF and toward a more open and flexible format like XML would unlock a multitude of use cases for the discovery and reuse of existing research….We are embarking on a project to build on these existing open-source tools, and to improve the accuracy of the XML output. One aim of the project is to combine some of the existing tools in a modular PDF-to-XML conversion pipeline that achieves a better overall conversion result compared to using individual tools on their own. In addition, we are experimenting with a different approach to the problem: using computer vision to identify key components of the scientific manuscript in PDF format….To this end, we will be collaborating with other publishers to collate a broad corpus of valid PDF/XML pairs to help train and test our neural networks….”
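As an illustration of the modular, step-chaining design the post describes, here is a minimal sketch in which each conversion stage is a callable that enriches a shared document state. The stage names, data shapes and output are hypothetical stand-ins, not the actual components of eLife's pipeline.

```python
# Illustrative only: a minimal "modular pipeline" pattern for PDF-to-XML conversion.
# Each stage is a callable that enriches a shared document state; the stages below
# are hypothetical placeholders, not real ScienceBeam components.
from typing import Any, Callable, Dict, List

Stage = Callable[[Dict[str, Any]], Dict[str, Any]]

def run_pipeline(pdf_path: str, stages: List[Stage]) -> Dict[str, Any]:
    """Pass a mutable document state through each conversion stage in order."""
    state: Dict[str, Any] = {"pdf_path": pdf_path}
    for stage in stages:
        state = stage(state)
    return state

# Placeholder stages standing in for real tools (e.g. a layout detector or a
# reference parser); each would normally call out to an external converter.
def detect_layout(state: Dict[str, Any]) -> Dict[str, Any]:
    state["regions"] = []  # e.g. bounding boxes labelled "title", "figure", ...
    return state

def build_jats(state: Dict[str, Any]) -> Dict[str, Any]:
    state["jats_xml"] = "<article/>"  # would be assembled from the labelled regions
    return state

if __name__ == "__main__":
    result = run_pipeline("manuscript.pdf", [detect_layout, build_jats])
    print(result["jats_xml"])
```

The point of the pattern is that individual tools can be swapped in and out of the chain and compared against the same PDF/XML test pairs.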

Panelists discuss JATS for Reuse for OASPA Webinar – OASPA

“JATS4R (JATS for Reuse) is an inclusive group of publishers, vendors, and other interested organisations who use the NISO Journal Article Tag Suite (JATS) XML standard. On Monday 13th March, 2017, OASPA hosted a webinar covering the history, goals and recent work of JATS4R and the importance of participation and outreach around the initiative, and providing a platform for discussions on how it can be advanced in the future.”

NIH Manuscript Collection Optimized for Text-Mining and More

“NIH-supported scientists have made over 300,000 author manuscripts available on PubMed Central (PMC) since 2008. Now, NIH is making these papers accessible to the public in a format that will allow robust text analyses.

You can download the entire PMC collection of NIH-supported author manuscripts as a package in either XML or plain text formats….”
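A rough sketch of how such a bulk package might be consumed for text analysis, assuming the XML packages are published on the NCBI FTP site; the package URL below is a placeholder and should be checked against PMC's documentation for the author manuscript dataset.

```python
# Sketch: fetch one bulk author-manuscript package and index article titles.
# The URL is an assumption about the PMC FTP layout; verify the current package
# names on the PMC author manuscript dataset page before use.
import tarfile
import urllib.request
import xml.etree.ElementTree as ET

PACKAGE_URL = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/manuscript/xml.all.tar.gz"  # placeholder name

def iter_titles(package_path: str):
    """Yield (member name, article title) for each JATS XML file in the package."""
    with tarfile.open(package_path, "r:gz") as tar:
        for member in tar:
            if not member.name.endswith(".xml"):
                continue
            fh = tar.extractfile(member)
            if fh is None:
                continue
            try:
                root = ET.parse(fh).getroot()
            except ET.ParseError:
                continue
            yield member.name, root.findtext(".//article-title") or "(no title)"

if __name__ == "__main__":
    local_path, _ = urllib.request.urlretrieve(PACKAGE_URL, "manuscripts.tar.gz")
    for name, title in iter_titles(local_path):
        print(name, "-", title)
```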

Inconsistent XML as a Barrier to Reuse of Open Access Content – Journal Article Tag Suite Conference (JATS-Con) Proceedings 2013 – NCBI Bookshelf

Abstract:  In this paper, we will describe the current state of some of the tagging of articles within the PMC Open Access subset. As a case study, we will use our experiences developing the Open Access Media Importer, a tool to harvest content from the OA subset for automated upload to Wikimedia Commons.

Tagging inconsistencies stretch across several aspects of the articles, ranging from licensing to keywords to the media types of supplementary materials. While all of these complicate large-scale reuse, the unclear licensing statements had the greatest impact, requiring us to implement text mining-like algorithms in order to accurately determine whether or not specific content was compatible with reuse on Wikimedia Commons.
Besides presenting examples of incorrectly tagged XML from a range of publishers, we will also explore past and current efforts towards standardization of license tagging, and we will describe a set of recommendations related to tagging practices of certain data, to ensure that it is both compatible with existing standards, and consistent and machine-readable.
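The licensing problem the authors describe can be illustrated with a short sketch: prefer a machine-readable @xlink:href on the JATS <license> element, and fall back to free-text matching when it is absent or unclear. The regular expression and the hard-coded URI fragment are illustrative only, not the Open Access Media Importer's actual logic.

```python
# Sketch of license detection over JATS XML: check for a machine-readable license
# URI first, then fall back to scanning the license text. Illustrative only.
import re
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
FREE_LICENSE_HINT = re.compile(r"creative\s*commons\s*attribution", re.IGNORECASE)

def detect_license(jats_path: str) -> str:
    root = ET.parse(jats_path).getroot()
    for lic in root.iter("license"):
        # Well-tagged articles point at the license with an xlink:href URI.
        href = lic.get(XLINK_HREF, "")
        if "creativecommons.org/licenses/by" in href:
            return href
        # Otherwise fall back to free-text matching inside the license element.
        if FREE_LICENSE_HINT.search(" ".join(lic.itertext())):
            return "cc-by (inferred from license text)"
    return "unknown"

if __name__ == "__main__":
    print(detect_license("example-article.xml"))  # placeholder file name
```

When the URI is missing, the fallback is exactly the kind of brittle, text-mining-like heuristic the paper argues consistent tagging would make unnecessary.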

JATS4R – JATS for Reuse

“JATS4R aims to help standardise the XML used in scientific publishing workflows. We take specific areas of interest (such as licenses or author contributions) and work to define best practice tagging guidelines, along with tools that can help publishers identify whether their content is compliant with those best practices.

By doing this, we hope to make the research literature more accessible for data miners, and to lower costs when content needs to be exchanged or moved at scale, by bringing more consistency to the way the literature is tagged….”
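A minimal sketch of what such a compliance check could look like: small rules that inspect a parsed JATS article and report human-readable failures. The two rules shown are simplified illustrations, not JATS4R's actual recommendations or validation tooling.

```python
# Sketch of a JATS4R-style compliance check: each rule yields a failure message.
# The rules are simplified illustrations of best-practice tagging checks.
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

def check_has_permissions(root):
    if root.find(".//article-meta/permissions") is None:
        yield "article-meta is missing a <permissions> element"

def check_license_uri(root):
    for lic in root.iter("license"):
        if not lic.get(XLINK_HREF, "").startswith("http"):
            yield "<license> has no machine-readable @xlink:href URI"

RULES = [check_has_permissions, check_license_uri]

def validate(jats_path: str):
    root = ET.parse(jats_path).getroot()
    return [message for rule in RULES for message in rule(root)]

if __name__ == "__main__":
    for problem in validate("example-article.xml"):  # placeholder file name
        print("FAIL:", problem)
```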

Inconsistent XML as a Barrier to Reuse of Open Access Content

Abstract:  In our paper, we described the current state of some of the tagging of articles within the PMC Open Access subset. As a case study, we used our experiences developing the Open Access Media Importer, a tool to harvest content from the OA subset and automatically upload it to Wikimedia Commons.

Tagging inconsistencies stretch across several aspects of the articles, ranging from licensing to keywords to the media types of supplementary materials. While all of these complicate large-scale reuse, the unclear licensing statements required us to implement text mining-like algorithms in order to accurately determine whether or not specific content was compatible with reuse on Wikimedia Commons.
Besides presenting examples of incorrectly tagged XML from a range of publishers, we also explored past and current efforts towards standardization of license tagging, and we described a set of recommendations for generators of content on how best to tag certain data so that it is both compatible with existing standards, and consistent and machine-readable.