Extracting research evidence from publications | EMBL-EBI Train online

“Bioinformaticians are routinely handling big data, including DNA, RNA, and protein sequence information. It’s time to treat biomedical literature as a dataset and extract valuable facts hidden in the millions of scientific papers. This webinar demonstrates how to access text-mined literature evidence using the Europe PMC Annotations API. We highlight several use cases, including linking diseases with potential treatment targets, or identifying which protein structures are cited along with a gene mutation.

This webinar took place on 5 March 2018 and is for wet-lab researchers and bioinformaticians who want to access scientific literature and data programmatically. Some prior knowledge of programmatic access and common programming languages is recommended.

The webinar covers:

  • Available data (annotation types and sources) (1:50)
  • API operations, parameters, and web service outputs (8:08)
  • Use case examples (16:56)
  • How to get help (24:16)

You can download the slides from this webinar here. You can learn more about Europe PMC in our Europe PMC: Quick tour and our previous webinar Europe PMC, programmatically.

For documentation, help and support visit the Europe PMC help pages or download the developer-friendly web service guide. For web service related questions you can get in touch via the Google group or contact the help desk at helpdesk [at] europepmc.org.”
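As a minimal sketch of the kind of programmatic access the webinar describes, the snippet below builds a request URL for the Europe PMC Annotations API. The endpoint and parameter names (`articleIds`, `type`, `format`) are taken from the Europe PMC developer documentation as I understand it; verify them against the current web service guide before relying on them.

```python
from urllib.parse import urlencode

# Assumed base endpoint of the Europe PMC Annotations API; check the
# developer guide for the authoritative URL and parameter list.
ANNOTATIONS_API = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"

def build_annotations_url(article_ids, annotation_type, fmt="JSON"):
    """Build a request URL for text-mined annotations on the given articles.

    article_ids: list of source-prefixed IDs, e.g. ["MED:28902829"].
    annotation_type: an annotation category such as "Diseases".
    """
    query = urlencode({
        "articleIds": ",".join(article_ids),
        "type": annotation_type,
        "format": fmt,
    })
    return f"{ANNOTATIONS_API}?{query}"

url = build_annotations_url(["MED:28902829"], "Diseases")
# The URL can then be fetched with urllib.request.urlopen(url) and the
# JSON response parsed to list the annotated entities per article.
```

Building the URL separately from fetching it keeps the example runnable offline and makes it easy to inspect exactly what is sent to the service.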


“Knowtro has:

  • Identified elements of knowledge shared across research disciplines and mapped the elements critical to the successful transfer of knowledge from document to user.
  • Built a technology-facilitated process whereby complex analyses can be distilled for ease of discovery and use. Findings from published research papers in only the top academic journals are added to the platform daily.
  • Implemented a search results display feature that (1) uses consistent, logical expressions about research findings rather than happenstance excerpts of text, and (2) prioritizes results not by popularity, but according to validity and usefulness (e.g., research design). …”

Release ‘open’ data from their PDF prisons using tabulizer | R-bloggers

“As a political scientist who regularly encounters so-called “open data” in PDFs, this problem is particularly irritating. PDFs may have “portable” in their name, making them display consistently on various platforms, but that portability means any information contained in a PDF is irritatingly difficult to extract computationally.”

Scraping Scientific Web Repositories: Challenges and Solutions for Automated Content Extraction

Abstract: “Aside from improving the visibility and accessibility of scientific publications, many scientific Web repositories also assess researchers’ quantitative and qualitative publication performance, e.g., by displaying metrics such as the h-index. These metrics have become important for research institutions and other stakeholders to support impactful decision making processes such as hiring or funding decisions. However, scientific Web repositories typically offer only simple performance metrics and limited analysis options. Moreover, the data and algorithms to compute performance metrics are usually not published. Hence, it is not transparent or verifiable which publications the systems include in the computation and how the systems rank the results. Many researchers are interested in accessing the underlying scientometric raw data to increase the transparency of these systems. In this paper, we discuss the challenges and present strategies to programmatically access such data in scientific Web repositories. We demonstrate the strategies as part of an open source tool (MIT license) that allows research performance comparisons based on Google Scholar data. We would like to emphasize that the scraper included in the tool should only be used if consent was given by the operator of a repository. In our experience, consent is often given if the research goals are clearly explained and the project is of a non-commercial nature.”
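The h-index mentioned in the abstract has a standard definition: the largest h such that the researcher has at least h publications with h or more citations each. Given the scraped raw citation counts the paper argues for, computing it is straightforward; this is a generic implementation, not code from the paper's tool.

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    h = 0
    # Sort descending; the h-index is the last rank i where the i-th most
    # cited paper still has at least i citations.
    for i, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= i:
            h = i
        else:
            break
    return h

h_index([10, 8, 5, 4, 3])  # -> 4: four papers each have >= 4 citations
```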

Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion

Abstract:  Recent years have witnessed a proliferation of large-scale knowledge bases, including Wikipedia, Freebase, YAGO, Microsoft’s Satori, and Google’s Knowledge Graph. To increase the scale even further, we need to explore automatic methods for constructing knowledge bases. Previous approaches have primarily focused on text-based extraction, which can be very noisy. Here we introduce Knowledge Vault, a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories. We employ supervised machine learning methods for fusing these distinct information sources. The Knowledge Vault is substantially bigger than any previously published structured knowledge repository, and features a probabilistic inference system that computes calibrated probabilities of fact correctness. We report the results of multiple studies that explore the relative utility of the different information sources and extraction methods.
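Knowledge Vault's actual fusion uses supervised machine learning over many features, as the abstract states. As a much simpler illustration of the underlying idea of turning several noisy extraction signals into one probability, a noisy-OR model assumes each extractor independently supports the fact, so the fact is false only if every piece of evidence fails. This is an illustrative stand-in, not the paper's method.

```python
from math import prod

def noisy_or(confidences):
    """Fuse independent per-extractor confidences for the same candidate fact.

    Each value is the probability that that extractor's evidence alone makes
    the fact true; under the noisy-OR independence assumption the fused
    probability is one minus the chance that every signal fails.
    """
    return 1.0 - prod(1.0 - p for p in confidences)

noisy_or([0.6, 0.5])  # -> 0.8, i.e. 1 - (0.4 * 0.5)
```

Note that noisy-OR fused scores are not calibrated the way Knowledge Vault's are; calibration requires fitting against labeled data.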

StrepHit: Wikidata Statements Validation via References

“StrepHit (pronounced “strep hit”, means “Statement? reference it!”)[1] is a Natural Language Processing pipeline that harvests structured data from raw text and produces Wikidata statements with reference URLs. Its datasets will feed the primary sources tool.[2]

In this way, we believe StrepHit will dramatically improve the data quality of Wikidata through a reference suggestion mechanism for statement validation, and will help Wikidata to become the gold-standard hub of the Open Data landscape….”
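The shape of StrepHit's output, a claim plus the URL that supports it, can be sketched as a small data structure. The field names here are illustrative rather than StrepHit's actual serialization; the Q- and P-identifiers are how Wikidata names items and properties, and the reference URL is a placeholder.

```python
from dataclasses import dataclass

@dataclass
class Statement:
    """A referenced claim in the spirit of StrepHit's Wikidata output.

    Wikidata identifies items with Q-IDs and properties with P-IDs; the
    reference URL points at the source text supporting the claim.
    """
    subject: str        # item Q-ID, e.g. "Q937" (Albert Einstein)
    predicate: str      # property P-ID, e.g. "P69" (educated at)
    value: str          # item Q-ID, e.g. "Q11942" (ETH Zurich)
    reference_url: str  # supporting source (placeholder URL below)

stmt = Statement("Q937", "P69", "Q11942",
                 "https://example.org/einstein-biography")
```

Pairing every statement with its source URL is exactly what enables the reference-suggestion workflow the quote describes: a human reviewer in the primary sources tool can check the cited page before the statement enters Wikidata.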