Mellon Foundation grant supports development of a plan for using artificial intelligence to plumb the National Archives | Virginia Tech Daily | Virginia Tech

“A key outcome of the planning workshop will be the design of a subsequent pilot project aimed at enhancing access to National Archive collections, including the creation of new tools, techniques, and practices….”

Europe PMC: unlocking the potential of COVID-19 preprints | European Bioinformatics Institute


“Europe PMC is now indexing full-text preprints related to the COVID-19 pandemic and the SARS-CoV-2 virus, as well as the underlying data
The project will make COVID-19 scientific literature available as fast as possible in a single repository, in a format that allows text mining
Researchers and healthcare professionals will be able to access and reuse preprints more easily, accelerating research into better treatments or a vaccine….”

S2ORC: The Semantic Scholar Open Research Corpus

“S2ORC is a general-purpose corpus for NLP and text mining research over scientific papers.

We’ve curated a unified resource that combines aspects of citation graphs (i.e. rich paper metadata, abstracts, citation edges) with a full text corpus that preserves important scientific paper structure (i.e. sections, inline citation mentions, references to tables and figures).
Our corpus covers 136M+ paper nodes with 12.7M+ full text papers and connected by 467M+ citation edges by unifying data from many different sources covering many different academic disciplines and identifying open-access papers using services like Unpaywall. …”
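
To make the combination of a citation graph and a full-text corpus concrete, here is a minimal sketch of how the two could sit side by side. It is not the S2ORC schema: the `Paper` class, its field names, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Paper:
    # Hypothetical fields, not the real S2ORC record layout.
    paper_id: str
    title: str
    # Full text is optional: only a subset of paper nodes carry it.
    sections: dict = field(default_factory=dict)


def build_citation_index(edges):
    """Map each cited paper id to the sorted list of paper ids citing it."""
    cited_by = defaultdict(list)
    for citing, cited in edges:
        cited_by[cited].append(citing)
    return {pid: sorted(citers) for pid, citers in cited_by.items()}


# Toy data: three paper nodes, one with full text, and three citation edges.
papers = {
    "p1": Paper("p1", "A toy paper", {"Introduction": "Prior work [p2] ..."}),
    "p2": Paper("p2", "An earlier toy paper"),
    "p3": Paper("p3", "Another toy paper"),
}
edges = [("p1", "p2"), ("p3", "p2"), ("p3", "p1")]

cited_by = build_citation_index(edges)
full_text_ids = [pid for pid, paper in papers.items() if paper.sections]
```

Keeping the edges separate from the paper records means metadata-only nodes can still take part in the graph even when no full text is available for them.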

Epiviz File Server: Query, Transform and Interactively Explore Data From Indexed Genomic Files – PubMed

Abstract:  Genomic data repositories like The Cancer Genome Atlas (TCGA), Encyclopedia of DNA Elements (ENCODE), Bioconductor’s AnnotationHub and ExperimentHub etc., provide public access to large amounts of genomic data as flat files. Researchers often download a subset of data files from these repositories to perform exploratory data analysis. We developed Epiviz File Server, a Python library that implements an in-situ data query system for local or remotely hosted indexed genomic files, not only for visualization but also data transformation. The File Server library decouples data retrieval and transformation from specific visualization and analysis tools and provides an abstract interface to define computations independent of the location, format or structure of the file. We demonstrate the File Server in two use cases: 1) integration with Galaxy workflows and 2) using Epiviz to create a custom genome browser from the Epigenome Roadmap dataset.
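
The idea of decoupling retrieval from transformation can be sketched in a few lines. This is not the Epiviz File Server API: `IndexedTrack`, `mean_signal`, and the records are hypothetical, and the in-memory sorted list only stands in for the on-disk index a real indexed genomic file would provide.

```python
from bisect import bisect_left


class IndexedTrack:
    """Toy stand-in for an indexed genomic file (hypothetical class)."""

    def __init__(self, records):
        # records: iterable of (chrom, start, end, value), half-open intervals.
        self._by_chrom = {}
        for chrom, start, end, value in sorted(records):
            self._by_chrom.setdefault(chrom, []).append((start, end, value))

    def query(self, chrom, start, end):
        """Return records on chrom that overlap the half-open range [start, end)."""
        rows = self._by_chrom.get(chrom, [])
        starts = [row[0] for row in rows]
        # Records starting at or after `end` cannot overlap the range.
        upper = bisect_left(starts, end)
        return [row for row in rows[:upper] if row[1] > start]


def mean_signal(track, chrom, start, end):
    """A transformation defined only against query(), not against a file format."""
    hits = track.query(chrom, start, end)
    if not hits:
        return None
    return sum(value for _, _, value in hits) / len(hits)


track = IndexedTrack([
    ("chr1", 100, 200, 2.0),
    ("chr1", 150, 300, 4.0),
    ("chr1", 500, 600, 9.0),
    ("chr2", 100, 200, 1.0),
])
result = mean_signal(track, "chr1", 180, 250)
```

Because `mean_signal` depends only on `query`, the same computation would work unchanged if the track were backed by a local or a remotely hosted file.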

JMIR – Two Decades of Research Using Taiwan’s National Health Insurance Claims Data: Bibliometric and Text Mining Analysis on PubMed | Sung | Journal of Medical Internet Research

Abstract:  Background: Studies using Taiwan’s National Health Insurance (NHI) claims data have expanded rapidly both in quantity and quality during the first decade following the first study published in 2000. However, some of these studies were criticized for being merely data-dredging studies rather than hypothesis-driven. In addition, the use of claims data without the explicit authorization from individual patients has incurred litigation.

Objective: This study aimed to investigate whether the research output during the second decade after the release of the NHI claims database continued to grow, to explore how the emergence of open access mega journals (OAMJs) and the lawsuit against the use of this database affected research topics and publication volume, and to discuss the underlying reasons.

Methods: PubMed was used to locate publications based on NHI claims data between 1996 and 2017. Concept extraction using MetaMap was employed to mine research topics from article titles. Research trends were analyzed from various aspects, including publication amount, journals, research topics and types, and cooperation between authors.

Results: A total of 4473 articles were identified. A rapid growth in publications was witnessed from 2000 to 2015, followed by a plateau. Diabetes, stroke, and dementia were the top 3 most popular research topics whereas statin therapy, metformin, and Chinese herbal medicine were the most investigated interventions. Approximately one-third of the articles were published in open access journals. Studies with two or more medical conditions, but without any intervention, were the most common study type. Studies of this type tended to be contributed by prolific authors and published in OAMJs.

Conclusions: The growth in publication volume during the second decade after the release of the NHI claims database was different from that during the first decade. OAMJs appeared to provide fertile soil for the rapid growth of research based on NHI claims data, in particular for those studies with two or more medical conditions in the article title. A halt in the growth of publication volume was observed after the use of NHI claims data for research purposes had been restricted in response to legal controversy. More efforts are needed to improve the impact of knowledge gained from NHI claims data on medical decisions and policy making.
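
The kind of analysis described in the Methods (mining topics from article titles and tracking publication volume by year) can be illustrated with a small sketch. It does not use MetaMap: a hypothetical keyword table, `TOPIC_TERMS`, stands in for concept extraction, and the titles and years are invented toy data, not records from the study.

```python
from collections import Counter

# Hypothetical keyword-to-topic table standing in for real concept extraction.
TOPIC_TERMS = {
    "diabetes": "Diabetes",
    "stroke": "Stroke",
    "dementia": "Dementia",
}


def extract_topics(title):
    """Return the sorted topics whose keyword appears in the title."""
    lowered = title.lower()
    return sorted({topic for term, topic in TOPIC_TERMS.items() if term in lowered})


def summarize(records):
    """Count publications per year and topic mentions across titles."""
    per_year = Counter(year for year, _ in records)
    per_topic = Counter(
        topic for _, title in records for topic in extract_topics(title)
    )
    return per_year, per_topic


# Invented toy records: (publication year, article title).
records = [
    (2010, "Risk of stroke in patients with diabetes: a cohort study"),
    (2010, "Dementia incidence among older adults"),
    (2012, "Diabetes and dementia: a nationwide analysis"),
]
per_year, per_topic = summarize(records)
```

Substring matching is far cruder than concept extraction, since it misses synonyms and abbreviations, but it shows how per-year and per-topic counts fall out of the same pass over the titles.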

Ten tips for a text-mining-ready article: How to improve automated discoverability and interpretability

Abstract:  Data-driven research in biomedical science requires structured, computable data. Increasingly, these data are created with support from automated text mining. Text-mining tools have rapidly matured: although not perfect, they now frequently provide outstanding results. We describe 10 straightforward writing tips—and a web tool, PubReCheck—guiding authors to help address the most common cases that remain difficult for text-mining tools. We anticipate these guides will help authors’ work be found more readily and used more widely, ultimately increasing the impact of their work and the overall benefit to both authors and readers. PubReCheck is available at