Digital Humanities Research Platform

“The Academia Sinica Digital Humanities Research Platform develops digital tools to meet the demands of humanities research, assisting scholars in upgrading the quality of their research. We hope to integrate researchers, research data, and research tools to broaden the scope of research and cut down research time. The Platform provides a comprehensive research environment with cloud computing services, offering all the data and tools scholars require. Researchers can upload texts and authority files, or use others’ open texts and authority files available on the platform. Authority terms possess both manual and automatic text tagging functions, and can be hierarchically categorized. Once text tagging is complete, you can calculate authority term and N-gram statistics, or conduct term co-occurrence analysis, and then present results through data visualization methods such as statistical charts, word clouds, social analysis graphs, and maps. Furthermore, Boolean search, word proximity search, and statistical filtering are available, enabling researchers to easily carry out textual analysis.”
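
To make the N-gram statistics and term co-occurrence analysis mentioned in the excerpt concrete, here is a minimal Python sketch. It is not the platform's implementation: the function names, the whitespace tokenization, and the sample passages are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cooccurrence_counts(passages, terms):
    """Count, for each pair of terms, how many passages mention both."""
    pairs = Counter()
    for passage in passages:
        # Plain substring matching stands in for real authority-term tagging.
        present = sorted(t for t in terms if t in passage)
        pairs.update(combinations(present, 2))
    return pairs


# Hypothetical sample data, not taken from the platform.
tokens = "the king met the envoy and the king left".split()
bigrams = ngram_counts(tokens, 2)

passages = [
    "Li Bai wrote to Du Fu from Chang'an",
    "Du Fu left Chang'an in autumn",
    "Li Bai travelled south",
]
terms = {"Li Bai", "Du Fu", "Chang'an"}
pairs = cooccurrence_counts(passages, terms)

print(bigrams.most_common(1))
print(pairs.most_common())
```

Counts like these are the kind of raw numbers that statistical charts, word clouds, and network graphs would then visualize.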

Opscidia – Free and open access scholarly publishing

“Opscidia is a novel platform for free and Open Access scholarly communication. 

The principle of our platform is to host scientific journals led by an academic editorial committee. Hence, the journal is run by its editorial board while Opscidia provides the software infrastructure, hosts the journal and assists the communication of the journal free of charge….”

Authors Alliance Petitions for New Exemption to Section 1201 of the DMCA to enable Text and Data Mining Research | Authors Alliance

“Authors Alliance, joined by the Library Copyright Alliance and the American Association of University Professors, filed a petition with the Copyright Office for a new three-year exemption to the DMCA as part of the Copyright Office’s eighth triennial rulemaking process. Our proposed exemption would allow researchers to bypass DRM measures in order to conduct text and data mining research on both literary works that are published electronically and motion pictures. Further details can be found in the full text of the petition, available here.

Text and data mining allows researchers and others to gain new insights into language and culture, scientific inquiry, and civic participation. For example, text and data mining can be used to examine the evolution of language over time or to identify important but overlooked findings in scientific papers….”

Evolving our support for text-and-data mining – Crossref

“Many researchers want to carry out analysis and extraction of information from large sets of data, such as journal articles and other scholarly content. Methods such as screen-scraping are error-prone, place too much strain on content sites and may be unrepeatable or break if site layouts change. Providing researchers with automated access to the full-text content via DOIs and Crossref metadata reduces these problems, allowing for easy deduplication and reproducibility. Supporting text and data mining echoes our mission to make research outputs easy to find, cite, link, assess, and reuse.

In 2013 Crossref embarked on a project to better support Crossref members and researchers with Text and Data Mining requests and access. There were two main parts to the project:

To collect and make available full-text links and publisher TDM license links in the metadata.

To provide a service (TDM click-through service) for Crossref members to post their additional TDM terms and conditions and for researchers to access, review and accept these terms….

To date, 37.5 million works registered with Crossref have both full-text links and TDM license information. We continue to encourage all members to include full-text links and license information in the metadata they register to assist researchers with TDM. You can see how each member is doing via its participation report (e.g. Wiley’s)….

Members are also making subscription content available for text mining (temporarily or otherwise) for specific purposes, such as to help the research community with its response to COVID-19. Back in April we highlighted how this can be achieved by including:

A “free to read” element in the access indicators section of publisher metadata indicating that the content is being made available free-of-charge (gratis)

An assertion element indicating that the content being made available is available free-of-charge….”
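
As a rough illustration of how a researcher might use the full-text links and license information described in the excerpt, the Python sketch below reads them from a Crossref work record. The field names (`link`, `intended-application`, `license`, `URL`) reflect the JSON returned by the Crossref REST API as I understand it, so check them against the current API documentation. The sample record and its URLs are invented, and `fetch_work` is defined but not called.

```python
import json
import urllib.parse
import urllib.request


def fetch_work(doi):
    """Fetch one work record from the Crossref REST API (needs network access)."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url) as response:
        return json.load(response)["message"]


def tdm_links(work):
    """Return full-text URLs that the publisher marked for text mining."""
    return [
        link["URL"]
        for link in work.get("link", [])
        if link.get("intended-application") == "text-mining"
    ]


def license_urls(work):
    """Return the license URLs attached to the work, if any."""
    return [lic["URL"] for lic in work.get("license", [])]


# Invented record shaped like a Crossref "message" object, for illustration only.
sample_record = {
    "DOI": "10.5555/example",
    "link": [
        {
            "URL": "https://example.org/fulltext.xml",
            "content-type": "application/xml",
            "intended-application": "text-mining",
        },
        {
            "URL": "https://example.org/fulltext.pdf",
            "content-type": "application/pdf",
            "intended-application": "similarity-checking",
        },
    ],
    "license": [{"URL": "https://creativecommons.org/licenses/by/4.0/"}],
}

print(tdm_links(sample_record))
print(license_urls(sample_record))
```

Checking the license URL before downloading the full text is the step that the metadata-based approach is meant to make routine.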

Choice360 | Advance Your University’s Research Mission with Text and Data Mining

“Research is evolving from all angles – and academic libraries must ensure their faculty and students have access to the latest content and technology they need to keep up.

In this webinar, you’ll hear from John Cocklin at Dartmouth College and Caroline Muglia at the University of Southern California – two academic librarians who are playing a critical part in advancing their university’s research mission by investing in text and data mining (TDM)….”

Mellon Foundation grant supports development of a plan for using artificial intelligence to plumb the National Archives | Virginia Tech Daily | Virginia Tech

“A key outcome of the planning workshop will be the design of a subsequent pilot project aimed at enhancing access to National Archive collections, including the creation of new tools, techniques, and practices….”

Europe PMC: unlocking the potential of COVID-19 preprints | European Bioinformatics Institute

“Europe PMC is now indexing full-text preprints related to the COVID-19 pandemic and the SARS-CoV-2 virus, as well as the underlying data.
The project will make COVID-19 scientific literature available as fast as possible in a single repository, in a format that allows text mining.
Researchers and healthcare professionals will be able to access and reuse preprints more easily, accelerating research into better treatments or a vaccine….”

S2ORC: The Semantic Scholar Open Research Corpus

“S2ORC is a general-purpose corpus for NLP and text mining research over scientific papers.

We’ve curated a unified resource that combines aspects of citation graphs (i.e. rich paper metadata, abstracts, citation edges) with a full text corpus that preserves important scientific paper structure (i.e. sections, inline citation mentions, references to tables and figures).
Our corpus covers 136M+ paper nodes with 12.7M+ full-text papers, connected by 467M+ citation edges, by unifying data from many different sources covering many different academic disciplines and identifying open-access papers using services like Unpaywall.…”
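
The combination the excerpt describes, a citation graph alongside structured full text, can be pictured with a small Python sketch. This is not the S2ORC schema: the `Paper` class, its field names, and the three toy records are hypothetical and only show how paper nodes, citation edges, and section-level text might sit together.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Paper:
    paper_id: str
    title: str
    # Section heading -> section text; left empty when no full text is available.
    sections: dict = field(default_factory=dict)


def inbound_counts(edges):
    """Count how many times each paper is cited, given (citing, cited) pairs."""
    return Counter(cited for _, cited in edges)


def with_full_text(papers):
    """Return the IDs of papers that carry parsed full text."""
    return [p.paper_id for p in papers.values() if p.sections]


# Toy data, invented for illustration.
papers = {
    "p1": Paper("p1", "A survey", {"Introduction": "...", "Methods": "..."}),
    "p2": Paper("p2", "A dataset"),
    "p3": Paper("p3", "A follow-up", {"Introduction": "..."}),
}
citation_edges = [("p1", "p2"), ("p3", "p2"), ("p3", "p1")]

print(inbound_counts(citation_edges))
print(with_full_text(papers))
```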