“The Third Research Excellence Framework, scheduled for the mid-2020s, now has a mandate for open access books. Despite calls from the digitally enlightened, however, most humanities long-form writing remains very much ensconced within the traditions and economics (both symbolic and financial) of the printed book. In this talk, I will discuss the challenges of a migration from conventional books to an open access model and the range of approaches that are currently being taken.
In the age of data mining, distant reading, and cultural analytics, scholars increasingly rely upon automated, algorithm-based procedures in order to parse the exponentially growing databases of digitized textual and visual resources. While these new trends are dramatically shifting the scale of our objects of study, from one book to millions of books, from one painting to millions of images, the most traditional output of humanistic scholarship—the single author monograph—has maintained its institutional pre-eminence in the academic world, while showing the limitations of its printed format. Recent initiatives, such as the AHRC-funded Academic Book of the Future in the UK and the Andrew W. Mellon-funded digital publishing initiative in the USA, have answered the need to envision new forms of scholarly publication on the digital platform, and in particular the need to design and produce a digital equivalent to, or substitute for, the printed monograph. Libraries, academic presses and a number of scholars across a variety of disciplines are participating in this endeavour, debating key questions in the process, such as: What is an academic book? Who are its readers? What can technology do to help make academic books more accessible and sharable without compromising their integrity and durability? Yet, a more fundamental question remains to be answered, as our own idea of what a ‘book’ is (or was) and does (or did) evolves: how can a digital, ‘single-author’ monograph effectively draw from the growing field of digital culture, without losing those characteristics that made it perhaps the most stable form of humanistic culture since the Gutenberg revolution? Our speakers will debate some of these questions and provide their points of view on some of the specific issues involved. After their short presentations, all participants are invited to bring their own ideas about, and experience with, digital publishing to the table.”
“In the longer-term future, one could envision a system where researchers post their scientific contributions; a paper, a single figure, a method, a hypothesis; where we have the potential to make smaller contributions to the global knowledge base and get credit for those contributions in a manner that is more rapid and incremental. This would allow multiple scientists to collaborate and contribute to what we now know as a single paper. Part of the challenge of the next 10 years is the problem of increasing information overload. Journals in the life sciences are aware that preprints have been around in physics for 25 years, and that the existence of preprints does not diminish the need for journals in that field. It is already impossible for a person to read all the relevant literature in their area, and this will only get harder. We need better tools to read and comprehend the literature, and a lot of these tools will be given by innovations in software and machine learning. My hope is that more of the literature is accessible to text and data mining, which will enhance our ability to understand the literature beyond that of a single human reader….”
“The DPLA is launching an open-source tool for fast, large-scale data harvests from OAI repositories. The tool uses a Spark distributed processing engine to speed up and scale up the harvesting operation, and to perform complex analysis of the harvested data. It is helping us improve our internal workflows and provide better service to our hubs. The Spark OAI Harvester is freely available and we hope that others working with interoperable cultural heritage or science data will find uses for it in their own projects.”
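For readers unfamiliar with OAI harvesting, the paging loop at the heart of any OAI-PMH harvest can be sketched in a few lines. This is not the DPLA tool and does not use Spark; it is a minimal single-process illustration, assuming a ListRecords response in the standard OAI-PMH 2.0 namespace. The names `parse_list_records`, `harvest`, and `fake_fetch`, and the two sample pages, are hypothetical and stand in for real network requests.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text.strip()
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
        if el.text
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An absent or empty resumptionToken means the list is complete.
    has_token = token_el is not None and token_el.text
    token = token_el.text.strip() if has_token else None
    return identifiers, token


def harvest(fetch_page):
    """Follow resumption tokens; fetch_page(token) must return XML text."""
    all_ids, token = [], None
    while True:
        ids, token = parse_list_records(fetch_page(token))
        all_ids.extend(ids)
        if token is None:
            return all_ids


# Hypothetical sample pages standing in for a repository's responses.
PAGE_1 = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

PAGE_2 = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:3</identifier></header></record>
    <resumptionToken/>
  </ListRecords>
</OAI-PMH>"""


def fake_fetch(token):
    """Return the first page when no token is given, else the last page."""
    return PAGE_1 if token is None else PAGE_2


if __name__ == "__main__":
    print(harvest(fake_fetch))
```

A distributed harvester would presumably spread the fetching and parsing across workers, but the token-following logic per repository stays sequential because each page names the next.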
“There’s a vast trove of science out there locked inside the PDF format. From preprints to peer-reviewed literature and historical research, millions of scientific manuscripts today can only be found in a print-era format that is effectively inaccessible to the web of interconnected online services and APIs that are increasingly becoming the digital scaffold of today’s research infrastructure….Extracting key information from PDF files isn’t trivial. …It would therefore certainly be useful to be able to extract all key data from manuscript PDFs and store it in a more accessible, more reusable format such as XML (of the publishing industry standard JATS variety or otherwise). This would allow for the flexible conversion of the original manuscript into different forms, from mobile-friendly layouts to enhanced views like eLife’s side-by-side view (through eLife Lens). It will also make the research mineable and API-accessible to any number of tools, services and applications. From advanced search tools to the contextual presentation of semantic tags based on users’ interests, and from cross-domain mash-ups showing correlations between different papers to novel applications like ScienceFair, a move away from PDF and toward a more open and flexible format like XML would unlock a multitude of use cases for the discovery and reuse of existing research….We are embarking on a project to build on these existing open-source tools, and to improve the accuracy of the XML output. One aim of the project is to combine some of the existing tools in a modular PDF-to-XML conversion pipeline that achieves a better overall conversion result compared to using individual tools on their own. 
In addition, we are experimenting with a different approach to the problem: using computer vision to identify key components of the scientific manuscript in PDF format….To this end, we will be collaborating with other publishers to collate a broad corpus of valid PDF/XML pairs to help train and test our neural networks….”
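To make the target of such a conversion concrete, the final serialization step of a PDF-to-XML pipeline might look like the sketch below. It assumes the hard part, extracting the title, authors, and abstract from the PDF, has already been done by some upstream tool and handed over as a plain dictionary. The function name `to_jats_like_xml` and the input keys are hypothetical; the element names follow JATS conventions, but the output is not validated against the JATS schema and this is not the project's own code.

```python
import xml.etree.ElementTree as ET


def to_jats_like_xml(fields):
    """Serialize already-extracted manuscript fields as JATS-flavoured XML.

    `fields` is a hypothetical dict with keys "title", "authors", "abstract".
    """
    article = ET.Element("article")
    front = ET.SubElement(article, "front")
    meta = ET.SubElement(front, "article-meta")

    # Title
    title_group = ET.SubElement(meta, "title-group")
    ET.SubElement(title_group, "article-title").text = fields["title"]

    # Authors, kept as unparsed name strings
    contrib_group = ET.SubElement(meta, "contrib-group")
    for name in fields.get("authors", []):
        contrib = ET.SubElement(contrib_group, "contrib", {"contrib-type": "author"})
        ET.SubElement(contrib, "string-name").text = name

    # Abstract as a single paragraph
    abstract = ET.SubElement(meta, "abstract")
    ET.SubElement(abstract, "p").text = fields.get("abstract", "")

    return ET.tostring(article, encoding="unicode")


if __name__ == "__main__":
    example = {
        "title": "An example manuscript",
        "authors": ["A. Author", "B. Author"],
        "abstract": "A short abstract.",
    }
    print(to_jats_like_xml(example))
```

Keeping serialization separate from extraction fits the modular design described above: each extraction tool can be swapped or combined without touching the code that writes the XML.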
“Poster presented at OAI10, University of Geneva, 21-23 June 2017.”
“The number of scholarly research papers being published is gradually growing; it is estimated that approximately 1.5 million research papers are produced each year and about 4% of them are offered via Open Access journals. The high volume of scientific papers introduces new opportunities for content discoverability and facilitates a growth in various scientific disciplines via text and data mining (TDM). One of the greatest barriers to TDM is caused by the difficulty of programmatically accessing open access content from a wide range of publishers…”
“Quantitative analysis of digitized text represents an exciting and challenging frontier of data science across a broad spectrum of disciplines. From the analysis of physicians’ notes to identify patients with diabetes, to the assessment of global happiness through the analysis of speech on Twitter, patterns in massive text corpora have led to important scientific advancements.
In this course we will cover several central computational and statistical methods for the analysis of text as data. Topics will include the manipulation and summarization of text data, dictionary methods of text analysis, prediction and classification with textual data, document clustering, text reuse measurement, and statistical topic models….”
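Of the topics listed, dictionary methods are the simplest to illustrate: count how often words from a fixed lexicon occur in a text. The sketch below is a minimal version under stated assumptions; the function name `dictionary_score`, the tokenization rule, and the tiny lexicon are all hypothetical and are not taken from the course materials.

```python
import re
from collections import Counter


def dictionary_score(text, lexicon):
    """Count lexicon terms in `text`.

    Returns (hits, share): per-term counts for terms that occur, and the
    fraction of all tokens that belong to the lexicon.
    """
    # Crude tokenization: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    hits = {term: counts[term] for term in lexicon if counts[term] > 0}
    share = sum(hits.values()) / len(tokens) if tokens else 0.0
    return hits, share


if __name__ == "__main__":
    positive_words = {"happy", "joyful", "glad"}  # hypothetical lexicon
    hits, share = dictionary_score("Happy days: a happy, joyful crowd.", positive_words)
    print(hits, share)
```

Real applications typically use curated lexicons and handle negation, stemming, and multi-word terms, none of which this sketch attempts.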
A Twitter dialog between Richard Sever and John Wilbanks on whether text mining requires CC-BY, or any other particular open license, or whether it merely requires access to the text and the existing principle (of US copyright law) protecting the extraction of uncopyrightable facts from copyrighted texts. Wilbanks asserts the latter.
“I am a text mining specialist in the Literature Services team of EMBL-EBI. My team runs and maintains the Europe PMC database, an archive of life-science literature. Our job is to make it easy for researchers to find articles and information they need.
I contribute to the development of the text mining infrastructure of the database. My colleagues and I develop methods to annotate articles and design searches by indexing articles based on specific search fields. We are a service-oriented team and work closely with the users to make researchers’ lives easier….”