“To round off a great Open Access week, we’d like to announce an interesting new project we’ve started. Continuing our efforts in the field of Open Science, Open Knowledge Finland was commissioned by CSC – IT Center for Science and the Finnish Ministry of Education and Culture to implement a Study on the Openness of Scientific Publishers.”
Abstract: We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of ‘culturomics,’ focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.
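The core measure behind these trend analyses is the relative frequency of a word per year across the corpus. Here is a minimal sketch in Python; the tiny invented corpus stands in for the digitized books, and all data here is illustrative, not drawn from the actual study.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus of (year, text) pairs standing in for the
# digitized book corpus described in the abstract.
corpus = [
    (1900, "the influenza outbreak spread quickly"),
    (1900, "daily life in the city"),
    (1950, "the television changed daily life"),
    (1950, "television and radio in every home"),
]

def usage_trend(corpus, word):
    """Relative frequency of `word` per year: the basic culturomics measure."""
    counts = defaultdict(Counter)
    for year, text in corpus:
        counts[year].update(text.lower().split())
    return {
        year: c[word] / sum(c.values())
        for year, c in sorted(counts.items())
    }

print(usage_trend(corpus, "television"))
```

Plotted over two centuries of data instead of two years, curves like this are what reveal the adoption of technology, the decay of collective memory, and the other phenomena the abstract lists.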
“On the occasion of the proposed EU Copyright reform, currently under deliberation in the European Parliament, representatives from three leading European libraries are asking Julia Reda, Member of the European Parliament for the Pirate Party, about the challenges and the impact that the proposed copyright directive will have on libraries, institutional repositories, open science and more.”
Today, the European Alliance for Research Excellence (EARE) and 19 organisations representing European universities, libraries, research organisations and businesses sent an open letter to Members of the Legal Affairs Committee (JURI) in the European Parliament and Deputy Permanent Representatives of the 28 Member States, asking them to revise the Text and Data Mining (TDM) exception in the current copyright reform.
“At rOpenSci we are creating packages that allow access to data repositories through the R statistical programming environment that is already a familiar part of the workflow of many scientists. Our tools not only facilitate drawing data into an environment where it can readily be manipulated, but also one in which those analyses and methods can be easily shared, replicated, and extended by other researchers…. We develop open source R packages that provide programmatic access to a variety of scientific data, full-text of journal articles, and repositories that provide real-time metrics of scholarly impact. … Use our packages to acquire data (both your own and from various data sources), analyze it, add in your narrative, and generate a final publication in any one of widely used formats such as Word, PDF, or LaTeX. Combine our tools with the rich ecosystem of existing R packages….”
“The Third Research Excellence Framework, scheduled for the mid-2020s, now has a mandate for open access books. Despite calls from the digitally enlightened, however, most humanities long-form writing remains very much ensconced within the traditions and economics (both symbolic and financial) of the printed book. In this talk, I will discuss the challenges of a migration from conventional books to an open access model and the range of approaches that are currently being taken.
In the age of data mining, distant reading, and cultural analytics, scholars increasingly rely upon automated, algorithm-based procedures in order to parse the exponentially growing databases of digitized textual and visual resources. While these new trends are dramatically shifting the scale of our objects of study, from one book to millions of books, from one painting to millions of images, the most traditional output of humanistic scholarship—the single-author monograph—has maintained its institutional pre-eminence in the academic world, while showing the limitations of its printed format.

Recent initiatives, such as the AHRC-funded Academic Book of the Future in the UK and the Andrew W. Mellon-funded digital publishing initiative in the USA, have answered the need to envision new forms of scholarly publication on the digital platform, and in particular the need to design and produce a digital equivalent to, or substitute for, the printed monograph. Libraries, academic presses and a number of scholars across a variety of disciplines are participating in this endeavour, debating key questions in the process, such as: What is an academic book? Who are its readers? What can technology do to help make academic books more accessible and sharable without compromising their integrity and durability?

Yet a more fundamental question remains to be answered, as our own idea of what a ‘book’ is (or was) and does (or did) evolves: how can a digital, ‘single-author’ monograph effectively draw from the growing field of digital culture without losing those characteristics that made it perhaps the most stable form of humanistic culture since the Gutenberg revolution? Our speakers will debate some of these questions and provide their points of view on some of the specific issues involved. After their short presentations, all participants are invited to bring their own ideas about, and experience with, digital publishing to the table.”
“In the longer-term future, one could envision a system where researchers post their scientific contributions (a paper, a single figure, a method, a hypothesis), and where we have the potential to make smaller contributions to the global knowledge base and get credit for those contributions in a manner that is more rapid and incremental. This would allow multiple scientists to collaborate on and contribute to what we now know of as a single paper. Part of the challenge of the next 10 years is the problem of increasing information overload. Journals in the life sciences are aware that preprints have been around in physics for 25 years, and that the existence of preprints does not diminish the need for journals in that field. It is already impossible for a person to read all the relevant literature in their area, and this will only get harder. We need better tools to read and comprehend the literature, and many of these tools will come from innovations in software and machine learning. My hope is that more of the literature becomes accessible to text and data mining, which will enhance our ability to understand the literature beyond that of a single human reader….”
“The DPLA is launching an open-source tool for fast, large-scale data harvests from OAI repositories. The tool uses a Spark distributed processing engine to speed up and scale up the harvesting operation, and to perform complex analysis of the harvested data. It is helping us improve our internal workflows and provide better service to our hubs. The Spark OAI Harvester is freely available and we hope that others working with interoperable cultural heritage or science data will find uses for it in their own projects.”
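For readers unfamiliar with the underlying protocol, the sketch below shows the sequential OAI-PMH harvesting loop that the Spark tool parallelizes and scales up. This is an illustration of the protocol only, not the DPLA harvester's actual code; the endpoint URL and canned demo response are invented for the example.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def harvest(fetch, base_url, metadata_prefix="oai_dc"):
    """Yield every <record> from an OAI-PMH repository, page by page.

    `fetch` is any callable mapping a URL to an XML string (a thin
    urllib wrapper in practice); injecting it keeps the logic testable.
    """
    url = f"{base_url}?verb=ListRecords&metadataPrefix={metadata_prefix}"
    while url:
        root = ET.fromstring(fetch(url))
        yield from root.iter(f"{OAI}record")
        # OAI-PMH paginates with a resumptionToken; follow it until empty.
        token = root.find(f".//{OAI}resumptionToken")
        if token is not None and (token.text or "").strip():
            url = (f"{base_url}?verb=ListRecords"
                   f"&resumptionToken={token.text.strip()}")
        else:
            url = None

# Demo with a canned single-page response instead of a live endpoint.
SAMPLE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
    '<record><header><identifier>oai:example:1</identifier></header></record>'
    '<record><header><identifier>oai:example:2</identifier></header></record>'
    '<resumptionToken/></ListRecords></OAI-PMH>'
)
records = list(harvest(lambda url: SAMPLE, "http://example.org/oai"))
print([r.find(f"{OAI}header/{OAI}identifier").text for r in records])
```

Because each resumptionToken depends on the previous response, a single harvest like this is inherently serial; the speed-up Spark offers comes from running many such harvests, and the downstream analysis of the records, across a cluster.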
“There’s a vast trove of science out there locked inside the PDF format. From preprints to peer-reviewed literature and historical research, millions of scientific manuscripts today can only be found in a print-era format that is effectively inaccessible to the web of interconnected online services and APIs that are increasingly becoming the digital scaffold of today’s research infrastructure…. Extracting key information from PDF files isn’t trivial. … It would therefore certainly be useful to be able to extract all key data from manuscript PDFs and store it in a more accessible, more reusable format such as XML (of the publishing industry standard JATS variety or otherwise). This would allow for the flexible conversion of the original manuscript into different forms, from mobile-friendly layouts to enhanced views like eLife’s side-by-side view (through eLife Lens). It would also make the research mineable and API-accessible to any number of tools, services and applications. From advanced search tools to the contextual presentation of semantic tags based on users’ interests, and from cross-domain mash-ups showing correlations between different papers to novel applications like ScienceFair, a move away from PDF and toward a more open and flexible format like XML would unlock a multitude of use cases for the discovery and reuse of existing research…. We are embarking on a project to build on these existing open-source tools, and to improve the accuracy of the XML output. One aim of the project is to combine some of the existing tools in a modular PDF-to-XML conversion pipeline that achieves a better overall conversion result compared to using individual tools on their own.
In addition, we are experimenting with a different approach to the problem: using computer vision to identify key components of the scientific manuscript in PDF format…. To this end, we will be collaborating with other publishers to collate a broad corpus of valid PDF/XML pairs to help train and test our neural networks….”
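As a rough illustration of the modular-pipeline idea described above, the sketch below chains independent stages into one PDF-to-XML conversion. The stage names, the intermediate dict format, and the toy "extraction" step are hypothetical stand-ins for real tools, not the project's actual interfaces.

```python
# Each stage is a plain function; the pipeline is their composition,
# so any stage can be swapped for a better tool independently.

def extract_text(pdf_bytes):
    # Placeholder for a real extractor; here we just treat the bytes
    # as UTF-8 text instead of parsing an actual PDF.
    return {"body": pdf_bytes.decode("utf-8", errors="ignore")}

def detect_sections(doc):
    # Naive split of title from body; a real stage would use layout
    # analysis or a trained model.
    title, _, rest = doc["body"].partition("\n")
    return {"title": title.strip(), "body": rest.strip()}

def to_jats(doc):
    # Serialize to a minimal JATS-like XML fragment.
    return (f"<article><front><article-title>{doc['title']}"
            f"</article-title></front><body><p>{doc['body']}</p>"
            f"</body></article>")

def pipeline(pdf_bytes, stages=(extract_text, detect_sections, to_jats)):
    result = pdf_bytes
    for stage in stages:
        result = stage(result)
    return result

print(pipeline(b"A Study of Openness\nMillions of papers are locked in PDF."))
```

The point of the design is that a computer-vision component like the one described above could slot in as an alternative `detect_sections` stage without touching the rest of the chain.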
“Poster presented at OAI10, University of Geneva, 21–23 June 2017.”
“The number of scholarly research papers being published is growing steadily; it is estimated that approximately 1.5 million research papers are produced each year, and about 4% of them are available via Open Access journals. The high volume of scientific papers introduces new opportunities for content discoverability and facilitates growth across scientific disciplines via text and data mining (TDM). One of the greatest barriers to TDM is the difficulty of programmatically accessing open access content from a wide range of publishers…”