Open Science Fair

“Open Science is a new research paradigm facing many challenges, mainly the ingrained research habits accompanied by non-incentivizing institutional and funder reward systems, the lack of embedded tools and services, and the weak connection to non-academic communities.

OSFair2019 is organized as an emblematic initiative of OpenAIRE, co-organized by 3 other EU projects in the area of Open Science: FIT4RRI, EOSC Secretariat and FAIRsFAIR. It is locally curated by the University of Minho.

Open Science Fair will critically showcase the elements required for the transition to Open Science: e-infrastructures and services, policies as guidance for good practices, research flows and new types of activities (disseminate, mine, review, assess, etc.), the roles of the respective actors and their networks….”

NimbleMiner: A Novel Multi-Lingual Text Mining Application

Abstract:  This demonstration will present a novel open-access text mining application called NimbleMiner. NimbleMiner’s architecture is language agnostic, and it can potentially be applied to multiple languages. The system was applied in a series of recent studies in several languages, including English and Hebrew, and showed good text classification performance when compared to other natural language processing approaches.

Team Awarded Grant to Help Digital Humanities Scholars Navigate Legal Issues of Text Data Mining – UC Berkeley Library Update

“We are thrilled to share that the National Endowment for the Humanities (NEH) has awarded a $165,000 grant to a UC Berkeley-led team of legal experts, librarians, and scholars who will help humanities researchers and staff navigate complex legal questions in cutting-edge digital research….

Until now, humanities researchers conducting text data mining have had to navigate a thicket of legal issues without much guidance or assistance. For instance, imagine the researchers needed to scrape content about Egyptian artifacts from online sites or databases, or download videos about Egyptian tomb excavations, in order to conduct their automated analysis. And then imagine the researchers also want to share these content-rich data sets with others to encourage research reproducibility or enable other researchers to query the data sets with new questions. This kind of work can raise issues of copyright, contract, and privacy law, not to mention ethics if there are issues of, say, indigenous knowledge or cultural heritage materials plausibly at risk. Indeed, in a recent study of humanities scholars’ text analysis needs, participants noted that access to and use of copyright-protected texts was a “frequent obstacle” in their ability to select appropriate texts for text data mining. 

Potential legal hurdles do not just deter text data mining research; they also bias it toward particular topics and sources of data. In response to confusion over copyright, website terms of use, and other perceived legal roadblocks, some digital humanities researchers have gravitated to low-friction research questions and texts to avoid decision-making about rights-protected data. They use texts that have entered into the public domain or use materials that have been flexibly licensed through initiatives such as Creative Commons or Open Data Commons. When researchers limit their research to such sources, it is inevitably skewed, leaving important questions unanswered, and rendering resulting findings less broadly applicable. A growing body of research also demonstrates how race, gender, and other biases found in openly available texts have contributed to and exacerbated bias in developing artificial intelligence tools. …

Data-mining reveals that 80% of books published 1924-63 never had their copyrights renewed and are now in the public domain / Boing Boing

“But there’s another source of public domain works: until the 1976 Copyright Act, US works were not copyrighted unless they were registered, and then they quickly became public domain unless that registration was renewed. The problem has been to figure out which of these works were in the public domain, because the US Copyright Office’s records were not organized in a way that made it possible to easily cross-check a work with its registration and renewal.

For many years, the Internet Archive has hosted an archive of registration records, which were partially machine-readable.

Enter the New York Public Library, which employed a group of people to encode all these records in XML, making them amenable to automated data-mining.

Now, Leonard Richardson (previously) has done the magic data-mining work to affirmatively determine which of the 1924-63 books are in the public domain, which turns out to be 80% of those books; what’s more, many of these books have already been scanned by the Hathi Trust (which uses a limitation in copyright to scan university library holdings for use by educational institutions, regardless of copyright status)….”
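The core of the data-mining step described above is a join between two record sets: registrations and renewals. As a minimal sketch (assuming a simplified XML record format; the real NYPL-encoded schema is far richer, and the actual determination involves dates and edge cases), a book registered 1924-63 whose registration number never appears in the renewal records is a public-domain candidate:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of NYPL-style XML records: each <entry>
# carries a registration id; a renewal references that id.
REGISTRATIONS = """
<registrations>
  <entry id="A123456" title="Example Novel" year="1930"/>
  <entry id="A234567" title="Forgotten Memoir" year="1952"/>
</registrations>
"""

RENEWALS = """
<renewals>
  <renewal regnum="A123456" year="1958"/>
</renewals>
"""

def public_domain_candidates(reg_xml, ren_xml):
    """Return titles whose registration was never renewed."""
    renewed = {r.get("regnum") for r in ET.fromstring(ren_xml)}
    return [e.get("title")
            for e in ET.fromstring(reg_xml)
            if e.get("id") not in renewed]

print(public_domain_candidates(REGISTRATIONS, RENEWALS))
# → ['Forgotten Memoir']
```

Machine-readable XML is what makes this join trivial; against the Copyright Office's original paper-era records, the same cross-check was impractical.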

The Right to Read is the Right To Mine: But Not When Blocked by Technical Protection Measures – LIBER

“Our Copyright & Legal Matters Working Group is working with LACA to gather evidence about what happens when Technical Protection Measures (TPMs) block researchers from accessing content because they have attempted text and data mining. 

The survey asks questions related to the type of content blocked, how the issue was solved and how long it took for access to return to business as usual. …”

The Economic Impacts of Open Science: A Rapid Evidence Assessment | HTML

Abstract:  A common motivation for increasing open access to research findings and data is the potential to create economic benefits—but evidence is patchy and diverse. This study systematically reviewed the evidence on what kinds of economic impacts (positive and negative) open science can have, how these come about, and how benefits could be maximized. Use of open science outputs often leaves no obvious trace, so most evidence of impacts is based on interviews, surveys, inference based on existing costs, and modelling approaches. There is indicative evidence that open access to findings/data can lead to savings in access costs, labour costs and transaction costs. There are examples of open science enabling new products, services, companies, research and collaborations. Modelling studies suggest higher returns to R&D if open access permits greater accessibility and efficiency of use of findings. Barriers include lack of skills capacity in search, interpretation and text mining, and lack of clarity around where benefits accrue. There are also contextual considerations around who benefits most from open science (e.g., sectors, small vs. larger companies, types of dataset). Recommendations captured in the review include more research, monitoring and evaluation (including developing metrics), promoting benefits, capacity building and making outputs more audience-friendly.

The plan to mine the world’s research papers

Carl Malamud is on a crusade to liberate information locked up behind paywalls — and his campaigns have scored many victories. He has spent decades publishing copyrighted legal documents, from building codes to court records, and then arguing that such texts represent public-domain law that ought to be available to any citizen online. Sometimes, he has won those arguments in court. Now, the 60-year-old American technologist is turning his sights on a new objective: freeing paywalled scientific literature. And he thinks he has a legal way to do it.

Over the past year, Malamud has — without asking publishers — teamed up with Indian researchers to build a gigantic store of text and images extracted from 73 million journal articles dating from 1847 up to the present day. The cache, which is still being created, will be kept on a 576-terabyte storage facility at Jawaharlal Nehru University (JNU) in New Delhi. “This is not every journal article ever written, but it’s a lot,” Malamud says. It’s comparable to the size of the core collection in the Web of Science database, for instance. Malamud and his JNU collaborator, bioinformatician Andrew Lynn, call their facility the JNU data depot.

No one will be allowed to read or download work from the repository, because that would breach publishers’ copyright. Instead, Malamud envisages, researchers could crawl over its text and data with computer software, scanning through the world’s scientific literature to pull out insights without actually reading the text….”
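The article does not describe the depot's actual interface, but the non-consumptive model it envisages — software extracts insights while no human reads the text — can be sketched as a query that returns only aggregate statistics, never the underlying documents (a minimal illustration with invented example sentences, not the JNU depot's API):

```python
from collections import Counter
import re

def nonconsumptive_term_counts(documents, query_terms):
    """Count occurrences of query terms across a corpus, returning
    only aggregate statistics -- never the text itself."""
    counts = Counter()
    terms = {t.lower() for t in query_terms}
    for doc in documents:
        for token in re.findall(r"[a-z]+", doc.lower()):
            if token in terms:
                counts[token] += 1
    return dict(counts)

corpus = [
    "Gene expression varies across tissues.",
    "Expression of the gene was silenced.",
]
print(nonconsumptive_term_counts(corpus, ["gene", "expression"]))
# → {'gene': 2, 'expression': 2}
```

The design point is that the full texts stay inside the facility; only derived numbers leave it, which is the basis of the argument that such mining does not substitute for reading and so does not breach publishers' copyright.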

Explainer: What will the new EU copyright rules change for Europe’s Cultural Heritage Institutions | Europeana Pro

“On 17 May 2019 the Directive on Copyright in the Digital Single Market was published in the Official Journal of the European Union. Member States have until 7 June 2021 to implement the new rules into national law. In this explainer, Paul Keller, Policy Advisor to Europeana Foundation, breaks down the changes these new rules bring to Europe’s Cultural Heritage institutions. …

Article 14 of the directive clarifies a fundamental principle of EU copyright law. The article makes it clear that “when the term of protection of a work of visual art has expired, any material resulting from an act of reproduction of that work is not subject to copyright or related rights, unless the material resulting from that act of reproduction is original”. In other words, the directive establishes that museums and other cultural heritage institutions can no longer claim copyright over (digital) reproductions of public domain works in their collections. In doing so the article settles an issue that has sparked quite some controversy in the cultural heritage sector in the past few years and aligns the EU copyright rules with the principles expressed in Europeana’s Public Domain Charter….

Finally, the DSM directive introduces not one but two new Text and Data Mining exceptions (Articles 3 & 4) that will need to be implemented by all Member States. The first exception (Article 3) allows “research organisations and cultural heritage institutions” to make extractions and reproductions of copyright protected works to which they have lawful access “in order to carry out, for the purposes of scientific research, Text and Data Mining”. Under this exception cultural heritage institutions can text and data mine all works that they have in their collections (or to which they have lawful access via other means), as long as this happens for the purpose of scientific research.

The second exception (Article 4) is not limited to Text and Data Mining for the purpose of scientific research. Instead it allows anyone (including cultural heritage institutions) to make reproductions or extractions of works to which they have lawful access for Text and Data Mining regardless of the underlying purpose. …”