HathiTrust: A Digital Library Revolution Takes Flight

“The phrase “closed until further notice due to COVID-19” has become all too familiar. And, while we have started to grow accustomed to losing access to many resources that typically define our community existence, there’s one that’s particularly crucial to student and faculty researchers: libraries. For some, it may be easy to write off libraries as “nice-to-have.” But for scholars, they are essential. And as library doors began to shutter throughout California and much of the world, the potential impact on the academic community was profound.

Thankfully, the University of California has been preparing for this moment for decades. In 2008, the UC Libraries co-founded HathiTrust, and started contributing scanned copies of books and journals to the new organization. Based at the University of Michigan (U-M), HathiTrust is a large-scale repository of digital content collaboratively created by academic and research institutions. As researchers lost access to vital hard-copy materials, it initiated an Emergency Temporary Access Service (ETAS) to give UC researchers critical access to more than 13 million digital volumes. This revolution has been immediately impactful — and a profound advancement in sharing digital content….”

CSU Explores the Possibility of a Google Books Partnership – Cal schol.com

“Just heard yesterday that our CSU Council of Library Deans (COLD) approved a request I’d made to send records of our entire CSU print holdings to Google Books for evaluation. Google Books will run a comparison of their current digitized holdings against our holdings and evaluate on their end whether a digitization partnership makes sense. If it does, then the CSU will consider whether it might make sense for us as well….”

Why a National Emergency Library Would Have Been Unnecessary – Disruptive Competition Project

“Last week, in response to the COVID-19 pandemic, the Internet Archive announced the National Emergency Library (“NEL”), which expanded digital access to the books in its collection. The New Yorker welcomed it as “a gift to readers everywhere.” Predictably, the Authors Guild, the Copyright Alliance, and the Association of American Publishers condemned the move as infringing copyright. Overlooked in this controversy is that had the 2008 attempted settlement of the litigation over the Google Library Project been approved by the court, the NEL would likely have been unnecessary….”

B2fxxx: Carl Malamud at the Open University

“Without asking publishers’ permission, Malamud has put a lot of stuff online via a project at Jawaharlal Nehru University (JNU) in India – 125 million journal articles from many sources, from the mid-19th century up to the present.

The storage facility is air-gapped and not connected to the internet. Researchers who want access can bring their computers to the facility and text & data mine the materials there. Without reading or downloading the articles, which is not permitted, they can nevertheless draw scientific insights, thereby circumventing any potential copyright problems. The terms and conditions are modeled on those of the HathiTrust and the store specialises in bioinformatics. The access model is 3-tiered:

Tier 0 is PDFs of the articles and is air-gapped

Tier 1 is extracted texts and is also air-gapped

Tier 2 is facts. As there is no copyright on facts, this can be made available openly to everyone….

In 2016 the US Supreme Court rejected the Authors Guild’s request to further appeal the decision, ending litigation that had lasted more than a decade. The Authors Guild also tried suing the HathiTrust but were unsuccessful in that case too. The technicalities of the case were different. One interesting angle was that the court made a point of noting the value of the HathiTrust approach to making the books available to print-disabled and visually impaired readers.

The bottom line was that Google Books and the HathiTrust were given the OK by the US courts.

In the UK text and data mining is permitted only for non-commercial use. …”
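
The three-tier access model described in the excerpt above can be sketched as a simple access rule. This is only an illustration under stated assumptions: the names `Tier`, `Request`, and `is_allowed` are hypothetical and do not come from the project, and the sketch assumes that "air-gapped" means a tier can be reached only by a researcher working inside the facility.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical labels for the three tiers named in the excerpt."""
    PDF = 0    # PDFs of the articles (air-gapped)
    TEXT = 1   # extracted texts (air-gapped)
    FACTS = 2  # facts, which carry no copyright (open to everyone)


@dataclass(frozen=True)
class Request:
    tier: Tier
    on_site: bool  # assumption: True when the researcher is inside the facility


def is_allowed(req: Request) -> bool:
    """Tier 2 is open to everyone; tiers 0 and 1 require on-site access."""
    if req.tier == Tier.FACTS:
        return True
    return req.on_site


if __name__ == "__main__":
    for tier in Tier:
        for on_site in (False, True):
            print(tier.name, "on_site" if on_site else "remote",
                  is_allowed(Request(tier, on_site)))
```

The rule encodes only what the excerpt states: facts are openly available, while PDFs and extracted texts stay behind the air gap.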

Google Books 2020 Update | Communications

“What would you do if Google came to you and said: You have 1 million items that we would like to scan for you and make available to the world?

Over the past two years, a team from Access Services, Stacks Management, Library Technology Services, Information and Technical Services, Harvard Depository, and ReCAP have been attempting to do just that as part of a Harvard Library Digital Strategies and Innovation (DSI) initiative. This project began nearly a decade after our first partnership with Google Books, and it has been an opportunity to approach this work differently — to identify the challenges that we face at each step of the workflow and to look for creative, iterative ways to meet them….

Between 2004 and 2009, Google scanned 891,164 volumes from Harvard. Google has begun reprocessing those materials, enhancing and correcting the raw images and running them through updated OCR to create better, more searchable, machine-readable text.  

As part of this relationship, we are involved in the Google Library Partners group, an active community of our colleagues from peer institutions who also share their materials with Google. As a group we have been able to advocate for and contribute to reviews for handling of materials, quality assurance in scanning, and expanded treatments for items with foldouts or materials of non-traditional size. We have also led a review of how our peers provide access to materials and are actively partnering with HathiTrust to conduct more research into how users find and utilize these materials….”

4.5 Million UC Volumes Digitized & UC’s Most Popular Full View Books in HathiTrust for 2019 – California Digital Library

“The University of California Libraries recently contributed the 4,500,000th digitized book from their collections to HathiTrust Digital Library–a tremendous achievement resulting from 15 years of continuous digitization work. 

The vast majority of these millions of volumes were generated via the Google Books Library Project, which UC joined in 2006. That year the mass digitization of UC’s library collections began in earnest when the Northern Research Library Facility (NRLF) started sending books to the Google Books Library Project for scanning. UC’s work with the Google Books Library Project has never paused–by the time UC’s 3,000,000th volume was digitized in 2010, UC San Diego, UC Santa Cruz, and UCLA had all begun sending collections to Google for digitization. Since then, UC San Francisco, the Southern Research Library Facility (SRLF), UC Davis, UC Berkeley, UC Riverside, UC Irvine, and UC Santa Barbara have all participated, with UC Santa Barbara, UC Berkeley, UC San Diego, UC Riverside, UCLA, and NRLF continuing to do so….”

The Rebirth of Copyright As an Opt-In System? – The Media Institute

“For most of the history of Anglo-American copyright law, copyright was an opt-in system: Authors had to jump through certain regulatory hoops if they wanted to prevent others from copying their works without consent.  These threshold formalities included registering their works with a government agency, affixing a notice to published copies, depositing exemplars with a centralized library, and more.  A failure to comply with the requirements usually meant a diminution in the authors’ copyright entitlement – and in some cases a wholesale forfeiture, under which the works would pass immediately into the public domain.

After some 200 years, however, U.S. copyright abandoned its formal requirements.  Beginning in 1976 and culminating in 1989, Congress responded to complaints from authors (who had sometimes lost protection due to what they viewed as a technicality) and to pressure to join the international copyright community (which forbade most formalities).  Copyright law accordingly underwent a conversion from opt-in to opt-out.

As a result, copyright protection now arises by operation of law, without any action by the author.  As long as a work contains a modicum of originality and is fixed in some tangible form, copyright automatically protects it, and authors must affirmatively disclaim the entitlement if they don’t want its protection.  And these threshold requirements of originality and fixation are incredibly minimal, such that every reader of this essay is probably the owner of hundreds, and quite possibly thousands, of copyrights – in everything from diary entries to doodles….

Of course, any opt-in proposal would face a number of political obstacles, including the fact that predicating copyright protection on any formality (at least for foreign works) is inconsistent with the international copyright conventions to which the United States is a party.  But the Internet does not stop at the border; if opt-in makes sense here, it will make sense abroad as well.  When the United States and its trade partners are done figuring out what to do with Google Books, then, they should consider a return to copyright’s roots.  Make copyright opt-in once more….”

ASECS at 50: Interview with Robert Darnton

“Of the potential solutions, open research practices are among the most promising. The argument is that transparency acts as an implicit quality control process. If others are able to scrutinise our work—not just the final published output, but the underlying data, code, and so on—researchers will be incentivised to ensure these are high quality.

So, if we think that research could benefit from improved quality control, and if we think that open research might have a role to play in this, why aren’t we all doing it? In a word: incentives….”

Sapping Attention: How badly is Google Books search broken, and why?

“I periodically write about Google Books here, so I thought I’d point out something that I’ve noticed recently that should be concerning to anyone accustomed to treating it as the largest collection of books: it appears that when you use a year constraint on book search, the search index has dramatically constricted to the point of being, essentially, broken….

What’s going on? I don’t know. I guess I blame the lawyers: I suspect that the reasons have to do with the way the Google books project has become a sort of Herculaneum-on-the-Web, frozen in time at the moment that anti-Books lawsuits erupted in earnest 11 years ago. The site is still littered with pre-2012 branding and icons, and the still-live “project history” page ends with the words “stay tuned…” after describing their annual activity for 2007….”