WarSampo: Publishing and Using Linked Open Data about the Second World War

“The WarSampo system 1) initiates and fosters large scale Linked Open Data (LOD) publication of WW2 data from distributed, heterogeneous data silos and 2) demonstrates and suggests its use in applications and DH research. WarSampo is to our best knowledge the first large scale system for serving and publishing WW2 LOD on the Semantic Web for machine and human users. Its knowledge graph metadata contains over 9 million associations (triples) between data items including, e.g., a complete set of over 95,000 death records of Finnish WW2 soldiers, 160,000 authentic photos taken during the war, 32,000 historical places on historical maps, 23,000 war diaries of army units, and 3,400 memoir articles written by the veterans after the war. WarSampo data comes from several Finnish organizations and sources, such as National Archives, Defense Forces, Land Survey of Finland, Wikipedia/DBpedia, text books, and magazines.

WarSampo has two separate components: 1) WarSampo Data Service for machines and 2) WarSampo Semantic Portal with various applications for human users.”

Science Beam – using computer vision to extract PDF data | Labs | eLife

“There’s a vast trove of science out there locked inside the PDF format. From preprints to peer-reviewed literature and historical research, millions of scientific manuscripts today can only be found in a print-era format that is effectively inaccessible to the web of interconnected online services and APIs that are increasingly becoming the digital scaffold of today’s research infrastructure….Extracting key information from PDF files isn’t trivial. …It would therefore certainly be useful to be able to extract all key data from manuscript PDFs and store it in a more accessible, more reusable format such as XML (of the publishing industry standard JATS variety or otherwise). This would allow for the flexible conversion of the original manuscript into different forms, from mobile-friendly layouts to enhanced views like eLife’s side-by-side view (through eLife Lens). It will also make the research mineable and API-accessible to any number of tools, services and applications. From advanced search tools to the contextual presentation of semantic tags based on users’ interests, and from cross-domain mash-ups showing correlations between different papers to novel applications like ScienceFair, a move away from PDF and toward a more open and flexible format like XML would unlock a multitude of use cases for the discovery and reuse of existing research….We are embarking on a project to build on these existing open-source tools, and to improve the accuracy of the XML output. One aim of the project is to combine some of the existing tools in a modular PDF-to-XML conversion pipeline that achieves a better overall conversion result compared to using individual tools on their own. 
In addition, we are experimenting with a different approach to the problem: using computer vision to identify key components of the scientific manuscript in PDF format….To this end, we will be collaborating with other publishers to collate a broad corpus of valid PDF/XML pairs to help train and test our neural networks….”

Research Articles in Simplified HTML: a Web-first format for HTML-based scholarly articles

Abstract:  Purpose: this paper introduces the Research Articles in Simplified HTML (or RASH), which is a Web-first format for writing HTML-based scholarly papers; it is accompanied by the RASH Framework, i.e. a set of tools for interacting with RASH-based articles. The paper also presents an evaluation that involved authors and reviewers of RASH articles submitted to the SAVE-SD 2015 and SAVE-SD 2016 workshops.

Design: RASH has been developed in order to: be easy to learn and use; share scholarly documents (and embedded semantic annotations) through the Web; support its adoption within the existing publishing workflow.

Findings: the evaluation study confirmed that RASH can already be adopted in workshops, conferences and journals and can be quickly learnt by researchers who are familiar with HTML.

Research limitations: the evaluation study also highlighted some issues in the adoption of RASH, and in general of HTML formats, especially by less technically savvy users. Moreover, additional tools are needed, e.g. for enabling additional conversion from/to existing formats such as OpenXML.

Practical implications: RASH (and its Framework) is another step towards enabling the definition of formal representations of the meaning of the content of an article, facilitating its automatic discovery, enabling its linking to semantically related articles, providing access to data within the article in actionable form, and allowing integration of data between papers.

Social implications: RASH addresses the intrinsic needs related to the various users of a scholarly article: researchers (focussing on its content), readers (experiencing new ways for browsing it), citizen scientists (reusing available data formally defined within it through semantic annotations), publishers (using the advantages of new technologies as envisioned by the Semantic Publishing movement).

Value: RASH focuses strictly on writing the content of the paper (i.e., organisation of text + semantic annotations) and leaves all the issues about its validation, visualisation, conversion, and semantic data extraction to the various tools developed within its Framework.

Yewno Announces Partnerships With Top Publishers to Produce Additional Content Discoverable Through Yewno Platform | Business Wire

“Yewno, a provider of a new inference engine that mimics the human brain and increases knowledge discovery, today announced its partnership with top publishers and other research providers including Wiley, Harvard DASH, American Society for Microbiology and BioOne. Content from these distinguished publishers will produce new insights and inferences giving knowledge seekers access to important content across various verticals to enhance discovery….”

ODRL Community Group

“The W3C ODRL [Open Digital Rights Language] Community Group’s aim is to develop and promote an open international specification for Policy Language expressions. The ODRL Policy Language provides a flexible and interoperable information model to support transparent and innovative use of digital assets in the publishing, distribution and consumption of content, applications, and services across all sectors and communities. The ODRL Policy model is targeted to support the business models of open, educational, government, and commercial communities through Profiles that enhance the model to align to their requirements whilst providing a common semantic layer for interoperability….”
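
To make the idea of a policy information model concrete, here is a minimal sketch of a single-permission policy in the general shape of the ODRL JSON-LD serialization. The policy and asset identifiers are placeholders, and the `is_permitted` helper is a hypothetical illustration that is not part of ODRL: it ignores prohibitions, duties, and constraints, all of which a real evaluator would have to handle.

```python
import json

# A "Set" policy with one permission. The uid and target are placeholder
# identifiers chosen for illustration only.
policy = {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@type": "Set",
    "uid": "http://example.com/policy:1010",
    "permission": [
        {"target": "http://example.com/asset:9898.movie", "action": "use"}
    ],
}


def is_permitted(policy, asset, action):
    """Hypothetical helper: True if some permission rule names this asset and action.

    Deliberately naive: prohibitions, duties, and constraints are not evaluated.
    """
    return any(
        rule.get("target") == asset and rule.get("action") == action
        for rule in policy.get("permission", [])
    )


if __name__ == "__main__":
    print(json.dumps(policy, indent=2))
    print(is_permitted(policy, "http://example.com/asset:9898.movie", "use"))
```

Because the policy is plain JSON-LD, the same document can be exchanged between systems and interpreted against the shared vocabulary, which is the interoperability point the group describes.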

Improving interoperability using vocabulary linked data – Springer

Abstract:  The concept of Linked Data has been an emerging theme within the computing and digital heritage areas in recent years. The growth and scale of Linked Data has underlined the need for greater commonality in concept referencing, to avoid local redefinition and duplication of reference resources. Achieving domain-wide agreement on common vocabularies would be an unreasonable expectation; however, datasets often already have local vocabulary resources defined, and so the prospects for large-scale interoperability can be substantially improved by creating alignment links from these local vocabularies out to common external reference resources. The ARIADNE project is undertaking large-scale integration of archaeology dataset metadata records, to create a cross-searchable research repository resource. Key to enabling this cross search will be the ‘subject’ metadata originating from multiple data providers, containing terms from multiple multilingual controlled vocabularies. This paper discusses various aspects of vocabulary mapping. Experience from the previous SENESCHAL project in the publication of controlled vocabularies as Linked Open Data is discussed, emphasizing the importance of unique URI identifiers for vocabulary concepts. There is a need to align legacy indexing data to the uniquely defined concepts and examples are discussed of SENESCHAL data alignment work. A case study for the ARIADNE project presents work on mapping between vocabularies, based on the Getty Art and Architecture Thesaurus as a central hub and employing an interactive vocabulary mapping tool developed for the project, which generates SKOS mapping relationships in JSON and other formats. The potential use of such vocabulary mappings to assist cross search over archaeological datasets from different countries is illustrated in a pilot experiment. The results demonstrate the enhanced opportunities for interoperability and cross searching that the approach offers.
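
The hub-based mapping idea in this abstract can be sketched in a few lines. The example below is an illustration under stated assumptions, not the ARIADNE tool or its output format: the JSON record layout (`source`, `relation`, `target`) and every `example.org` URI are hypothetical, and only the SKOS mapping property URIs are real. Local concepts from different vocabularies are linked to a shared hub concept, and a query on one local concept is expanded to the concepts in other vocabularies that reach the same hub.

```python
import json
from collections import defaultdict

SKOS = "http://www.w3.org/2004/02/skos/core#"
EXACT = SKOS + "exactMatch"
CLOSE = SKOS + "closeMatch"
BROAD = SKOS + "broadMatch"

# Hypothetical mapping records; all example.org URIs are placeholders.
MAPPINGS_JSON = json.dumps([
    {"source": "http://example.org/vocab-en/hillfort",
     "relation": EXACT, "target": "http://example.org/hub/hillforts"},
    {"source": "http://example.org/vocab-de/wallburg",
     "relation": CLOSE, "target": "http://example.org/hub/hillforts"},
    {"source": "http://example.org/vocab-it/villa",
     "relation": BROAD, "target": "http://example.org/hub/dwellings"},
])


def index_by_hub(mappings, relations):
    """Group local concept URIs by the hub concept they are mapped to."""
    hub_index = defaultdict(set)
    for m in mappings:
        if m["relation"] in relations:
            hub_index[m["target"]].add(m["source"])
    return hub_index


def expand_concept(concept, mappings, relations=(EXACT, CLOSE)):
    """Return other local concepts that share a hub concept with `concept`."""
    hub_index = index_by_hub(mappings, set(relations))
    expanded = set()
    for local_concepts in hub_index.values():
        if concept in local_concepts:
            expanded |= local_concepts - {concept}
    return sorted(expanded)


if __name__ == "__main__":
    mappings = json.loads(MAPPINGS_JSON)
    print(expand_concept("http://example.org/vocab-en/hillfort", mappings))
```

Restricting expansion to `exactMatch` and `closeMatch` by default is a design choice in this sketch; following `broadMatch` or `narrowMatch` links would widen recall at the cost of precision.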

Enabling semantics-aware collaborative tagging and social search in an open interoperable tagosphere

Abstract:  To make the most of a global network effect and to search and filter the Long Tail, a collaborative tagging approach to social search should be based on the global activity of tagging, rating and filtering. We take a further step towards this objective by proposing a shared conceptualization of both the activity of tagging and the organization of the tagosphere in which tagging takes place. We also put forward the necessary data standards to interoperate at both data format and semantic levels. We highlight how this conceptualization makes provision for attaching identity and meaning to tags and tag categorization through a Wikipedia-based collaborative framework. Used together, these concepts are a useful and agile means of unambiguously defining terms used during tagging, and of clarifying any vague search terms. This improves search results in terms of recall and precision, and represents an innovative means of semantics-aware collaborative filtering and content ranking.

Evaluating open access journals using Semantic Web technologies and scorecards

Abstract:  This paper describes a process to develop and publish a scorecard from an OAJ (Open Access Journal) on the Semantic Web using Linked Data technologies in such a way that it can be linked to related datasets. Furthermore, methodological guidelines are presented with activities related to each step of the process. The proposed process was applied to a university OAJ, including the definition of the KPIs (Key Performance Indicators) linked to the institutional strategies, the extraction, cleaning and loading of data from the data sources into a data mart, the transformation of data into RDF (Resource Description Framework), and the publication of data by means of a SPARQL endpoint using the Virtuoso software. Additionally, the RDF data cube vocabulary has been used to publish the multidimensional data on the Web. The visualization was made using CubeViz, a faceted browser to present the KPIs in interactive charts.
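
The step that transforms data mart rows into RDF can be illustrated with a minimal sketch. It is not the authors' pipeline: the `example.org` namespace, the dimension and measure properties, the KPI name, and the value are all hypothetical, and N-Triples is chosen here only because it can be written without a library. The `qb:Observation` class and `qb:dataSet` property come from the RDF Data Cube vocabulary.

```python
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD = "http://www.w3.org/2001/XMLSchema#"
EX = "http://example.org/oaj/"  # hypothetical namespace for the journal's data


def observation_triples(kpi, year, value):
    """Serialize one KPI measurement as a qb:Observation in N-Triples.

    `kpi` must be a URI-safe identifier; `year` and `value` are strings that
    are already valid xsd:gYear and xsd:decimal lexical forms.
    """
    obs = f"<{EX}obs/{kpi}-{year}>"
    return [
        f"{obs} <{RDF_TYPE}> <{QB}Observation> .",
        f"{obs} <{QB}dataSet> <{EX}dataset/scorecard> .",
        # Dimension and measure properties below are hypothetical.
        f"{obs} <{EX}dimension/kpi> <{EX}kpi/{kpi}> .",
        f'{obs} <{EX}dimension/year> "{year}"^^<{XSD}gYear> .',
        f'{obs} <{EX}measure/value> "{value}"^^<{XSD}decimal> .',
    ]


if __name__ == "__main__":
    # Illustrative row only; the figures are made up for the example.
    for line in observation_triples("acceptance-rate", "2016", "0.35"):
        print(line)
```

Triples in this form could then be loaded into a triple store and queried through a SPARQL endpoint; a complete data cube would also need a data structure definition declaring the dimensions and measures.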