Jailbreaking the PDF – 3; Styles and fonts and the problems from publishers

Many scientific publications use specific styling to add semantics. In converting to XML it’s critical we don’t throw these away at an early stage, yet many common tools discard such styles. #AMI2 does its best to preserve all these and I think is fairly good. There are different reasons for using styles and I give examples from OA publishers…

  • Bold – used extensively for headings and inline structuring. Note (a) the bold for the heading and (b) the bold at the start of the line.


  • Italic. Species names are almost always rendered this way.


  • Monospaced.
    Most computer code is represented in this (abstract) font.


This should have convinced you that fonts and styles matter and should be retained. But many PDF2xxx systems discard them, especially for scholarly publications. There's a clear standard in PDF for indicating bold and italic, and PDFBox gives a clear API for this. But many scholarly PDFs are awful (did I mention this before?). The BMC fonts don't declare that they are bold even though they are. Or italic. So we have to use heuristics. If a BMC font has "+20" after its name it's probably bold, and "+3" means italic.
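As a rough illustration, here is a minimal sketch of that heuristic. The "+20"/"+3" suffix rules are just the BMC-specific guesses described above, and the font names in the example are hypothetical.

```python
def guess_style(font_name, declared_bold=False, declared_italic=False):
    """Fall back to font-name heuristics when the font doesn't declare its style."""
    bold, italic = declared_bold, declared_italic
    if font_name.endswith("+20"):   # observed BMC convention: probably bold
        bold = True
    if font_name.endswith("+3"):    # observed BMC convention: probably italic
        italic = True
    return bold, italic

# hypothetical font names, for illustration only
print(guess_style("AdvTT46dcae81+20"))  # (True, False)
print(guess_style("AdvTT3f84ef53+3"))   # (False, True)
```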

Isn’t this a fun puzzle?

No. It’s holding science back. Science should be about effective communication. If we are going to use styles rather than proper markup, let’s do it properly. Let’s tell the world it’s bold. Let’s use 65 to mean A.

There are a few cases where an “A” is not an “A”. As in http://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols

Most of these have specific mathematical meanings and uses, and most have their own Unicode code points. They are not letters in the normal sense of the word – they are symbols. And if they are well created and standard then they are manageable.
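For instance (a minimal sketch using Python's standard unicodedata module), a mathematical bold capital A has its own code point and name, and because it is standard Unicode a tool can still fold it back to a plain letter when plain text is wanted:

```python
import unicodedata

c = "\U0001D400"  # MATHEMATICAL BOLD CAPITAL A, from the block linked above
print(hex(ord(c)), unicodedata.name(c))
print(unicodedata.normalize("NFKC", c))  # 'A': compatibility mapping back to the plain letter
```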

But now an unnecessary nuisance from PeerJ (and I’m only using Open Access publishers so I don’t get sued):

What are the blue things? They look like normal characters, but they aren’t:

<text fill="#00a6fc" svgx:fontName="MinionExp-Regular" svgx:width="299.0" x="284.784" y="162.408" font-weight="normal"></text>

<text fill="#00a6fc" svgx:fontName="MinionExp-Regular" svgx:width="299.0" x="281.486" y="162.408" font-weight="normal"></text>

They are weird codepoints, outside the Unicode range:


These two seem to be small-capital "1" and "0". They aren't even valid Unicode characters. Some of our browsers won't display them:

(Note the missing characters).
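As a rough sketch of what "not valid Unicode" means in practice, here is a hypothetical helper (plain Python, standard unicodedata) that classifies an integer code point; the sample values are illustrative, not the actual PeerJ code points:

```python
import unicodedata

def codepoint_status(cp):
    """Very rough classification of an integer code point (hypothetical helper)."""
    if cp > 0x10FFFF:
        return "outside the Unicode range entirely"
    ch = chr(cp)
    category = unicodedata.category(ch)
    if category == "Cn":
        return "unassigned code point"
    if category == "Co":
        return "private-use code point (meaning depends on the font)"
    return "assigned: " + unicodedata.name(ch, "unnamed")

for cp in (0x31, 0xE101, 0x110001):   # illustrative values only
    print(hex(cp), codepoint_status(cp))
```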

Now the DOI is for many people a critically important part of the paper! It's critical that it is correct and re-usable. But PeerJ (which is a modern publisher and tells us how it has used modern methods to do publishing better and cheaper) seems to have deliberately used totally non-standard characters for DOIs, to the extent that my browser can't even display them. I'm open to correction – but this is barmy. (The raw PDF paper displays in Firefox, but that's because the font is represented by glyphs rather than codepoints.) No doubt I'll be told that it's more important to have beautiful fonts to reduce eyestrain for humans and that corruption doesn't matter. Most readers don't even read the references – they simply cut and paste them.

So let’s look at the references:

Here the various components are represented in different fonts and styles. (Of course it would be better to use approaches such as BibJSON or even BibTeX, but that would make it too easy to get it right.) So here we have to use fonts and styles to guess what the various bits mean. Bold is the authors, followed by a period. A bold number is the year. The title is in normal font. The journal is in italics. More bold for the volume number. Normal for the pages. Light blue is the DOI.

But at least if we keep the styles then #AMI2 can hack it. Throwing away the styles makes it much harder and much more error prone.
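Here is a toy sketch of that style-driven guessing; the spans, styles and DOI below are made up for illustration, and real rules would need to be fuzzier:

```python
def label_span(text, style):
    """Guess the role of a reference fragment from its style (toy rules only)."""
    if style == "lightblue":
        return "doi"
    if style == "italic":
        return "journal"
    if style == "bold":
        digits = text.strip().rstrip(".")
        return "year" if digits.isdigit() and len(digits) == 4 else "authors-or-volume"
    return "title-or-pages"

spans = [("Smith AB, Jones CD.", "bold"), ("2012.", "bold"),
         ("A study of something.", "normal"), ("J Important Res", "italic"),
         ("15", "bold"), (":100-110.", "normal"), ("10.9999/example", "lightblue")]
print([(text, label_span(text, style)) for text, style in spans])
```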

So to summarise #AMI2=PDF2SVG does things that most other systems don’t do:

  • Manages non-standard fonts (but with human labour)
  • Manages styles
  • Converts to Unicode

AMI2 can't yet manage raw glyphs, but she will in due time. (Unless YOU wish to volunteer – it actually is a fun machine-learning project.)

NOTE: If you are a large commercial publisher then your fonts are just as bad.

Jailbreaking the PDF – 2; Technical aspects (Glyph processing)

A lot of our discussion in Jailbreaking related to technical issues, and this is a – hopefully readable – overview.

PDF is a page description format (does anyone use pages any more, other than publishers and letter writers?) which is designed for sighted humans. At its most basic it transmits a purely visual image of information, which may simply be a bitmap (e.g. a scanned document). That's currently beyond our ability to automate (but we shall ultimately crack it). More usually it consists of glyphs (http://en.wikipedia.org/wiki/Glyph , the visual representation of a character). All the following are glyphs for the character "a".

The minimum that a PDF has to do is to transmit one of these 9 chunks. It can do that by painting black dots (pixels) onto the screen. Humans can make sense of this (they get taught to read) but machines can't. So it really helps when the publisher adds the codepoint for a character. There's a standard for this – it's called Unicode and everyone uses it. Correction: MOST people, but NOT scholarly publishers. Many publishers don't include codepoints at all but transmit the image of the glyph (sometimes a bitmap, sometimes a set of strokes, i.e. vector/outline fonts). Here's a bitmap representation of the first "a".

You can see it's made of a few hundred pixels (squares). The computer ONLY knows these are squares. It doesn't know they are an "a". We shall crack this in the next few months – it's called Optical Character Recognition (OCR) and is usually done by machine learning – we'll pool our resources on this. Most characters in figures are probably bitmapped glyphs, but some are vectors.

In the main text characters SHOULD be represented by a codepoint – "a" is Unicode codepoint 97. (Note that "A" is different, codepoint 65 – I'll use decimal values.) So does every publisher represent "a" by 97?

Of course not. Publishers' PDFs are awful and don't adhere to standards. That's a really awful problem. Moreover some publishers use 97 to mean http://en.wikipedia.org/wiki/Alpha . Why? Because in some systems there is a symbol font which only has Greek characters, and they use the same numbers.
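A minimal sketch of that font-dependent ambiguity (the mapping below covers just a few positions of a Symbol-style encoding, so treat it as illustrative):

```python
SYMBOL_MAP = {97: "\u03b1", 98: "\u03b2", 103: "\u03b3"}  # alpha, beta, gamma

def decode(code, font_name):
    """The same number means different characters depending on the font."""
    if "Symbol" in font_name:
        return SYMBOL_MAP.get(code, chr(code))
    return chr(code)

print(decode(97, "Helvetica"))  # 'a'
print(decode(97, "Symbol"))     # 'α'
```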

So why don’t publishers fix this? It’s because (a) they don’t care and (b) they can extract more money from academia for fixing it. They probably have the correct codepoint in their XML but they don’t let us have this as they want to charge us extra to read it. (That’s another blog post). Because most publishers use the same typesetters these problems are endemic in the industry. Here’s an example. I’m using BioMedCentral examples because they are Open. I have high praise for BMC but not for their technical processing. (BTW I couldn’t show any of this from Closed publishers as I’d probably be sued).

How many characters are there in this? Unless you read the PDF you don't know. The "BMC Microbiology" LOGO is actually a set of graphics strokes and there is no indication that it is actually meaningful text. But I want to concentrate on the "lambda" in the title. Here is AMI2's extracted SVG/XML (I have included the preceding "e" of "bacteriophage"):

<text stroke="none" fill="#000000" svgx:fontName="AdvOT46dcae81" svgx:width="500.0" x="182.691" y="165.703" font-size="23.305" font-weight="normal">e</text>

<text stroke="none" fill="#000000" svgx:fontName="AdvTT3f84ef53" svgx:width="0.0" x="201.703" y="165.703" font-size="23.305" font-weight="normal">l</text>

Note there is NO explicit space. We have to work it out from the coordinates (182.7 + 0.5*23 << 201.7). But the character 108 is an "l" (ell), and so an automatic conversion system creates a Latin "l" where the title actually has a Greek lambda.

This is wrong and unacceptable and potentially highly dangerous – a mu would be converted to an "em", so micrograms could be converted to milligrams.
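A minimal sketch of the coordinate heuristic mentioned above; the 0.5 factor is the rough per-glyph width guess used in the text, not a tuned value:

```python
def needs_space(x_prev, font_size, x_next, width_factor=0.5):
    """Insert a space when the gap to the next glyph is wider than a typical glyph."""
    return x_next - x_prev > width_factor * font_size

# the "e" of "bacteriophage" at x=182.691 (font-size 23.305); the lambda starts at x=201.703
print(needs_space(182.691, 23.305, 201.703))  # True -> emit "bacteriophage λ", not "bacteriophageλ"
```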

All the systems we looked at yesterday made this mistake except #AMI2. So almost all scientific content mining systems will extract incorrect information unless they can correct for this. And there are three ways of doing this:

  • Insisting publishers use Unicode. No hope in hell of that. Publishers (BMC and other OA publishers excluded) in general want to make it as hard as possible to interpret PDFs. So nonstandard PDFs are a sort of DRM. (BTW it would cost a few cents per paper to convert to Unicode – that could be afforded out of the 5500 USD they charge us).
  • Translating the glyphs into Unicode. We are going to have to do this anyway, but it will take a little while.
  • Create lookups for each font. So I have had to create a translation table for the non-standard font AdvTT3f84ef53, which AFAIK no one other than BMC uses and which isn't documented anywhere. But I will be partially automating this soon, and it's a finite, if soul-destroying, task (a minimal sketch of such a lookup follows below).
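A sketch of what such a per-font lookup might look like; the single AdvTT3f84ef53 entry reflects the ell-to-lambda example above, and anything without an entry simply falls through:

```python
FONT_TABLES = {
    "AdvTT3f84ef53": {108: "\u03bb"},   # code 108 ("l") actually draws a lambda (U+03BB, decimal 955)
}

def to_unicode(font_name, code):
    """Translate a character code to Unicode, honouring any per-font table."""
    return FONT_TABLES.get(font_name, {}).get(code, chr(code))

print(to_unicode("AdvTT3f84ef53", 108))  # 'λ'
print(to_unicode("AdvOT46dcae81", 101))  # 'e' (no table entry, falls through)
```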

So AMI2 is able to get:

With the underlying representation of lambda as Unicode 955:

So AMI2 is happy to contribute her translation tables to the Open Jailbreaking community. She'd also like people to contribute, maybe through some #crowdcrafting. It's pointless for anyone else to do this unless they want to build a standalone competitive system. Because it's Open they can take AMI2 as long as they acknowledge it in their software. Any system that hopes to do maths is almost certainly going to have to use a translator or OCR.

So glyph processing is the first and essential part of Jailbreaking the PDF.

 

Jailbreaking the PDF; a wonderful hackathon and a community leap forward for freedom – 1

Yesterday we had a truly marvellous hackathon http://scholrev.org/hackathon/ in Montpellier, in between the workshops and the main Semantic Web conference (ESWC). The purpose was to bring together a number of groups who value semantic scholarship and want to free information from the traditional forms of publication. I'll be blogging later about the legal constraints imposed by the publishing industry, but Jailbreaking is about the technical constraints of publishing information as PDF.

The idea Jailbreaking was to bring together people who have developed systems, tools, protocols, communities for turning PDF into semantic form. Simply, raw PDF is almost uninterpretable, a bit like binary programs. For about 15 years the spec was not Open and it was basically a proprietary format from Adobe. The normal way of starting to make any sense of PDF content is to buy tools from companies such as Adobe, and there has been quite a lot of recent advocacy from Adobe staff to consider using PDF as a universal data format. This would be appalling – we must use structured documents for data and text and mixtures. Fortunately there are now a good number of F/OSS tools, my choice being http://pdfbox.apache.org/ and these volunteers have laboured long and hard in this primitive technology to create interpreters and libraries. PDF can be produced well, but most scholarly publishers’ PDFs are awful.

It's a big effort to create a PDF2XML system (the end goal). I am credited with the phrase "turning a hamburger into a cow" but it's someone else's. If we sat down to plan PDF2XML, we'd conclude it was very daunting. But we have the modern advantage of distributed enthusiasts. Hacking PDF systems by oneself at 0200 in the morning is painful. Hacking PDFs in the company of similar people is wonderful. The first thing is that it lifts the overall burden from you. You don't have to boil the ocean by yourself. You find that others are working on the same challenge and that's enormously liberating. They face the same problems and often solve them in different ways or have different priorities. And that's the first positive takeaway – I am vastly happier and more relaxed. I have friends and the many "I"s are now "we". It's the same liberating feeling as 7 years ago when we created the http://en.wikipedia.org/wiki/Blue_Obelisk community for chemistry. Jailbreaking has many of the shared values, though coming from different places.

Until recently most of the tools were closed source, usually for-money though occasionally free-as-in-beer for some uses or communities. I have learnt from bitter experience that you can never build an ongoing system on closed source components. At some stage they will either be withdrawn or there will be critical things you want to change or add, and that's simply not possible. And licensing closed source in an open project is a nightmare. It's an anticommons. So, regretfully, I shall not include Utopia/pdfx from Manchester in my further discussion because I can't make any use of it. Some people use its output, and that's fine – but I would/might want to use some of its libraries.

There was a wonderful coming-together of people with open systems. None of us had the whole picture, but together we covered all of it. Not "my program is better than your program", but "our tools are better than my system". So here is a brief overview of the open players who came together (I may miss some individuals, please comment if I have done you an injustice). I'll explain the technical bits in a later post – here I am discussing the social aspects.

  • LA-PDFText (http://code.google.com/p/lapdftext/
    Gully Burns). Gully was in Los Angeles – in the middle of the night – and showed great stamina. In true hacking spirit I used the time to find out about Gully's system. I downloaded it and couldn't get it to install (it needed Java 6). So Gully repackaged it, and within two iterations (an hour) I had it working. That would have taken days conventionally. LA-PDFText is particularly good at discovering blocks (more sophisticated than #AMI2) so maybe I can use it in my work rather than competing.
  • CERMINE
    (http://sciencesoft.web.cern.ch/node/120). I've already blogged about this, but here we had the lead, Dominika Tkaczyk, live from Poland. I take comfort from her presence and vice versa. CERMINE integrates text better than #AMI at present and has a nice web service.
  • Florida State University. Alexander Garcia, Casey McLaughlin, Leyla Jael Garcia Castro, Biotea (http://biotea.idiginfo.org/ ), Greg Riccardi and colleagues. They are working on suicide in the context of Veterans' Administration documents and provided us with an Open corpus of many hundred PDFs. (Some were good, some were really awful.) Alex and Casey ran the workshop with great energy, preparation, food, beer, etc. and arranged the great support from the ABES site.
  • #crowdcrafting. It will become clear that human involvement is necessary in parts of the PDF2XML process – validating our processes, and also possibly tweaking final outputs. We connected to Daniel Lombraña González of http://crowdcrafting.org/ who took us through the process of building a distributed volunteer community. There was a lot of interest and we shall be designing clear crowdcrafting-friendly tasks (e.g. "draw a rectangle round the title", "highlight corrupted characters", "how many references are there", etc.)

  • CITALO
    http://wit.istc.cnr.it:8080/tools/citalo. This system deduces the type of the citation (reference) from textual analysis. This is a very good example of a downstream application which depends on the XML but is largely independent of how it is created.
  • #AMI2. Our AMI2 system is complementary to many of the others – I am very happy for others to do citation typing, or match keywords. AMI2 has several unique features (I'll explain later), including character identification, graphics extraction (graphics are not images), image extraction, sub- and superscripts, bold and italic. (Most of the other systems ignore graphics completely and many also ignore bold/italic.)

So we have a wonderful synthesis of people and projects and tools. We all want to collaborate and are all happy to put community success as the goal, not individual competition. (And the exciting thing is that it's publishable and will be heavily cited. We have shown this in the Blue Obelisk publications, where the first has 300 citations, and I'd predict that a coherent Jailbreaking publication would be of great interest.)

So yesterday was a turning point. We have clear trajectories. We have to work to make sure we develop rapidly and efficiently. But we can do this initially as a loose collaboration, planning meetings and bringing in other collaborators and funding.

So if you are interested in an Open approach to making PDFs Open and semantic, let us know in the comments.

 

Pre Green-OA Fool’s Gold vs. Post Green-OA Fair Gold

Comment on Richard Poynder's "The UK's Open Access Policy: Controversy Continues":

Yes, the Finch/RCUK policy has had its predictable perverse effects:

1. sustaining arbitrary, bloated Gold OA fees
2. wasting scarce research funds
3. double-paying publishers [subscriptions plus Gold]
4. handing subscription publishers a hybrid-gold-mine
5. enabling hybrid publishers to double-dip
6. abrogating authors’ freedom of journal-choice [economic model/CC-BY instead of quality]
7. imposing re-mix licenses that many authors don’t want and most users and fields don’t need
8. inspiring subscription publishers to adopt and lengthen Green OA embargoes [to maximize hybrid-gold revenues]
9. handicapping Green OA mandates worldwide (by incentivizing embargoes)
10. allowing journal-fleet publishers to confuse and exploit institutions and authors even more

But the solution is also there (as already adopted in Francophone Belgium and proposed by HEFCE for REF):

a. funders and institutions mandate immediate-deposit
b. of the peer-reviewed final draft
c. in the author’s institutional repository
d. immediately upon acceptance for publication
e. whether the journal is subscription or Gold
f. whether access to the deposit is immediate-OA or embargoed
g. whether the license is transferred, retained or CC-BY;
h. institutions implement the repository's facilitated email eprint request Button;
i. institutions designate immediate-deposit as the mechanism for submitting publications for research performance assessment;
j. institutions monitor and ensure immediate-deposit mandate compliance

This policy restores author choice, moots publisher embargoes, makes Gold and CC-BY completely optional, provides the incentive for author compliance and the natural institutional mechanism for verifying it, consolidates funder and institutional mandates, hastens the natural death of OA embargoes, the onset of universal Green OA, and the resultant institutional subscription cancellations, journal downsizing and transition to Fair-Gold OA at an affordable, sustainable price, paid out of institutional subscription cancellation savings instead of over-priced, double-paid, double-dipped Fool's-Gold. And of course Fair-Gold OA will license all the re-use rights users need and authors want to allow.

SePublica: Overview of my Polemics presentation #scholrev

This is a list of the points I want to cover when introducing the session on Polemics. A list looks a bit dry but I promise to be polemical. And try to show some demos at the end. The polemics are constructive in that I shall suggest how we can change the #scholpub world by building a better one than the current one.

NOTE: Do not be overwhelmed by the scale of this. Together we can do it.

It is critical we act now

  • Semantics/Mining is now seen as an opportunity by some publishers to “add value” by building walled gardens.
  • Increasing attempts to convince authors to use CC-NC.
  • We must develop semantic resources ahead of this and push the edges

One person can change the world

We must create a coherent community

  • Examples:
    • OpenStreetMap,
    • Wikipedia
    • Galaxyzoo
    • OKFN Crowdcrafting,
    • Blue Obelisk (Chemistry – PMR),
    • ?#scholrev

Visions

  • Give power to authors
  • Discover, aggregate and search (“Google for science”)
  • Make the literature computable
  • Enhance readers with semantic aids
  • Smart “invisible” capture of information

Practice before Politics

  • Create compelling examples
  • Add Value
  • Make authors’ lives easier
  • Mine and semanticize current scholarship.

Text Tables Diagrams Data

  • Text (chemistry, species)
  • Tables (Jailbreak corpus)
  • Diagrams chemical spectra, phylogenetic trees
  • Data (output). Quixote

Material to start with

  • Open information (EuropePMC, theses)
  • “data not copyrightable”. Supp data, tables, data-rich diagrams
  • Push the limits of what’s allowed (forgiveness not permission)

Disciplines/artefacts with good effort/return ratio

  • Phylogenetic trees (Ross Mounce + PMR)
  • Nucleic acid sequences
  • Chemical formulae and reactions
  • Regressions and models
  • Clinical/human studies (tables)
  • Dose-response curves

Tools, services, resources

    We need a single-stop location for tools

  • Research-enhancing tools (science equiv of Git/Mercurial). Capture and validate work continuously
  • Common approach to authoring
  • Crawling tools for articles, theses.
  • PDF and Word converters to “XML”
  • Classifiers
  • NLP tools and examples
  • Table hackers
  • Diagram hackers
  • Logfile hackers
  • Semantic repositories
  • Abbreviations and glossaries
  • Dictionaries and dictionary builders

     

Advocacy, helpers, allies

  • Bodies who may be interested (speculative, I haven’t asked them):
    • Funders of science
    • major Open publishers
    • Funders of social change (Mellon, Sloan, OSF…)
    • SPARC, DOAJ, etc.
    • (Europe)PMC
  • Crowdcrafting (OKF, am involved with this)
  • Wikipedia

SePublica: Making the scholarly literature semantic and reusable

Scholarly literature has been virtually untouched by the digital revolution in this century. The primary communication is by digital copies of paper (PDFs) and there is little sign that it has brought any change in social structures, either in Universities/Research Establishments or in the publishing industry. The bulk of this industry comprises two sectors, commercial publishing and learned societies. The innovations have been largely restricted to Open Access publishing (pioneered by BMC and then by PLoS) and the megajournal (PLoS ONE).

I shall generalise, and exempt a few players from criticism: The Open Access publishers above with smaller ones such as eLife, PeerJ, MDPI, Ubiquity, etc. And a few learned societies (the International Union of Crystallography and the European Geosciences Union, and please let me have more). But in general the traditional publishers (all those not exempted) are a serious part of the problem and cannot now be part of the solution.

That’s a strong statement. But over the last ten years it has been clear that publishing should change, and it hasn’t. The mainstream publishers have put energy into stopping information being disseminated and creating restrictions on how it can be used. Elsevier (documented on this list) has prevented me extracting semantic information from “their” content.

The market is broken because the primary impetus to publish is increasingly driven by academic recognition rather than a desire to communicate. And this makes it impossible for publishers to act as partners in the process of creating semantics. I hear that one large publisher has now built a walled garden for content mining – you have to pay to access it and undoubtedly there are stringent conditions on its re-use. This isn’t semantic progress, it’s digital neo-colonialism.

I believe that semantics arises out of community practice of the discipline. On Saturday the OKFN is having an economics hackathon (Metametrik) in London where we are taking five papers and aiming to build a semantic model. It might be in RDF, it might be in XML; the overriding principle is that it must be Open, developed in a community process.

And in most disciplines this is actively resisted by the publishing community. When Wikipedia started to use Chemical Abstracts (ACS) identifiers the ACS threatened Wikipedia with legal action. They backed down under community pressure. But this is no way to develop semantics. It can only lead to centralised control of information. Sometimes top-down semantic development is valuable (probably essential in heavily regulated fields) but it is slow, often arbitrary and often badly engineered.

We need the freedom to use the current literature and current data as our guide to creating semantics. What authors write is, in part, what they want to communicate (although the restriction of "10 pages" is often absurd and destroys clarity and innovation). The human language contains implicit semantics, which are often much more complex than that. So Metametrik will formalize the semantics of (a subset of) economic models, many of which are based on OLS (ordinary least squares). Here's part of a typical table reporting results. It's data, so I am not asking permission to reproduce it. [It's an appalling reflection on the publication process that I should even have to, though many people are more frightened of copyright than of doing incomplete science.]

 

And the legend:

How do we represent this table semantically? We have to identify its structure, and the individual components. The components are, for the most part, well annotated in a large metadata table. (And BTW metadata is essential for reporting facts, so I hope no one argues that it's copyrightable. If they do, then scientific data in C21 is effectively paralysed.)

That's good metadata for 2001, when the paper was published. Today, however, we immediately feel the frustration of not linking *instantly* to Gallup and Sachs, or La Porta. And we seethe with rage if we find that they are paywalled – this is scholarly vandalism, preventing the proper interpretation of scholarship.

We then need a framework for representing the data items: real (FP) numbers, with errors and units. There doesn't seem to be a clear ontology/markup for this, so we may have to reuse from elsewhere. We have done this in Chemical Markup Language (its STMML subset), which is fully capable of holding everything in the table. But there may be other solutions – please tell us.
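As a rough sketch (not STMML itself, and with hypothetical field names and values), this is the kind of structure the paragraph is asking for: a number that carries its error and units rather than being a bare float.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Quantity:
    """A measured or fitted value that keeps its error and units attached."""
    value: float
    error: Optional[float] = None
    units: str = "dimensionless"

# hypothetical regression coefficient, for illustration only
coefficient = Quantity(value=-0.87, error=0.32, units="percentage points of GDP growth")
print(coefficient)
```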

But the key point is that the "Table" is not a table. It's a list of regression results where the list runs across the page. Effectively it's regression 1 … regression 11. So a List is probably more suitable than a table. I shall have a hack at making this fully semantic and recomputable.

And at the same time seeing if AMI2 can actually read the table from the PDF.
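Returning to the point that the "Table" is really a list of regressions, a rough sketch of that reading (with hypothetical field names and no real values) might look like this:

```python
# one record per column of the original "table"; the fields are placeholders, not the real variables
regressions = [
    {"id": f"regression{i}", "dependent_variable": None, "coefficients": {}, "n": None, "r_squared": None}
    for i in range(1, 12)
]
print(len(regressions), regressions[0]["id"])  # 11 regression1
```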

I think this is a major way of kickstarting semantic scholarship – reading the existing literature and building re-usables from it. Let’s call it “Reusable scholarship”.

 

 

SePublica: What we must do to promote Semantics #scholrev #btpdf2

In the previous post (http://blogs.ch.cam.ac.uk/pmr/2013/05/23/sepublica-how-semantics-can-empower-us-scholrev-scholpub-btpdf2/) I outlined some of the reasons why semantics are so important. Here I want to show what we have to do (and again stick with me – although you might disagree with my stance).

The absolute essentials are:

  • We have to be a community.
  • We have to identify things that can be described and on which we are prepared to agree.
  • We have to describe them
  • We have to name them
  • We have to be able to find them (addressing)

Here Lewis Carroll, a master of semantics, shows the basics:

And she went on planning to herself how she would manage it. `They must go by the carrier,’ she thought; `and how funny it’ll seem, sending presents to one’s own feet! And how odd the directions will look!

ALICE’S RIGHT FOOT, ESQ.

HEARTHRUG,

NEAR THE FENDER,

(WITH ALICE’S LOVE).

 

Oh dear, what nonsense I’m talking!’

Alice identifies her foot as a foot, and gives it a unique identifier, RIGHT FOOT. The address consists of another unique identifier (HEARTHRUG) and annotates it (NEAR THE FENDER). There's something fundamental about this. (How many children have annotated their books with "Jane Doe, 123 Some Road, This Town, That City, Country, Continent, Earth, Solar System, Universe"?) Hierarchies seem fundamental to humans. Anything else is much more difficult. (Peter Buneman and I have been bouncing this idea about.) I am sure we have to use hierarchies to promote these ideas to newcomers.

Things get unique identifiers. They can be at different levels. Single instances such as Alice’s left foot.

But there are also whole classes – the class of left feet. I have a left foot. It’s distinct from Alice’s. And we need unique names for these classes, such as “left foot“. Generally all humans have one (but see http://en.wikipedia.org/wiki/The_Man_with_Two_Left_Feet ). And we can start making rules, see http://human-phenotype-ontology.org/contao/index.php/hpo_docu.html.

At the moment, all relationships in the Human Phenotype Ontology are is_a relationships, i.e. simple class-subclass relationships. For instance, Abnormality of the feet is_a Abnormality of the lower limbs. The relationships are transitive, meaning that they are inherited up all paths to the root. For instance, Abnormality of the lower limbs is_a Abnormality of the extremities, and thus Abnormality of the feet also is Abnormality of the extremities.
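A small sketch of that transitivity, using just the three terms from the quoted example:

```python
IS_A = {
    "Abnormality of the feet": ["Abnormality of the lower limbs"],
    "Abnormality of the lower limbs": ["Abnormality of the extremities"],
    "Abnormality of the extremities": [],
}

def ancestors(term):
    """Everything reachable by following is_a links upwards (the transitive closure)."""
    found = []
    for parent in IS_A.get(term, []):
        found.append(parent)
        found.extend(ancestors(parent))
    return found

print(ancestors("Abnormality of the feet"))
# ['Abnormality of the lower limbs', 'Abnormality of the extremities']
```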

We see a terminology appearing. Some would call this an ontology, others would dispute that. I tend to use the concept of "dictionary", fuzzed across language and computability.

This is where the difficulties start. On the one hand this is very valuable – if a disease affects the extremities, then it might affect the left foot. But it's also where people's eyes glaze over. Ontology language is formal and does not come naturally to many of us. And when it's applied like a syllogism:

  • All men are mortal
  • Socrates is a man
  • Therefore Socrates is mortal

Many people think – so what? – we knew that already. On the other hand it's quite difficult to translate this into machine language (even after realising that "men" is the plural of "man"). The symbology is frightening (with upside-down A's and backwards E's). Here are fundamental concepts in a type system: http://stackoverflow.com/questions/12532552/what-part-of-milner-hindley-do-you-not-understand :

The discussion on Stack Overflow includes:

  • “Actually, HM is surprisingly simple–far simpler than I thought it would be. That’s one of the reasons it’s so magical”
  • "The 6 rules are very easy. Var rule is rather trivial rule – it says that if type for identifier is already present in your type environment, then to infer the type you just take it from the environment as is." PMR is still struggling with the explanation.
  • "This syntax, while it may look complicated, is actually fairly simple. The basic idea comes from logic: the whole expression is an implication with the top half being the assumptions and the bottom half being the result. That is, if you know that the top expressions are true, you can conclude that the bottom expressions are true as well."

The problem is language and symbology. If you haven't been trained in the language it's often impenetrable. For example, music: if you haven't been trained in it, it makes little sense and takes a considerable time to learn:

So if we want to get a lot of people involved we have to be very careful about exposing newcomers to formal semantics. I avoid words like ontology, quantifier, predicate, disjunction, because people already have to be convinced they are worth learning.

Humans want to learn music not because they’ve seen written music but because they’ve heard music. Similarly we have to sell semantics by what it does, rather than what it is. And we cannot show what it does without building systems, any more than we are motivated to learn about pianos until we have seen and heard one.

The problem is that it's a lot of effort to build a semantic system and that there is not necessarily a clear reward. The initial work, as always, was in computer science, which showed – on paper – what could be possible but didn't leave anything that ordinary people can pick up on. This is very common – before the WWW there was a whole decade or more of publications in "hypermedia", but much of this was only read by people working in the field. And often the major reason for working in a new field is to get academic publications, not to create something useful to the world. There often seems to be a lag of twenty years, and indeed that's happening in semantics.

So it’s very difficult to get public funding to build something that’s useful and works. One effect is that the systems are built by companies. That’s not necessarily a bad thing – railways and telephones came from private enterprise. But there are problems with the digital age and we see this with modern phones – they can become monopolies which constrain our freedom. We buy them to communicate but we didn’t buy them to report our location to unknown vested interests.

And semantics have the same problem. The people who control our semantics will control our lives. Because semantics constrain the formal language we use and that may constrain the natural language. We humans may not yet be in danger of Orwell’s Newspeak but our machines will be. And therefore we have to assert rights to have say over our machines’ semantics.

That raises the other problem – semantic Babel. If everyone creates their own semantics no-one can talk (we already see this with phone apps). I live in the semantic Babel of machine-chemistry – every company creates a different approach. Result – chemistry is 20 years behind bioscience where there is a communal vision of interoperable semantics.

So I think the major task for SePublica is to devise a strategy for bottom-up Open semantics. That’s what Gene Ontology did for bioscience. We need to identify the common tools and the common sources of semantic material. And it will be slow – it took crystallography 15 years to create their dictionaries and system and although we are speeding up we’ll need several years even when the community is well inclined. (That’s what we are starting to do in computational chemistry – the easiest semantic area of any discipline). It has to be Open, and we have to convince important players (stakeholders) that it matters to them. Each area will be different. But here are some components that are likely to be common to almost all fields:

  • Tools for creating and maintaining dictionaries
  • Ways to extract information from raw sources (articles, papers, etc.) – that’s why we are Jailbreaking the PDF.
  • Getting authorities involved (but this is increasingly hard as the learned societies are often our problem, not the solution)
  • Tools to build and encourage communities
  • Demonstrators and evangelists
  • Stores for our semantic resources
  • Working with funders

We won’t get all of that done at SePublica. But we can make a lot of progress.

SePublica: How semantics can empower us; #scholrev #scholpub #btpdf2

I’m writing blog posts to collect my thoughts for the wonderful workshop at SePublica http://sepublica.mywikipaper.org/drupal/ where I am leading off the day. [This also acts as a permanent record instead of slides. Indeed I may not provide slides as such as I often create the talk as I present it.] My working title is

Why and how can we make Scholarship Semantic?

[If you switch off at “Semantics” trust me and keep reading… There’s a lot here about changing the world.]

Why should we strive to create a semantic web/world? I "got it" when I heard TimBL in 1994. Many people have "got it". There are startups based on creating and deploying semantic technology. My colleague Nico Adams (who understands much more about the practice of semantics than me) has a vision of creating a reasoning engine for science (he's applied this to polymers, biotechnology, chemistry). I completely buy his vision.

But it’s hard to sell this to people who don’t understand. Any more than TimBL could sell SGML in 1990. (Yes there were whole industries who bought into SGML, but most didn’t). So what TimBL did was to build a system that worked (The WWW). And this often seems to be the requirement for Semantic Web projects. Build it and show it working.

SePublica will probably be attended by the converted. I don’t think I have to convince them of the value of semantics. But I do have to catalyse:

  • The creation of convincing demonstrators (examples that work)
  • Arguments for why we need semantics and what it can do.

So why are semantics important for scholarly publishing? The following arguments will hopefully convince some people:

  • They unlock the value of the stuff already being published. There is a great deal in a single PDF (article or thesis) that is useful. Diagrams and tables are raw, exciting resources. Mathematical equations. Chemical structures. Even what we have today, converted into semantic form, would add billions.
  • They make information and knowledge available to a wider range of people. If I read a paper with a term I don’t know then semantic annotation may make it immediately understandable. What’s rhinovirus? It’s not a virus of rhinoceroses – it’s the common cold. That makes it accessible to many more people (if the publishers allow it).
  • They highlight errors and inconsistencies. Ranging from spelling errors to bad or missing units to incorrect values to stuff which doesn’t agree with previous knowledge. And machines can do much of this. We cannot have reproducible science until we have semantics.
  • They allow the literature to be computed. Many of the semantics define objects (such as molecules or phylogenetic trees) which are recomputable. Does the use of newer methods give the same answer?
  • They allow the literature to be aggregated. This is one of the most obvious benefits. If I want all phylogenetic trees, I need semantics – I don’t want shoe-trees or B-trees or beech trees. And many of these concepts are not in Google’s public face (I am sure they have huge semantics internally)
  • They allow the material to be searched. How many chemists use halogenated solvents? (The word halogen will not occur in the paper.) With semantics this is a relatively easy thing to do. Can you find second-order differential equations? Or Fourier series? Or triclinic crystals? (The words won't help.) AMI2 will be able to.
  • They allow the material to be linked into more complex concepts. By creating a database of species, a database of geolocations and links between them we start to generate an index of biodiversity. What species have been reported when and where? This can be used for longitudinal analyses – is X increasing/decreasing with time? Where is Y now being reported for the first time?
  • They allow humans to link up. If A is working on Puffinus puffinus (no, it's not a puffin; that's Fratercula arctica) in the northern hemisphere and B is working on Puffinus tenuirostris in Port Fairy, Victoria, AU, then a shared knowledgebase will help to bring the humans together. And that happens between subjects – microscopy can link with molecular biology with climate with chemistry.

In simple terms semantics allow smart humans to develop communal resources to develop new ideas faster, smarter and better.

Please add other ideas! I am sure I have missed some.

 

Food and Energy Security Publishes Issue 2.1

Food and Energy Security is a new high-quality open access journal publishing high-impact original research on agricultural crop and forest productivity to improve food and energy security. We are delighted by the high level of readership which our first two issues received and we would like to inform you that Issue 2.1 of this journal has now been published and is free for all to read, download and share.

Highlights from this issue include:

Food and thriving people: paradigm shifts for fair and sustainable food systems
by Geoff Tansey
Summary: This article looks beyond the physical sciences to address the problems of hunger, malnutrition, and environmental degradation. It discusses the challenges and problems with global food security and where and why paradigm shifts are needed to meet those challenges in a fair and sustainable way.

Biomass properties from different Miscanthus species
by Chenchen Liu, Liang Xiao, Jianxiong Jiang, Wangxia Wang, Feng Gu, Dongliang Song, Zili Yi, Yongcan Jin and Laigeng Li
Summary: Miscanthus has been considered a potential energy crop for lignocellulosic biomass production. Four Miscanthus species widely distributed in China were assessed for their biomass production, chemical composition, and saccharification efficiency.
 
Prospects of doubling global wheat yields
by Malcolm J. Hawkesford, Jose-Luis Araus, Robert Park, Daniel Calderini, Daniel Miralles, Tianmin Shen, Jianping Zhang and Martin A. J. Parry
Summary: Whilst an adequate supply of food can be achieved at present for the current global population, sustaining this into the future will be difficult in the face of a steadily increasing population. Wheat alone provides ~20% of the calories and the protein for the world's population, and the value and need to increase the production is recognized widely.

If you enjoy reading these articles then why not submit your paper to Food and Energy Security? You can submit via our online submission site >

Don’t miss any of the papers as they publish. Sign up for content alerts here >

#scholrev #ami2 #btpdf2 Jailbreaking content (including tables) from PDFs

We've got a splendid collection of about 600 Open PDFs for our jailbreak hackathon. They seem to have a medical focus. They are of very variable type and quality. Some are reports or guidelines, some academic papers. Some are born digital, but at least one is scanned with OCR, where the image and the text are superposed. (BTW I am taking it on trust that the papers are Open – some are from closed access publishers and carry their copyright. It's time we started marking papers as Open ON THE PAPER.)

I have given these to #AMI2 – she processes a paper in about 10 secs on my laptop, so it's just over an hour for the whole lot. That gives me a chance to blog some more. In rev63 AMI was able to do tables, so here, without any real selection, I'm giving some examples. (Note that some tables are not recognised as such – especially when the authors don't use the word "table". But we shall hack those in time…) Also, as HTML doesn't seem to have a tableFooter that manages the footnotes, I have temporarily added this to the caption as a separate paragraph.

From Croat Med J. 2007;48:133-9:

The table in the PDF

 

AMI’s translation to HTML:

Table 1. Scores achieved by 151 Croatian war veterans diagnosed with posttraumatic stress disorder on the Questionnaire on Traumatic Combat and War Experiences (USTBI-M), Mississippi Scale for Combat-Related Post-Traumatic Stress Disorder (M-PTSD), and Minnesota Multiphasic Personality Inventory (MMPI)-201 (presented as T values)

*Abbreviations: L – rigidity in respondents’ approach to the test material; F – lack of understanding of the material; K – tendency to provide socially acceptable answers.

 

| Questionnaire | Score (mean ± standard deviation) | Cut-off score |
| --- | --- | --- |
| USTBI-M | 77.8 ± 14.3 | Maximum: 120 |
| M-PTSD | 122.1 ± 22.9 | 107 |
| MMPI-201 scales* | | |
| L | 51.1 ± 2.0 | 70 |
| F | 73.2 ± 6.3 | 70 |
| K | 42.4 ± 3.2 | 70 |
| | 87.6 ± 5.1 | 70 |
| | 96.7 ± 6.6 | 70 |
| | 88.2 ± 4.7 | 70 |
| | 67.3 ± 4.8 | 70 |
| | | |
| | 79.3 ± 5.8 | 70 |
| Pt (psychastenia) | 75.4 ± 5.7 | 70 |
| | 72.1 ± 7.4 | 70 |
| | 52.3 ± 2.6 | 70 |

 

COMMENT: Some of the row labels/headings are omitted, but I think that can be solved. (Remember this is AMI's first attempt, so we call it alpha.)

Here’s another:



And what AMI translates it to:

Table 2 The comparison of quality of life among study groups using analysis of variance and post-hoc tests

*Group-by-group comparisons that were significant at the level of P < 0.001 performed using LSD (homogenous variance; used for physical and overall quality of life) or Dunnet T3 (unhomogenous variance; all other questions). The significance was set at P < 0.001 in post-hoc test in order to reduce the increased chances of false positive results.

| QOL dimension/status | Groups | N | Mean ± SD | F; P | Post-hoc differences* |
| --- | --- | --- | --- | --- | --- |
| Physical | PTSD + LBP (I) | 79 | 75.44 ± 11.33 | | |
| | PTSD (II) | 56 | 78.43 ± 11.54 | 49.18; | I-III, I-IV, II-III, |
| | LBP (III) | 84 | 87.43 ± 13.84 | < 0.001 | II-IV, III-IV |
| | Controls (IV) | 134 | 94.42 ± 11.65 | | |
| | Total | 353 | 85.97 ± 14.40 | | |
| Psychological | PTSD + LBP (I) | 76 | 63.74 ± 14.60 | | |
| | PTSD (II) | 58 | 67.45 ± 15.92 | 79.05; | I-III, I-IV, II-III, |
| | LBP (III) | 90 | 80.27 ± 14.59 | < 0.001 | II-IV, III-IV |
| | Controls (IV) | 132 | 90.67 ± 10.76 | | |
| | Total | 356 | 78.51 ± 17.44 | | |
| Social | PTSD + LBP (I) | 80 | 33.40 ± 8.89 | | |
| | PTSD (II) | 58 | 35.93 ± 9.98 | 70.19; | I-III, I-IV, II-III, |
| | LBP (III) | 91 | 41.58 ± 8.78 | < 0.001 | II-IV, III-IV |
| | Controls (IV) | 134 | 49.22 ± 7.13 | | |
| | Total | 363 | 41.70 ± 10.6 | | |
| Enviromental | PTSD + LBP (I) | 79 | 92.81 ± 20.78 | | |
| | PTSD (II) | 58 | 100.76 ± 19.79 | 66.27; | I-III, I-IV, II-IV, |
| | LBP (III) | 88 | 108.36 ± 17.71 | < 0.001 | III-IV |
| | Controls (IV) | 130 | 126.06 ± 14.27 | | |
| | Total | 355 | 110.14 ± 22.02 | | |
| Satisfaction with personal health status | PTSD + LBP (I) | 80 | 1.84 ± 0.74 | | |
| | PTSD (II) | 59 | 2.36 ± 0.85 | 127.48; | I-II, I-III, I-IV, II-IV, |
| | LBP (III) | 95 | 2.70 ± 0.98 | < 0.001 | III-IV |
| | Controls (IV) | 135 | 4.03 ± 0.85 | | |
| | Total | 369 | 2.94 ± 1.23 | | |
| Overall self-reported quality of life | PTSD + LBP (I) | 73 | 2.82 ± 1.14 | | |
| | PTSD (II) | 49 | 3.29 ± 1.28 | 24.04; | I-II, I-III, I-IV, II-III, |
| | LBP (III) | 75 | 4.04 ± 1.25 | < 0.001 | II-IV |
| | Controls (IV) | 42 | 4.48 ± 0.80 | | |
| | Total | 239 | 3.59 ± 1.31 | | |

 

I think she’s got it completely right (the typos “Enviromental” and “Unhomogenous” are visible in the PDF).

AFAIK there is no automatic Open extractor of tables so we are very happy to contribute this to the public pool.