World Opera, Collaborative Science, and Getting On The One

(blows off the dust since the last entry)

(Life trumped blogging; my first child was born in March)

Just before I went into the parent tunnel, which is awesome by the by, I attended a seminar conducted by Niels Windfeld Lund, General Manager of the World Opera.

Not my usual event. But music’s always been a passion for me, and I performed a lot as a kid – lots of trumpet, both the sort of American wind orchestra stuff (seated and marching… yes, a band geek) and some jazz, plus a little drums. These days I plink around on an acoustic bass, badly, but well enough that I’ll be able to sing lullabies to my newborn. Starting to play again made me realize how much of music is a conversation, just like science is a conversation (well, an argument). And so this world opera thing seemed like an interesting way to come at the problem from a field that is semantically a world away from science, but in design space is remarkably similar.

This was the second of two recent musical events for me that bear on open collaborative science. The first, and the one we can draw a lesson from, is the World Choir: a collaborative choir of more than 2,000 people that got tons of press as a collaboration breakthrough. It was designed as an asynchronous request for videos, with a ton of post-processing to stitch them into a single performance.

Then there’s the World Opera, which is all about actually performing an opera live with the performers in multiple cities around the world (it needs better marketing help – the first sentence on the website is “You may not have heard about World Opera”). There’s tons of dimensionality baked into that idea from the start. Niels got it funded in his northern Norwegian home of Tromsø after first pitching telemedicine, which has the same fundamental requirements as online opera: big fiber, low latency, great audio/video capability, and the ability to do meaningful real-time interaction with remote sites. Surgery or string rehearsal, you have to be able to replicate an intense real-life experience. The government apparently preferred opera, which is both wonderful and improbable to me.

They wrestled, or are wrestling, with technical and existential questions. How to use the inevitable delay between performers, turning it into something like the acoustics of a conference hall. Whether or not to use a live conductor as a distinct part of the performance, or to use a metronome, or something in between. Basic stuff, like how to practice together but apart. They are able to do it, but it takes much more work than it would in a regular opera.

It’s hard to get a group on the one. It’s hard when you’re all in the same place. That’s why good labs or departments (or startups) have regular journal clubs, regular lunch sessions, coffee machines that require a fair amount of time to prepare a drink. It helps create that extra time where the individuals involved fall into a rhythm together. Eric Schadt has called it the “clock gene” of a good lab. And it’s been hard to virtually create in the sciences.

My gut is that we have the two musical performances mixed up. A lot of what we mean by open science is the choir: we’ll do crowdsourced data collection, we’ll see a surge of data from impassioned observers into online groups like Sage Bionetworks, but that data will have to be painstakingly synced and organized before we get a beautiful model.
Real collaborative science is going to be hard, like the opera, because it’ll be hard to get on the one. Big questions, both technical and epistemic, have to get answered.

Collaborative opera is totally disruptive to regular opera. It will be resisted, its flaws will be evident with no post-processing to make it shiny. It’s not just sound, it’s a story, it’s acting, it’s interaction between the performers themselves and the audience. It’s going to suck compared to a purist’s opera – at first.

But as the group learns, and they will, it’ll suck a lot less. Then it’ll be really, really good. Incremental innovation will smooth a ton of edges. The performers will figure out if they want an avatar or a conductor. They’ll get used to the latency, and “hear” it when they play together. There’ll be a disrupting tech, probably made by a frustrated musician, that makes some vital but boring process suddenly either a) easier or b) stable or both. Online opera will become a vital part of opera.

These problems, this inherent resistance (in the electrical sense, not the political or incentive sense), are the sort of thing we have to get used to in open science. We can run a bunch of virtual choirs – that’s what 23andMe is doing, and I’m a customer. But our infrastructure, and our design thinking, and most of all our expectations, have to support opera, because opera, like science, is hard.

Read the comments on this post…

Documents and Data…

Last month I was on Dr. Kiki’s Science Hour. Besides being a lot of fun (despite my technical problems, which were part of my recent move to GNU/Linux and away from Mac!), I also discovered that at least one person I went to high school with is a fan of Dr. Kiki, because he told everyone about the show at my recent high school reunion. Good stuff.

In the show, I did my usual rant about the web being built for documents, not for data. And that got me a great question by email. I wrote a long answer that I decided was a better blog post than anything else. Here goes.

Although I’m familiar with the Creative Commons &amp; Science Commons, the interview really helped me understand the bigger picture of the work you do. Among many other significant and timely anecdotes, I received the message that the internet is built around document search and not data search. This comment intrigued me immensely. I want to explore that a little more to understand exactly what you meant. Most importantly, I want to understand what you believe the key differences between documents and data are. From one perspective, the documents contain the data; from another, the data forms the documents.

True, in some cases. But in the case of complex adaptive systems – like the body, the climate, or our national energy usage – the data are frequently not part of a document. They exist in massive databases which are loosely coupled, and are accessed by humans not through search engines but through large-scale computational models. There are so many layers of abstraction between user and data that it’s often hard to know where the actual data at the base of a model reside.

This is at odds with the fundamental nature of the Web. The Web is a web of documents. Those documents are all formatted the same way, using a standard markup language, and the same protocol to send copies of those documents around. Because the language allows for “links” between documents, we can navigate the Web of documents by linking and clicking.

There’s more fundamental stuff to think about. Because the right to link is granted to creators of web pages, we get lots of links. And because we get lots of links (and there aren’t fundamental restrictions on copying the web pages) we get innovative companies like Google that index the links and rank web pages, higher or lower, based on the number of links referring to those pages. Google doesn’t know, in any semantic sense, what the pages are about, what they mean. It simply has the power to do clustering and ranking at a scale never before achieved, and that turns out to be good enough.
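A toy sketch of that kind of semantics-free ranking: count inbound links and nothing else. This is purely illustrative (all the domains are invented), and Google's actual PageRank is an iterative computation over the whole link graph, not a simple count, but it captures the point that no understanding of page content is required.

```python
# Toy link-based ranking: score pages purely by inbound-link counts,
# with zero semantic understanding of what any page is about.
links = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
    "d.com": ["c.com"],
}

# Tally how many pages link to each target.
inbound = {}
for source, targets in links.items():
    for target in targets:
        inbound[target] = inbound.get(target, 0) + 1

# Rank pages by inbound-link count, highest first.
ranked = sorted(inbound, key=inbound.get, reverse=True)
print(ranked)  # c.com comes out on top: three pages link to it
```

The engine never knows what "c.com" means; it only knows it is heavily linked, and at web scale that turns out to be good enough.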

But in the data world, very little of this applies. The data exist in a world almost without links. There is no accepted standard language for marking up data, though some are emerging. And even if you had one, you would just get another problem: the problem of semantics and meaning. So far at least, the statistics aren’t good enough to help us structure data the way they structure documents.

From what you posited and the examples you gave, I envision a search engine which has the capacity to form documents out of data using search terms, e.g. enter two variables and get a graph as a result instead of page results. Not too far from what ‘Wolfram Alpha’ is working on, but indexing all the data rather than pre-tabulated information from a single server/provider. Perhaps I’m close but I want to make sure we’re on the same sheet of music.

I’m actually hoping for some far more basic stuff. I am less worried about graphing and documents. If you’re at that level, you’ve a) already found the data you need and b) know what questions you want to ask about it.

This is the world in which one group of open data advocates live. It’s the world of apps that help you catch the bus in Boston. It’s one that doesn’t worry much about data integration, or data interoperability, because it’s simple data – where is the bus and how fast is it going? – and because it’s mapped against a grid we understand, which is…well, a map.

But the world I live in isn’t so simple. Doing deeply complex modeling of climate events, of energy usage, of cancer progression – these are not so easy to turn into iPhone apps. The way we treat them shouldn’t be with the output of a document. It’s the wrong metaphor. We don’t need a “map” of cancer – we need a model that tells us, given certain inputs, what our decision matrix looks like.

I didn’t really get this myself until we started playing around with massive-scale data integration at Creative Commons. But since then, in addition to what we do here, I’ve been to the NCBI, I’ve been to Oak Ridge National Lab, I’ve been to CERN…and the data systems they maintain are monstrous. They’re not going to be copied and maintained elsewhere, at least, not without lots of funding. They’re not “webby” like mapping projects are. There’s not a lot of hackers who can use them, nor is there a vast toolset to use.

So I guess I’m less interested in search engines for data than I am in making sure that the people who are building the models can use crawlers to find the data they want, and that they are legally allowed to harvest and integrate that data. Doing so is not going to be easy. But if we don’t design for that world, for model-driven access, then harvest and integration will quickly approach NP levels of complexity. We cannot assume that the tools and systems that let us catch the bus will let us cure cancer. They may, someday, evolve into a common system, and I hope they do – but for now, the iPhone approach is using a slingshot against an armored division.


Marking and Tagging the Public Domain

I am cribbing significant amounts of this post from a Creative Commons blog post about tagging the public domain. Attribution goes to Diane Peters for the material I’ve incorporated 🙂

The big news is that, 18 months after we launched CC0 1.0 – our public domain waiver that allows rights holders to place a work as nearly as possible into the public domain, worldwide – it’s been a success. CC0 has proven a valuable tool for governments, scientists, data providers, providers of bibliographic data, and many others throughout the world. CC0 has been used by the pharmaceutical giant GSK as well as by the emerging open data leader Sage Bionetworks (disclosure – I’m on the Board of Sage, though not of GSK!).

At the time we published CC0, we made note of a second public domain tool under development — a tool that would make it easy for people to tag and find content already in the public domain. That tool, our new “Public Domain Mark” is now published for comment.

The PDM allows works already in the public domain to be marked and tagged in a way that clearly communicates the work’s PD status and allows it to be easily discoverable. The PDM is not a legal instrument like CC0 or our licenses – it can only be used to label a work with information about its public domain copyright status, not to change the work’s current status under copyright. However, just like CC0 and our licenses, the PDM has a metadata-supported deed and is machine-readable, allowing works tagged with the PDM to be found on the Internet. (Please note that the example used on the sample deed is purely hypothetical at the moment.)
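To make the “machine-readable” part concrete, here is a toy Python sketch of the kind of RDFa-style tag a marking tool might emit. The PDM identifier URL is the real one; the helper function, work URL, title, and exact markup are my own invention for illustration, not the actual output of CC’s deed.

```python
# Hypothetical sketch of machine-readable public domain marking.
# Only the PDM identifier URL below is real; everything else
# (function, work URL, title, markup shape) is invented.
PDM_URI = "http://creativecommons.org/publicdomain/mark/1.0/"

def pdm_snippet(work_url: str, title: str) -> str:
    """Return an RDFa-flavored HTML fragment asserting PD status."""
    return (
        f'<p about="{work_url}">'
        f'<a href="{PDM_URI}" rel="license">Public Domain Mark 1.0</a> '
        f'applies to "{title}".'
        f'</p>'
    )

snippet = pdm_snippet("http://example.org/engraving.jpg", "Old Engraving")
print(snippet)
```

The point is that a crawler parsing the page can read the `about` and `rel="license"` attributes and learn the work’s PD status without any human in the loop.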

We are also releasing for public comment general purpose norms — voluntary guidelines or “pleases” that providers and curators of PD materials may request be followed when a PD work they have marked is thereafter used by others. Our PDM deed as well as an upcoming enhanced CC0 deed will support norms in addition to citation metadata, which will allow a user to easily cite the author or provider of the work through copy-paste HTML.

This is absolutely critical to science, because it addresses at last the biggest reason that people misuse copyright licenses on uncopyrightable materials and data sets: the confusion of the legal right of attribution in copyright with the academic and professional norm of citing one’s sources. Making it easy to cite, regardless of the law, is one of the keys to making the public domain something we can construct through individual private choice at scale, not just by getting governments to adopt it.

The public comment period will close on Wednesday, August 18th. Why so short? For starters, the PDM is not a legal tool in the sense that our licenses and CC0 are legally operative – no legal rights are being surrendered or affected, and there is no accompanying legal code to finesse. Just as importantly, however, we believe that getting the mark used sooner rather than later will allow early adopters to provide us with invaluable feedback on actual implementations, which will let us improve the marking tool in the future.

The primary venue for submitting comments and discussing the tool is the cc-licenses mailing list. We look forward to hearing from you!

There are a lot of fascinating projects around doing the non-legal work of data. The Sage Commons has seen a bunch of them come together, but in this context I want to call out the SageCite project, driven by UKOLN, the University of Manchester, and the British Library, which is going to develop and test an entire framework for citation, not attribution, using bioinformatics as a test case.

My own hope is that by making citation inside Creative Commons legal tools that work on the public domain a cut-and-paste process, we can facilitate the emergence of frameworks like SageCite so that the legal aspects fade away on the data sets and databases themselves, and the focus can be on the more complex network models of complex adaptive systems. And I’m tremendously excited to see members of the community leveraging the Sage project to do independent, crucial work on the topic of citation. Like Wikipedia, Sage won’t work unless it is something that we all own together and work on for our own reasons.

This is still only the beginning of really open data – public domain data – that complies with the Panton Principles. Creative Commons has spent six long years studying the open data issue, and rolling out policy, tools, and technologies that make it possible for end users from the Dutch government to the Polar Information Commons to create their own open data systems.

We still have to avoid the siren song of property rights on data, and of license proliferation. But it’s starting to feel like momentum is gaining on public domain data, and for the Creative Commons tools that make it a reality. Making citation one-click, and making it easy to tag and mark the public domain, is part of that momentum. Please help us by commenting on the tools, and by promoting their use when you run across any open data project where the terms are unclear.


rdf:about="Shakespeare"

Dorothea has written a typically good post challenging the role of RDF in the linked data web, and in particular, its necessity as a common data format.

I was struck by how many of her analyses were spot on, though my conclusions are different from hers. But she nails it when she says:

First, HTML was hardly the only part of the web stack necessary to its explosion. TCP/IP, anyone?

I’m on about this all the time. The idea that we are in web-1995-land for data astounds me. I’d be happy if I were to be proven wrong – trust me, thrilled – but I don’t see the core base of infrastructure for a data web to explode. I see an exploding capability to generate data, and of computational capacity to process data. I don’t see the technical standards in place that enable the concurrent explosion of distributed, decentralized data networks and distributed innovation on data by users.

The Web sits on a massive stack of technical standards that pre-dated it, but that were perfectly suited to a massive pile of hypertext. The way the domain name system gave human-readable domains to dotted quads lent itself easily to nested trees of documents linked to each other, and those documents didn’t need any more machine-readable context than some instructions to the computer about how to display the text. It’s also vital to remember that we as humans were already socially wired to use documents, which was deeply enabling to the explosion of the Web: all we had to do was standardize what the documents looked like and where they were located.

On top of that, at exactly the moment in time that the information on the web started to scale, a key piece of software emerged – the web browser – that made the web, and in many ways the computer itself, easier to use. The graphic web browser wasn’t an obvious invention. We don’t have anything like it for data.

My instinct is that it’s going to be at least ten years’ worth of technical development, especially around drudgery like provenance, naming, versioning of data, but also including things like storage and federated query processing, before the data web is ready to explode. I just don’t see those problems being quick problems, because they aren’t actually technical problems. They’re social problems that have to be addressed in technology. And them’s the worst.

We simply aren’t yet wired socially for massive data. We’ve had documents for hundreds of years. We have only had truly monstrous-scale data for a couple of decades.

Take climate. Climate science data used to be traded on 9-track tapes – as recently as the 1980s. Each 9-track tape maxes out at 140MB. For comparison’s sake, I am shopping for a 2TB backup drive at home. 2TB in 9-tracks is a stack of tapes taller than the Washington Monument. We made that jump in less than 30 years, which is less than a full career-generation for a working scientist. The move to petabyte-scale computing is being wedged into a system of scientific training, rewards, incentives, and daily practice for which it is not well suited. No standard fixes that.
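The arithmetic checks out. A quick back-of-the-envelope sketch, assuming roughly 25 mm of stack height per mounted reel (my guess at a typical reel-plus-flange thickness, not a spec), against the monument’s real height of about 169 m:

```python
# Back-of-the-envelope check on the 9-track comparison.
TAPE_CAPACITY_MB = 140          # per 9-track tape (from the post)
DRIVE_TB = 2                    # the backup drive being shopped for
REEL_THICKNESS_M = 0.025        # assumed thickness per stacked reel
WASHINGTON_MONUMENT_M = 169     # actual height, roughly

tapes = (DRIVE_TB * 1_000_000) / TAPE_CAPACITY_MB  # ~14,286 reels
stack_height_m = tapes * REEL_THICKNESS_M          # ~357 m of tape
print(round(tapes), "tapes,", round(stack_height_m), "meters")
```

Even with a conservative guess at reel thickness, the stack comes out around twice the height of the monument.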

Documents were easy. We have a hundreds-of-years old system of citing others’ work that makes it easy, or easier, to give credit and reward achievement. We have a culture for how to name the documents, and an industry based on making them “trusted” and organized by discipline. You can and should argue about whether or not these systems need to change on the web, but I don’t think you can argue that the document culture is a lot more robust than the data culture.

I think we need to mandate data literacy the way we mandate language literacy, but I’m not holding my breath that it’s going to happen. Til then, the web will get better and better for scientists, the way the internet makes logistics easier for Wal-Mart. We’ll get simple mashups, especially of data that can be connected to a map. But the really complicated stuff, like oceanic carbon, that stuff won’t be usable for a long time by anyone not trained in the black arts of data curation, interpretation, and model building.

Dorothea raises another point I want to address:

“not all data are assertions” seems to escape some of the die-hardiest RDF devotees. I keep telling them to express Hamlet in RDF and then we can talk.

This “express Hamlet in RDF” argument is a MacGuffin, in my opinion – it will be forgotten by the third act of the data web. But damn if it’s not a popular argument to make. Clay Shirky did it best.

But it’s irrelevant. We don’t need to express Hamlet in RDF for expressing data in RDF to be useful. It’s like getting mad at a car because it’s not an apple. There are absolute boatloads of data out there that absolutely need to be expressed in a common format. Doing climate science or biology means hundreds of databases, filling at rates unimaginable even a few years ago. I’m talking terabytes a day, soon to be petabytes a day. That’s what RDF is for.

It’s not for great literature. I’ll keep to the document format for The Bard, and so will everyone. But he does have something to remind us about the only route to the data web:

Tomorrow and tomorrow and tomorrow,
Creeps in this petty pace from day to day

It’s going to be a long race, but it will be won by patience and day-by-day advances. It must be won that way, because otherwise we won’t get the scale we need. Mangy approaches that work for Google Maps mashups won’t cut it. RDF might not be able to capture love, or literature, and it may be a total pain in the butt, but it does really well on problems like “how do I make these 49 data sources mix together so I can run a prediction of when we should start building desalination plants along the Pacific Northwest seacoast due to lower snowfall in the Cascade Mountains?”

That’s the kind of problem that has to be modelable, and it has to run against every piece of data possible. It’s an important question to understand as completely as can be. The lack of convenience imposed by RDF is a small price to pay for the data interoperability it brings in this context, to this class of problem.

As more and more infrastructure does emerge to solve this class of problem, we’ll get the benefits of rapid incremental advances on making that infrastructure usable to the Google Maps hacker. We’ll get whatever key piece, or pieces, of data software that we need to make massive scale data more useful. We’ll solve some of those social problems with some technology. We’ll get a stack that embeds a lot of that stuff down into something the average user never has to see.

RDF will be one of the key standards in the data stack, one piece of the puzzle. It’s basically a technology that duct-tapes databases to one another and allows for federated queries to be run. Obviously there needs to be more in the stack. SPARQL is another key piece. We need to get the names right. But we’ll get there, tomorrow and tomorrow and tomorrow…
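A minimal sketch of that duct-taping, using plain Python tuples in place of real RDF serializations and SPARQL. Every dataset, identifier, and predicate name below is invented; the point is only that once two independent databases speak the same subject–predicate–object shape, merging them and querying across them becomes trivial.

```python
# Two independent "databases" expressed as (subject, predicate, object)
# triples -- the shape RDF standardizes. All names here are invented.
climate_db = {
    ("station:42", "measures", "snowfall"),
    ("station:42", "locatedIn", "Cascades"),
}
hydro_db = {
    ("station:42", "feeds", "reservoir:9"),
    ("reservoir:9", "supplies", "Portland"),
}

# "Duct-taping" the databases together is just a set union,
# because both already use the same triple format.
merged = climate_db | hydro_db

def query(graph, pred):
    """Return (subject, object) pairs matching a predicate pattern."""
    return {(s, o) for s, p, o in graph if p == pred}

# A cross-database question neither source could answer alone.
print(query(merged, "feeds"))  # {('station:42', 'reservoir:9')}
```

Real federated query engines do vastly more (distributed sources, joins, naming, provenance), but the common format is what makes the union step a non-event instead of a data-integration project.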


Of Pepsi and ScienceBlogs…

I’ve gotten a few emails about the Pepsi-ScienceBlogs tempest. It’s clearly taken a toll on ScienceBlogs’ credibility. Some of my SciBlings have resigned in protest, and others are taking shots on the topic.

Sponsorship is part of scientific publishing, even in the peer reviewed world. Remember how Merck published an entire fake journal to promote Vioxx? How much money gets spent on reprints that support a company’s position, on articles paid for with corporate research funds?

Today’s hullabaloo is more honest than either of those. My gut reaction is: calm down, world. This was a miserable rollout in which a lack of transparency and community engagement turned a little fire into a conflagration.

I’m not going to resign my blog here, at least, not now. I am not a sponsored blog. I receive salary from my (non profit) employer, Creative Commons, and I also take in a little consulting revenue and the odd speaking fee, all in the service of promoting the digital commons. I am checking with all my arrangements to see if I’m allowed to fully disclose, and if I can, I’ll publish my list here.

I also know Adam Bly pretty well on a personal level. He’s a good guy. In full disclosure, he’s been a supporter of Creative Commons personally, and Seed supports the organization professionally as well. But I don’t think that colors my opinion today. He’s not out to sell bad science; he’s out to transform scientific publishing on the internet. I have no doubt that this decision was arrived at after lengthy debate and internal argument.

But I don’t know anything more than what I’ve read on the internet. I filter my SB mail for reading once a week or so, as I get too much email every day as there is. So I found out about this when my twitter feed exploded today.

I am sanguine about the realities of running a site like ScienceBlogs. It’s not free. I’ve run a company. I know what it’s like to hire people in fat times and to lay them off in lean times, how hard it is to tell investors that the revenues are drying up. It’s a perspective that is hard earned. It’s a reality that forces decisions in support of shareholders, not just in support of bloggers and readers. And in a massive recession that becomes even more true.

But that perspective means that the choice is understandable, not that the situation was handled well. If a site like SB is going to do this, then the entire process must be painfully transparent. I’ve watched as sites I love, like Fark.com and some of the various Gawker blogs, began to accept sponsored links – but they are LABELED as such. These SB blogs need to be plastered with the fact that they are indeed bought space, bought by companies, not by individuals thinking freely (like the rest of us). Different graphic design, disclaimer text in the templates, that sort of thing. I would personally love to see a piece of RDFa that my browser can auto-ignore, just as I block pop-up ads.

This screenshot makes it pretty clear it’s sponsored, and that it’s “advertorial” content. It’s a good start, though too late to stop the frenzy that is an internet blamestorm.

The problem is that it’s not something that can be done post hoc. Distinguishing between content and advertisement needs to be done in a fashion that is transparent to the community at large, because although the decision to accept sponsored blogs may help shareholders, it affects the people the sponsored blogs want to associate with (us free-thinking bloggers) and those they want to read the sponsored blogs (that’s you, people). And we’re the community that got smeared by the rollout.

That’s the anger, that’s what is driving the reaction here. We weren’t consulted (the royal we) in advance. And even if we hadn’t said anything smart or interesting, getting the chance to chime in on this type of thing would have released the tension in a way that created more trust, not less trust. I’m going to argue for more transparency from SB as to their finances and decisionmaking, but I’m not going to leave because of one false step.

Because I’ve made some myself, and I believe in treating others as I would like to be treated.

Obviously, I reserve the right to change my mind as new data rolls in. If the site continues to display the sort of tin ear it has displayed in this one, then I’ll have to refactor. But for today, I’m sticking around, and urging calm thinking and open minds.


Kaitlin Thaney moves on…

I tend to want to make posts on Creative Commons related topics at the CC blog, but this is essentially a personal post, and I also want to have it as widely read in our community as possible.

Today is Kaitlin Thaney‘s last day at CC. She’s been working for us on the Science Commons project for a long time – starting part time in mid 2006, full time in early 2007 – and she’s been an absolutely essential part of our success over the years.

I first met Kaitlin because she was interning, while finishing at Northeastern, for a joint MIT-Microsoft project called iCampus. She started showing up at science talks and asking good questions, and I poached her so we could have her help us with our first Science Commons international data sharing conference, held at the US National Academies. Here are her hands organizing nametags that day.

From the get-go, she’s been an incredible employee. She has taken on every task without question, and shown a remarkable level of skill and savvy and capacity, moving from the boring (nagging me on projects) to the remarkable (working with the Polar Information Commons to get their data towards the public domain) to the ridiculous (finding ironically appropriate plush kidney toys to give to our counsel). She’s also become a dear friend, all the way to flying to Brazil to be a part of my wedding last year.

Kaitlin leaves us for a remarkable opportunity in online science that I am not going to describe in detail here. She’ll be speaking for herself on this topic later in July. All I know is that we, as a group, will miss her, and that I as an individual will miss her too. It’s a good move she’s making, and I wish her all the best. You should follow her on twitter and subscribe to her blog.

Thank you, KT. You’ve been a linchpin.


Brains Open Access Initiative


An old tradition and a new technology have converged to make possible an unprecedented good. The old tradition is the willingness of scientists and scholars to publish the fruits of their research – coming from their brains – in scholarly journals without payment, for the sake of inquiry and knowledge. The problem with this approach is that the brains are not exposed, just the thoughts, and that the brains available have been those physically accessible, such as those at the local university. Thus, those desiring to gain new knowledge through the consumption of peer-reviewed brains have been restricted in their capacity by physical and economic realities.

The new technology that changes everything is the low-cost economy airline. The public good the low-cost airline makes possible is the world-wide distribution of peer-reviewed brains and completely free and unrestricted access to them by all scientists, scholars, teachers, students, and anyone hungry for brains. It is also now possible to visit new areas and taste brains from multiple disciplines, multiple nationalities, and in multiple cuisines.

Removing access barriers to these brains will accelerate satiety, enrich custards, share the brains of the rich with the poor and the poor with the rich, and lay the foundation for uniting humanity in a common intellectual conversation and quest for good brains recipes.

For various reasons, this kind of free and unrestricted availability, which we will call open access, has so far been limited to small portions of the world’s brains. But even in these limited collections, many different initiatives have shown that open access to brains is economically feasible, that it gives us extraordinary power to find and make use of relevant brains, and that it gives brains and their works vast and measurable new visibility, readership, impact, and fresh, seasonal preparations. To secure these benefits for all, we call on all interested institutions and individuals to help open up access to the rest of these brains and remove the barriers, especially the price barriers, that stand in the way. The more who join the effort to advance this cause, the sooner we will all enjoy the benefits of open access to brains.

The brains that should be freely accessible online are those which scholars give to the world without expectation of payment. Primarily, this category encompasses their peer-reviewed academic brains, but it also includes any unreviewed child brains that they might wish to expose for comment or to alert colleagues to tasty young brains. There are many degrees and kinds of wider and easier access to these brains. By “open access” to brains, we mean their free availability via the public airport system, permitting any users to bake, saute, fry, sear, grill, slow cook under pressure, or use as an ingredient in a savory tart, to muddle them for soup, pass them as basis for stock, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the air system itself. The only constraint on cooking and distribution, and the only role for property rights in this domain, should be to give chefs control over the integrity of their work and the right to be properly acknowledged and cited.

While the peer-reviewed brains should be accessible without cost to eaters, brains are not costless to produce. However, experiments show that the overall costs of providing open access to brains are far lower than the costs of traditional forms of dissemination. With such an opportunity to save money and expand the scope of dissemination at the same time, there is today a strong incentive for professional associations, restaurants, grocery stores, Costco, and others to embrace open access as a means of advancing their missions. Achieving open access will require new cost recovery models and financing mechanisms, but the significantly lower overall cost of dissemination is a reason to be confident that the goal is attainable and not merely preferable or utopian.

To achieve open access to scholarly brains, we recommend two complementary strategies. 

I.  Self-Archiving: First, scholars need the tools and assistance to deposit their brains in open archives, a practice commonly called self-archiving. When these archives conform to standards created by the Open Archives Initiative, then search engines and other tools can treat the separate archives as one. Users then need not know which archives exist or where they are located in order to find and make use of their brains.

II. Open-Access Brains: Second, scholars need the means to launch a new generation of brains committed to open access, and to help existing brains that elect to make the transition to open access. Because brains should be disseminated as widely as possible, these new brains will no longer invoke physical property rights to restrict access to and use of the grey matter, white matter, and surrounding fluids. Instead they will use property rights and other tools to ensure permanent open access to all the brains they review. Because price is a barrier to access, these new brains will not charge subscription or access fees, and will turn to other methods for covering their expenses. There are many alternative sources of funds for this purpose, including the foundations and governments that fund procreation, the universities and laboratories that possess students inclined to make more brains, endowments set up by discipline or institution, friends of the cause of open access, profits from the sale of add-ons to the basic brains, funds freed up by the demise or cancellation of brains charging traditional subscription or access fees, or even contributions from the researchers themselves. There is no need to favor one of these solutions over the others for all disciplines or nations, and no need to stop looking for other, creative alternatives.

Open access to peer-reviewed brains is the goal. Self-archiving (I.) and a new generation of open-access brains (II.) are the ways to attain this goal. They are not only direct and effective means to this end, they are within the reach of scholars themselves, immediately, and need not wait on changes brought about by markets or legislation. While we endorse the two strategies just outlined, we also encourage experimentation with further ways to make the transition from the present methods of brain dissemination to open access. Flexibility, experimentation, and adaptation to local circumstances are the best ways to assure that progress in diverse settings will be rapid, secure, and mouthwatering.

The Open Brain Institute, the foundation network founded by philanthropist George Romero, is committed to providing initial help and funding to realize this goal. It will use its resources and influence to extend and promote institutional self-archiving, to launch new open-access brains, and to help an open-access brain system become economically self-sustaining. While the Open Brain Institute’s commitment and resources are substantial, this initiative is very much in need of other organizations to lend their effort and resources.

We invite governments, restaurants, grocers, cooking shows, home cooks, learned societies, professional associations, and individual scholars who share our vision to join us in the task of removing the barriers to open access and building a future in which brains in every part of the world are that much more free to poach in a light cream sauce.

(based on, and with apologies to, the Budapest Open Access Initiative)
(H/T to Joseph Hewitt and Ataraxia Theater for the wicked cool zombie image!)
(for some really open access brain stuff, check out the Neurocommons.)


Open Data and Creative Commons: It’s About Scale…

As part of the series of posts reflecting on the move of Science Commons to Creative Commons HQ, I’m writing today on Open Data.

I was inspired to start the series with open data by GSK’s remarkable contribution to the public domain of more than 13,000 compounds known to be active against malaria. GSK was the first large corporation to implement the CC0 tool for making data into open data. CC0 is the culmination of years of work at Creative Commons, and the story’s going to require at least two posts to tell…

Opening up data was a founding aspect of the Science Commons project at CC. I came to the Creative Commons family after spending six years mucking about in scientific data, first trying to make public databases more valuable at my startup Incellico, and later at the World Wide Web Consortium (W3C) where I helped launch the interest group on the semantic web for life sciences. When I left the W3C in late 2004, data was my biggest passion – and it remains a driving focus of everything we do at Creative Commons.

Data is tremendously powerful. If you haven’t read the Halevy, Norvig & Pereira article on the unreasonable effectiveness of data, go do so, then come back here. It’s essential subtext for what we do at Creative Commons on open data. But suffice to say that with enough data, a lot of problems become tractable that were not tractable before.

Perhaps most important in the sciences, data lets us build and test models. Models of disease, models of climate, models of complex interactions. And as we move from a world in which we analyze at a local scale to one where we analyze at global scale, the interoperability of data starts to be an absolutely essential pre-condition to successful movement across scales. Models rest on top of lots of data that wasn’t necessarily collected to support any given model, and scalable models are the key to understanding and intervening in complex systems. Sage Bionetworks is a great example of the power of models, and the Sage Commons Congress a great example of leveraging the open world to achieve scale.

Building the right model, and responding to the available data, is the difference between thinking we have 100,000 genes or 20,000. Between thinking carbon is a big deal in our climate or not. And scale is at the heart of using models. Relying on our brains to manage data doesn’t scale. Models – the right ones – do scale.

My father (yeah, being data-driven runs in the family!) has done years of important work on the importance of scale that I strongly recommend. His work relates to climate change and climate change adaptation, but it applies equally to most of the complex, massive-scale science out there today. Scale – and integration – are absolutely essential aspects of data, and it is only by reasoning backward from the requirements imposed by scale and integration that we are likely to arrive at the right use cases and tasks for the present day, whether that be technical choices or legal choices about data.

We chose a twin-barreled strategy at Creative Commons for open data.

First, the semantic web was going to be the technical platform. This wasn’t in a belief that somehow the semantic web would create a Star Trek world in which one could declaim “computer, find me a drug!” and get a compound synthesized. We instead arrived at the semantic web by working backward from the goal of having databases interoperate the way the Web interoperates, where a single query into a search engine would yield results from across tens of thousands of databases, whether or not those databases were designed to work together.

We also wanted, from the start, to make it easy to integrate the web of data and the scholarly literature, because it seemed crazy that articles based on data were segregated away from the data itself. The semantic web was the only option that served both of those tasks, so it was an easy choice – it supports models, and it’s a technology that can scale alongside its success.

The second barrel of the strategy was legal. The law wasn’t, and isn’t, the most important part of open data – the technical issues are far more problematic, given that I could send out a stack of paper-based data with no legal restraints and it’d still be useless.

But dealing with the law is an essential first step. We researched the issue for more than two years, examining the application of Creative Commons copyright licenses for open data, the potential to use the wildly varied and weird national copyright regimes or sui generis regimes for data, the potential utility of applying a contract regime (like we did in our materials transfer work), and more. Lawyers and law professors, technologists and programmers, scientists and commons advocates all contributed.

In our search for the right legal solution for open data, we held conferences at the US National Academy of Science, informal study sessions at three international iSummit conferences, and finally a major international workshop at the Sorbonne. We drew in astronomers, anthropologists, physicists, genomicists, chemists, social scientists, librarians, university tech transfer offices, funding agencies, governments, scientific publishers, and more. We heard a lot of opinions, and we saw a pattern emerge. The successful projects that scaled – like the International Virtual Observatory Alliance or the Human Genome Project – used the public domain, not licenses, for their data, and managed conflicts with norms, not with law.

We also ran a technology-driven experiment. We decided to try to integrate hundreds of life science data resources using the Linux approach as our metaphor, where each database was actually a database package and integration was the payoff. We painstakingly converted resource after resource to RDF/OWL packages. We wrote the software that wires them all together into a single triple store. We exposed the endpoint for SPARQL queries. And we made the whole thing available for free.
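The shape of that experiment – many independently converted resources loaded into one triple store that answers a single query spanning all of them – can be sketched in miniature. This is a toy Python illustration with invented data and URIs, not the actual RDF/OWL and SPARQL stack we built:

```python
# Toy illustration of wiring several converted "database packages" into
# one triple store and answering a single query across all of them.
# All data and URIs here are invented for the example.

def match(store, pattern):
    """Return triples matching an (s, p, o) pattern; None is a wildcard."""
    s, p, o = pattern
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Two independently produced resources, already converted to triples.
gene_db = [
    ("ex:gene/BDNF", "ex:expressedIn", "ex:tissue/hippocampus"),
    ("ex:gene/SOD1", "ex:expressedIn", "ex:tissue/motor-neuron"),
]
paper_db = [
    ("ex:pubmed/123", "ex:mentionsGene", "ex:gene/BDNF"),
]

# "Integration" is just loading both packages into one store.
store = gene_db + paper_db

# A question neither source can answer alone: which genes mentioned in
# the literature are expressed in the hippocampus?
results = []
for _, _, gene in match(store, (None, "ex:mentionsGene", None)):
    if match(store, (gene, "ex:expressedIn", "ex:tissue/hippocampus")):
        results.append(gene)

print(results)  # ['ex:gene/BDNF']
```

The payoff in the real system is the same as in the toy: once everything lives in one store and shares identifiers, a single query joins across databases that were never designed to work together.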

As part of this, we had to get involved in the OWL 2 working group, the W3C’s technical architecture group, and more. We had to solve very hairy problems about data formats. We even had to develop a new set of theories about how to promote shared URIs for data objects.
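The shared-URI problem is easy to see concretely: when two resources name the same object differently, a join across them silently comes up empty, and one fix is a canonicalization pass before loading. The identifiers and alias table in this Python sketch are invented for illustration:

```python
# Why shared URIs for data objects matter: two resources naming the same
# gene differently produce no join until identifiers are canonicalized.
# The URIs and the equivalence mapping are hypothetical.

ALIASES = {
    "ncbi:gene/627": "ex:gene/BDNF",  # assumed equivalence, for the example
}

def canonicalize(triples):
    """Rewrite every known alias to its canonical URI before loading."""
    def fix(term):
        return ALIASES.get(term, term)
    return [(fix(s), p, fix(o)) for s, p, o in triples]

source_a = [("ex:gene/BDNF", "ex:expressedIn", "ex:tissue/hippocampus")]
source_b = [("ex:pubmed/123", "ex:mentionsGene", "ncbi:gene/627")]

# Naive merge: the two BDNF records never meet.
naive = source_a + source_b
joined_naive = [s for s, p, o in naive
                if p == "ex:mentionsGene" and o == "ex:gene/BDNF"]

# Canonicalized merge: the join works.
store = canonicalize(source_a) + canonicalize(source_b)
joined = [s for s, p, o in store
          if p == "ex:mentionsGene" and o == "ex:gene/BDNF"]

print(joined_naive, joined)  # [] ['ex:pubmed/123']
```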

Like I said, the technology was a hell of a lot hairier than the law. But it worked. We get more than 37,000 hits a day on the query endpoint. There are at least 20 full mirrors in the wild that we know of. It’s beginning to scale.

But because of the law, we also had to eliminate good databases with funky legal code, including funky legal code meant to foster more sharing. We learned pretty quickly that the first thing you do with those resources is throw them away, even when the licensing was well-intentioned. The technical work is just too hard. Adding legal complexity to the system made the work intolerable. When you actually try to build on open data, you learn quickly that unintended use tends to rub up against legal constraints of any kind, share-alike as much as commercial.

We would never have learned this lesson without actually getting deep into the re-use of open databases ourselves. Theory, in this case, truly needed to be informed by practice.

What we learned, first and foremost, was that the combination of truly open data and semantic web supports the use of that data at Web scale. It’s not about open spreadsheets, or open databases hither and yon. It’s not about posting tarballs to one’s personal lab page. Those are laudable activities, but they don’t scale. Nor does applying licenses to the data that impose constraints on downstream use, because the vast majority of uses of data aren’t yet known. And today’s legal code might well prevent them. Remember that the fundamental early character of the Web was public domain, not copyleft. Fundamental stuff needs fundamental treatment.

And data is fundamental. We can’t treat it, technically or legally, like higher level knowledge products if we want it to serve that fundamental role. The vast majority of it is in fact not “knowledge” – it is the foundation upon which knowledge is built, by analysis, by modeling, by integration into other data and data structures. And we need to begin thinking of data as foundation, as infrastructure, as a truly public good, if we are to make the move towards a web of data, a web that supports models, a world in which data is useful at scale.

I’ll return to the topic in my next post to outline exactly how the Creative Commons toolkit – legal, technical, social – serves the Open Data community.


On Science Commons’ Moving West…

I’ve kept this blog quiet lately – for a wide range of reasons – but a few questions that have come in have prompted me to start up a new series of posts.

The main reason for the lack of posts around here is that I’ve been very busy, and for the most part, I’ve used this blog for a lot of lengthy posts on weighty topics. At least, weighty to me. If you want a more informal channel, you can follow me on twitter, as I prefer tweeting links and midstream thoughts to rapid-fire short blog entries. The joy of a blog like this for me is the chance to explore subjects in greater depth. But it also means that during times of extreme hecticness, I won’t publish here as much.

Anyhow. I’ve been busy with a pretty big task, which is getting me, my family, and the Science Commons operation moved from Boston to San Francisco. We’re moving from our longtime headquarters at MIT into the main Creative Commons offices, and it’s a pretty complex set of logistics on both personal and professional levels.

As an aside, I’m now very close to some downright amazing chicken and waffles, and that’s exciting.

Now, I would have thought the world would read this move the way I do: we Science Commons folks are, and have always been, part and parcel of the Creative Commons team, so the move didn’t strike me as super-important unless you’re one of the people who actually has to move. If you email us, our addresses end with @creativecommons.org. That’s where our paychecks come from. So having us integrate into the headquarters offices doesn’t seem such a big deal. But I keep getting rumbles that people think we’re somehow “going away” or “disappearing” – that’s why there’s going to be a series of posts on the move and its implications.

So let me be as blunt as possible: Science at Creative Commons, and the work we do at the Science Commons project, isn’t going anywhere. We are only going to be intensifying our work, actually. You can expect some major announcements in the fall about some major new projects, and you’ll learn a lot about the strategic direction we plan to take then. I can’t talk about it all yet, because not all the moving pieces are settled, but suffice to say the plans are both Big and Exciting. We’ve already added a staff member – Lisa Green – who is both a Real Scientist and experienced in Bay Area science business development, to help us realize those plans.

Our commitments and work over the past six years of operations aren’t going anywhere either. We will continue to be active, vocal, and visible proponents of open access and open data. We will continue to work on making biological materials transfer, and technology transfer, a sane and transparent process. And our commitment to the semantic web – both in terms of its underlying standards and in terms of keeping the Neurocommons up and running – is a permanent one.

You can catch up with our achievements in later posts, or follow our quarterly dispatches. We get a lot of stuff done for a group of six people, and that’s not going to change either.

Some things *are* likely to change. For example, I don’t like the Neurocommons name for that project much any more – it’s far more than neuroscience in terms of the RDF we distribute, and the RDFHerd software will wire together any kind of database that’s formatted correctly. But those changes are changes of branding, not of substance in terms of the work.

It is, however, now time to get our work and the powerful engine that is the Creative Commons headquarters together. I’m tired of seeing the fantastic folks that I work with twice a year. We’re missing a ton of opportunities to bring together knowledge in the HQ – especially around RDFa and metadata for things like scholarly norms – by being physically separated. Not to mention that the San Francisco Bay Area is perhaps the greatest place on earth to meet the people who change the world, every day, through technology.

I’m also tired of living on the road. I’m nowhere near Larry Lessig and Joi Ito in terms of my travel, but I’m closing in on ten years of at least 150,000 miles a year in airplanes. It gets old. Most of our key projects at this point are on the west coast, like Sage Bionetworks and the Creative Commons patent licenses, and we’re developing a major new project in energy data that is going to be centered in the Bay Area as well. The move gives me the advantage of being able to support those projects, which are much more vital to the long term growth of open science than conference engagements, without 12 hours of roundtrip plane flights.

I’ll be looking back at the past years of work in Boston over the coming weeks here. I’m in a reflective mood and it’s a story that needs to be told. We’ve learned a lot, and we’ve had some real successes. And we’re not abandoning a single inch of the ground that we’ve gained in those years. So if you hear tell that we’re disappearing or going away, kindly point them here and let them know they will have us around for quite some time into the future…


Open Hardware

Creative Commons was fortunate enough to be involved in a fascinating workshop last week in New York on Open Hardware. Video is at the link, photos below.

The background is that I met Ayah Bdeir at the Global Entrepreneurship Week festivities in Beirut, and we started talking about her LittleBits project (which is, crudely, like Legos for electronics assembly – even someone as spatially impaired as me could build a microphone or pressure sensor in minutes).

Ayah introduced me to the whole open hardware (OH) world and asked a lot of very good, hard to answer questions about how to use CC in the context of OH. It became clear that a lot of the people involved in the movement didn’t have a clear grasp of how the various layers of intellectual property might or might not apply.

Ayah suggested in February that we put together a little workshop – almost a teach-in – around a meeting of Arduino advocates happening in NYC on the 18-19 of March. In a matter of three weeks, we got representatives from a bunch of major players to commit: Arduino (the world’s largest open hardware platform), BugLabs, Adafruit, Chumby, Make magazine, even Chris Anderson. Mako Hill from the Free Software Foundation came, and @rejon made it there at the last minute too, wearing his Openmoko and Qi Hardware hats. Eyebeam hosted it for free, and we picked up the snacks and cheese trays.

I gave a very short intro laying out how the science commons project @ creative commons has spent a lot of time looking at IPRs as a layered problem, dealing with it at data levels, materials levels, and patent levels, as well as the fact-idea-expression relationships in science. This was to create some context for why we might have interesting ideas.

Thinh proceeded to deliver a masterful lecture on IP that went on for hours, though intended to be 30 minutes. It was an interactive, give-and-take, wonderful session to watch, ranging from copyrights to mask works to trade secrets to trademarks and patents. The folks there liked it enough to suspend the break period after five minutes and dive back into IP.

After that we had a lengthy interactive session driven by the OH folks in which they tried to decide what a declaration of principles might look like, how detailed to get, how to engage in existing efforts to do similar things (like OHANDA), the role of the publishers like Wired and Make to support definitions of open hardware, and how open one had to be in order to be open.

There was no formal outcome at the close of business, but I expect a declaration or statement of some sort to emerge (akin to the Budapest Declaration on Open Access from my own world of scholarly publishing). There’s clearly a lot of work to be done. And the reality is that copyrights and patents and trademarks and norms and software and hardware are going to be hard to reconcile into a simple, single license that “makes copyleft hardware” a reality. But it was fun to be in a room with so many passionate, brilliant people who want to make the world a better place through collaborative research.

More to come once results emerge…
