Free the Data!


Free the Data! panel: Daniel Weitzner (W3C & MIT), Daniel Harris (Kendra), and Jeremy Frey (Univ. of Southampton)

A specially arranged panel session called Freeing the Data was moderated by Kieron O'Hara (Univ. of Southampton). On the panel were Daniel Weitzner (W3C & MIT), Daniel Harris (Kendra), and Jeremy Frey (Univ. of Southampton).

Jeremy Frey is a chemist, and he took the position that any scientist doing research should make not only the results available but the data as well. Making the data available isn't enough, though. We need to make it findable too. Moreover, the context in which the data was produced needs to be available and machine readable.

Another issue with data is correctness. Published papers have greater trust because they've been peer-reviewed. What about the data? There are at least two issues. First, the information about how the data was created must be preserved. Second, data must be versioned so that, as it changes, reference can be made to the data set as it stood at a particular time. At least for scientific data, we should move to a position where you pay for privacy rather than for publicity.
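To make the two requirements concrete, here is a minimal sketch of a data set that keeps provenance with every snapshot and can be cited as of a given time. Nothing like this was shown at the panel; the class and field names (`Version`, `VersionedDataset`, `commit`, `as_of`) and the sample readings are assumptions made up for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Version:
    timestamp: datetime  # when this snapshot was recorded
    records: tuple       # immutable copy of the data at that time
    provenance: str      # how the data was created (method, instrument, ...)


@dataclass
class VersionedDataset:
    versions: list = field(default_factory=list)

    def commit(self, records, provenance, timestamp):
        """Append a new snapshot; timestamps must be strictly increasing."""
        if self.versions and timestamp <= self.versions[-1].timestamp:
            raise ValueError("timestamps must increase")
        self.versions.append(Version(timestamp, tuple(records), provenance))

    def as_of(self, when):
        """Return the latest snapshot recorded at or before `when`."""
        times = [v.timestamp for v in self.versions]
        i = bisect_right(times, when)
        if i == 0:
            raise LookupError("no version exists that early")
        return self.versions[i - 1]


# Hypothetical usage: two snapshots, then a citation pinned to a point in time.
ds = VersionedDataset()
ds.commit([1.02, 0.98], "hypothetical run A",
          datetime(2006, 5, 1, tzinfo=timezone.utc))
ds.commit([1.02, 0.98, 1.01], "hypothetical run A, one extra sample",
          datetime(2006, 5, 2, tzinfo=timezone.utc))
cited = ds.as_of(datetime(2006, 5, 1, 12, tzinfo=timezone.utc))
```

Storing whole snapshots keeps the sketch short, but it is the simplest possible design and would not scale to data that changes constantly.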

Harris took the opposing view: authors should be able to choose whether or not to free their data. What's the aim? Stabilization or destabilization? Are we talking about free as in access or free as in cost? What do we want? Legislation? Arguments? Tools? We have to decide.

Weitzner claimed that freeing data wasn't just about freedom, but about re-use. Beyond making data available, we must:

  • Re-use structures including schemas and ontologies. It's more important to use well-understood structures than to use any particular idiom.
  • Re-use the licenses that have already been developed. Licensing metadata (à la Creative Commons) is also important.
  • Enable re-use of ideas (contrasted with the expression of the idea). We have to find the proper scope of 'derivative works' and re-examine the issue of database copyright. Shockingly, copying the bibliographic data from a work (for purposes of citation) can be seen as a violation of some licenses.
  • Attach policy information that says how the information can be used. Some experimental data depends critically on personally identifying information. Anonymization is hard: it either doesn't work well or is at odds with the underlying research purpose of the data.
  • Use open standards.
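As a rough illustration of the licensing and policy points in the list above, the following sketch attaches machine-readable license and usage information to a data set and serializes it as JSON. It is not anything Weitzner presented: the `DataDescriptor` fields, the `may_use` helper, the schema URL, and the example values are all hypothetical, and a real deployment would use an established metadata vocabulary rather than ad hoc field names.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DataDescriptor:
    title: str
    schema: str                   # identifier of a shared, well-understood schema
    license_url: str              # pointer to an existing license
    contains_personal_data: bool  # flags data that needs extra care
    permitted_uses: tuple         # purposes the publisher allows


def may_use(descriptor, purpose):
    """Check a proposed use against the attached policy."""
    return purpose in descriptor.permitted_uses


# Hypothetical descriptor; every value here is an illustrative assumption.
descriptor = DataDescriptor(
    title="Example spectroscopy readings",
    schema="https://example.org/schemas/spectroscopy",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    contains_personal_data=False,
    permitted_uses=("research", "teaching"),
)

# Open, text-based serialization so other tools can read the policy.
serialized = json.dumps(asdict(descriptor), indent=2)
```

Because the policy travels with the data as plain JSON, a consumer can check `may_use(descriptor, "research")` before re-using it.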

Someone from the audience gave an indication of the scale of the changing-data problem. She said they make 5000 changes to their data per day. Traditional versioning systems won't work if changes really are that frequent and that fine-grained.

Harris mentioned Brin's The Transparent Society as a good reference on the inherent conflict between freely available data and privacy.

Weitzner said we shouldn't make the mistake of believing that we can publish information to one (fairly large) group of people and keep it away from billions of others. That won't work and will end up harming us.