Monday, January 31, 2011

The What, Where and How of Open Data

Last week I attended a seminar at the Cathie Marsh Centre for Census and Survey Research, given by Rufus Pollock of the Open Knowledge Foundation (OKFN) on the topic of "open data".

Rufus started by showing two example applications built using open data. Yourtopia makes use of data from the World Bank measuring individual nations' progress towards the Millennium Development Goals. Visitors to the site balance the relative importance of different factors (for example, "health", "economy" and "education"), and their preferences are matched against the data in order to suggest which country meets them most closely. Where Does My Money Go? offers various breakdowns of UK government spending and presents these in a way that allows the site visitor to see (for example) how much of the tax they pay is used for things such as defence, environment, culture and so on.
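
As a rough illustration of the kind of weighted matching Yourtopia performs, here's a minimal Python sketch - the country names, indicator scores and scoring scheme below are invented for demonstration, not taken from the site itself:

# Normalised indicator scores per country (0.0 = worst, 1.0 = best);
# these values are made up for the example.
countries = {
    "Country A": {"health": 0.9, "economy": 0.4, "education": 0.7},
    "Country B": {"health": 0.6, "economy": 0.8, "education": 0.5},
    "Country C": {"health": 0.3, "economy": 0.9, "education": 0.9},
}

def best_match(weights):
    """Return the country whose indicators best fit the visitor's weights."""
    total = float(sum(weights.values()))
    def score(indicators):
        return sum(weights[k] * indicators[k] for k in weights) / total
    return max(countries, key=lambda name: score(countries[name]))

# A visitor who rates "health" three times as important as the rest:
print(best_match({"health": 3, "economy": 1, "education": 1}))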

Both sites are eye-catching and fun (and can provide some surprising insights), while at the same time serving more serious purposes. In the context of the seminar Rufus noted that building the two sites also highlighted some key issues when working with these kinds of datasets:
  • Completeness: i.e. the data are not always complete
  • Correctness: i.e. the data are not always correct
  • Ease-of-use: it can take a lot of effort to put the data into a format where it can actually be used - for example, an estimated 90% of the development time for Where Does My Money Go? went on preparing the data, as opposed to 10% on actually building the site (the sketch after this list gives a flavour of that work)
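
To give a flavour of that clean-up work, here's a small hypothetical example - the spending figures below are invented, but the inconsistencies are typical of raw data:

# Invented "millions of pounds" figures with typical inconsistencies.
raw = ["£1,234.5m", "987.0 m", "n/a", "£2,001m"]

def parse_millions(value):
    """Convert a messy 'millions of pounds' string to a float, or None."""
    cleaned = value.replace("£", "").replace(",", "").replace("m", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None  # flag for manual follow-up rather than guessing

print([parse_millions(v) for v in raw])
# [1234.5, 987.0, None, 2001.0]
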
These issues can largely be mitigated by "open data", which has two key characteristics:
  • Legal openness: the data must be provided with a licence that allows anyone to use, reuse and redistribute the data, for any purpose. ("Reuse" in this context can include combining it with other datasets and redistributing the result.) An explicit open licence is required (such as those offered at Open Data Commons) because the default legal position for any data - even data posted "openly" on the web - doesn't entitle anyone else to reuse or redistribute it.
  • Technical openness: the data should be easy to access and work with; in particular, it should be possible to obtain the data in bulk, in a machine-readable, open format. These are prerequisites for the data to be useful in a practical sense: for example, it's not sufficient to provide the data via a website that only returns subsets of that data via a form submission (the sketch below illustrates the difference).
(See the official definition at http://www.opendefinition.org/.)
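
To illustrate why this matters, suppose a publisher offers its spending data as a single bulk CSV download (the file and column names here are assumptions for the example). Summarising the whole dataset then takes only a few lines of Python, with no screen-scraping of web forms required:

import csv

# Assumed bulk download: spending.csv with "department" and "amount" columns.
totals = {}
with open("spending.csv") as f:
    for row in csv.DictReader(f):
        dept = row["department"]
        totals[dept] = totals.get(dept, 0.0) + float(row["amount"])

for dept, amount in sorted(totals.items()):
    print(dept, amount)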

The data itself can be about almost anything: geographical (for example, mapping postcodes to a latitude and longitude), statistical, electoral, legal, financial - the OKFN's CKAN (Comprehensive Knowledge Archive Network) site has many examples. The key point is that the data should not be personal - that is, it shouldn't enable individuals to be identified, either directly or indirectly.
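
For instance, once a postcode dataset is available in an open, machine-readable form it is effectively just a lookup table; here's a toy sketch (the coordinates are approximate and for illustration only):

# Toy postcode-to-coordinates lookup; the values are illustrative.
postcode_coords = {
    "M13 9PL": (53.467, -2.234),
    "SW1A 1AA": (51.501, -0.142),
}

lat, lon = postcode_coords["M13 9PL"]
print("latitude %.3f, longitude %.3f" % (lat, lon))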

The motivation for making data open goes back to the initial issues of completeness, correctness and ease-of-use - it can take a lot of time to assemble a dataset (for example, the Government already collects a lot of data), but once the effort has been made the added cost of releasing it is small, and sharing it reduces the cost of merging, filling gaps and correcting errors. To make an analogy with open source software, it's essentially Linus's Law for data: "given enough eyeballs, all bugs are shallow". Rufus also talked about a corollary to this, the "many minds" principle: the best use of the data you produce will probably be thought of by someone else (and vice versa).

One argument against openness is that it precludes the possibility of commercial exploitation to offset the costs of compiling the data - a topical point given the current economic climate. Rufus's counter-argument is that there are many other ways to fund the creation of data aside from making it proprietary: by considering the data as a platform (rather than as a product), and building on that platform to sell extensions or complementary services (such as consultancy - again there are parallels with open source software). (Some of the audience also expressed concerns that, in principle at least, open data might be used irresponsibly - but arguably if the data is available to all then others can challenge any irresponsible interpretation.)

The final point that Rufus's talk addressed was how to actually build the open data ecosystem. To some degree it's up to the people who hold the data, but his suggestions were:
  • Start small and simple (which I took to mean, start with small sets of data rather than doing everything all at once).
  • If you're using someone else's dataset then you can make an enquiry via the OKFN website to find out what the licensing situation is.
  • If you have your own datasets then put them under an open data licence and register them at CKAN so that others can find them.
  • "Componentize" your data to make it easier to reuse (which I took to mean, divide the datasets up into sensible subsets).
  • Make the case with whoever holds the data you want (government, business etc) to release it openly.
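
As a sketch of what "componentizing" might look like in practice, here's one large (assumed) spending.csv being split into a file per year, so that reusers can take just the slice they need - the file and column names are illustrative:

import csv
from collections import defaultdict

# Group the rows of the (assumed) master file by year...
rows_by_year = defaultdict(list)
with open("spending.csv") as f:
    reader = csv.DictReader(f)
    fields = reader.fieldnames
    for row in reader:
        rows_by_year[row["year"]].append(row)

# ...then write each year's rows out as a separate, smaller component.
for year, rows in rows_by_year.items():
    with open("spending-%s.csv" % year, "w") as out:
        writer = csv.DictWriter(out, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
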
For me as a "lay person", this was a fascinating introduction to the world of open data. Not unreasonably, the seminar didn't go into the details of actually working with such data (I think many of the audience members were researchers already familiar with the available tools). However, afterwards Rufus made the point that writing a paragraph of text after looking at the data is just as valid an output as the slick visualisations provided by Where Does My Money Go? and other sites. Ultimately it's having open access to the data in the first place that counts.

Sunday, January 23, 2011

Python North-West: The Python Challenge

Last week I went to my first-ever Python North-West meeting, at the Manchester Digital Laboratory (aka MadLab). The webpage describes it as a "user group for Pythoneers and Pythonistas of all levels and ages, open to everyone coding 'the way Guido indented it'", and meetings alternate between talks and coding dojos (group coding sessions where people get to share code and ideas with the aim of improving their knowledge and skills - see http://codingdojo.org/cgi-bin/wiki.pl?CodingDojo for more information).

This particular meeting was a coding dojo and so as a group we worked through The Python Challenge (http://www.pythonchallenge.com/), which is a series of puzzles that can be solved using Python programming combined with some imagination and lateral thinking. While most people had come with their own laptops, the format that developed was for one person to "drive" the laptop connected to the overhead projector, typing in code and taking suggestions from the others.

Although I'd already looked at the first two challenges earlier in the day to get an idea of what was involved, the group setting provided a great opportunity to see how other people worked, and to learn about bits of Python that I was unfamiliar with - one example for me was being introduced to list comprehensions, which provide a concise way to build lists, e.g.:

>>> vec = [2, 4, 6]
>>> [[x, x**2] for x in vec]
[[2, 4], [4, 16], [6, 36]]
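
which is shorthand for building the same list with an explicit loop:

>>> result = []
>>> for x in vec:
...     result.append([x, x**2])
...
>>> result
[[2, 4], [4, 16], [6, 36]]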

(although there were several other examples which I won't write about here so as not to spoil the challenges for others). Also, as many of the challenges began with having to figure out what the programming problem actually was, it meant that collectively we didn't get stuck for too long on any particular puzzle - I know that at least a couple would have had me completely stumped if I'd been on my own. For me personally it was also an opportunity to play with IDLE - Python's bundled IDE - under Windows (not an environment that I've used much in the past, but quite handy for this kind of exploratory programming).

Overall it was great to get out and interact with other Python developers in an enthusiastic and friendly atmosphere, while at the same time broadening my knowledge of the language - and now I've had a taste I'll definitely be back for future meetings.