We need a Wikipedia for data

April 9, 2008

I just started blogging. I am not sure what I want to write about, but I think one theme will be "things I want but want someone else to build." This article describes one of those things.

At Google, I worked on a number of projects that required data from third-party data sources. We licensed mapping data for hundreds of countries for Google Maps, movie showtimes data for Google Movies, and stock data for Google Finance, among many others.

After leaving Google and the company of the Google BizDev team, I have come to realize how hard it is for an everyday programmer to get access to even the most basic factual data. If you want to experiment with a new driving directions algorithm, it is infinitely more difficult than coming up with an algorithm; you have to hire a lawyer and sign a contract with a company that collects that data in the country you are developing for. If you want to write an open source TiVo competitor, you need television listings data for every cable provider in the country, but your options are tenuous at best. In July, the most popular "free" listings service shut down its site, breaking most MythTV installations. The CD database (which is used to recognize CD track names when you rip CDs on your computer) has gone through a number of controversial transitions and license changes for similar reasons.

Even when data is available under a reasonable license, it often suffers from extremely serious quality or discoverability problems. The US Census Bureau publishes map data, but it only includes a small subset of the attributes required for a real mapping product. The Reuters corpus, which is a standard body of text used in data mining and information retrieval research, requires you to sign two agreements, send them to some organization via snail mail, and get the corpus via snail mail on CDs (what century is this, folks?).

I think all of these barriers to data are holding back innovation at a scale that few people realize. The most important part of an environment that encourages innovation is low barriers to entry. The moment a contract and lawyers are involved, you inherently restrict the set of people who can work on a problem to well-funded companies with a profitable product. Likewise, companies that sell data have to protect their investments, so permitted uses for the data are almost always explicitly enumerated in contracts. The entire system is designed to restrict the data to be used in product categories that already exist.

Imagine what amazing applications would be created if every programmer in the world had free access to all of these data sets:

  • Map data for all countries in a relatively uniform data format
  • White pages data (names and addresses) for all cities of the world
  • Stock data for all major exchanges for all time
  • Movie showtimes data for all cities in the world
  • Television schedule data for all cities in the world
  • Sports scores and stats for all sports in the world for all time
  • Rich metadata for all musical albums and movies from all labels for all time

The interesting thing is, almost every internet company would benefit if this data were freely available. Most internet companies have embraced open source operating systems because every company needs an operating system, and no company wants their OS to be a competitive advantage - they just want it to work. I would argue we are all in the same boat with these factual data sources. No one really wants factual data accuracy and completeness to be their competitive advantage; we all want the best data possible to build the best products possible, and discrepancies in data quality are artifacts of the extremely inefficient economy of buying and selling data we currently live in. If everyone had the same, high quality data, all of our products would be better for it.

To this end, I think we should create a Wikipedia for data: a global database for all of these important data sources to which we all contribute and that anyone can use. When a user reports an inaccurate phone number in your products, save it back to the DataWiki so everyone can benefit, and in return, you get everyone else's improvements as well. If your local movie theater doesn't have listings data in DataWiki, you can type it in yourself, and everyone in your town can benefit, and all the products you use that access movie listings will automatically update. Need better mapping data for a city? Pay to collect it, and upload it to the DataWiki. In return you get all the other cities other companies paid for (sort of like a company contributing device drivers to the Linux kernel).

DataWiki seems like an extremely hard problem, and I don't think it would work unless some big companies got on board and donated their data sets to bootstrap the process. However, I think all companies would benefit almost immediately from the quality improvements that would come from openness. Some data sets are more expensive to collect than others, and those certainly seem like the hardest data sets to make freely available.

I have some concrete ideas on how this could work for some data sets, but I will save them for future posts. In the meantime, what are some of the most interesting existing projects attempting to open up these data sources? I only know of a few, and none of them has really taken off.

Update: Check out this great summary of the sites people have mentioned in the comments on ReadWriteWeb.

Experimenting with Google App Engine

April 8, 2008

Google App Engine was actually the last project I worked on before I left Google. I was the PM of the project when it started, but it has grown quite a bit since I left Google last June, and it now has many more engineers and a handful of extremely talented PMs. I was fortunate enough to see Kevin Gibbs and crew at Campfire One yesterday, and I could barely sit through the whole talk because I was so excited to play around with the system.

I have been "meaning to" start a blog for months. Blog software is extremely simple to implement, so I figured it would be a great app to test out on the new App Engine infrastructure. This blog runs on the code I wrote this evening.

The lack of SQL is actually refreshing. As in Django and many other frameworks, you declare your data types in Python:

from google.appengine.ext import db

class Entry(db.Model):
    """A single blog post stored in the App Engine datastore."""
    author = db.UserProperty()
    title = db.StringProperty(required=True)
    slug = db.StringProperty(required=True)
    body = db.TextProperty(required=True)
    published = db.DateTimeProperty(auto_now_add=True)
    updated = db.DateTimeProperty(auto_now=True)

I used a web framework we use at FriendFeed. It looks a lot like the webapp framework that ships with App Engine and web.py (which inspired both of them). It took virtually no effort to get it to work in App Engine thanks to App Engine's support for WSGI.
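
For readers who have not run into WSGI before, here is a minimal sketch of what the interface asks of an application. It is not the FriendFeed framework or App Engine code: `hello_app` and `call_app` are hypothetical names, and the sketch uses only the Python 3 standard library. The point is that any callable with this shape can be hosted by any WSGI server, which is why porting took so little effort.

```python
from wsgiref.util import setup_testing_defaults


def hello_app(environ, start_response):
    """A complete WSGI application: one callable taking environ and start_response."""
    body = ("Hello from " + environ["PATH_INFO"]).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]  # the response body is an iterable of byte strings


def call_app(app, path):
    """Invoke a WSGI app directly, the way a server would, and collect the response."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the keys a WSGI server must provide
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

Calling `call_app(hello_app, "/feed")` exercises the application without starting a server, which is also a handy way to unit test handlers.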

Running the application looks a lot like the App Engine examples:

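import wsgiref.handlers

# Map URL patterns to handlers; regex groups are passed to the handler method.
application = web.WSGIApplication([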
    (r"/", MainPageHandler),
    (r"/index", IndexHandler),
    (r"/feed", FeedHandler),
    (r"/entry/([^/]+)", EntryHandler),
])
wsgiref.handlers.CGIHandler().run(application)
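
The route table above pairs regular expressions with handlers. A rough sketch of how such a table can be dispatched is below. This is an illustration, not the framework's actual implementation: `dispatch` is a hypothetical helper, the handlers are plain strings standing in for the classes, and it assumes patterns must match the whole path.

```python
import re


def dispatch(routes, path):
    """Return (handler, captured_groups) for the first pattern matching the whole path."""
    for pattern, handler in routes:
        match = re.fullmatch(pattern, path)
        if match:
            return handler, match.groups()
    return None, ()  # no route matched; a real framework would answer 404


# Placeholder handler names standing in for the classes in the post.
routes = [
    (r"/", "MainPageHandler"),
    (r"/index", "IndexHandler"),
    (r"/feed", "FeedHandler"),
    (r"/entry/([^/]+)", "EntryHandler"),
]
```

With this table, `dispatch(routes, "/entry/hello-world")` yields the entry handler along with the captured slug, which is how a handler's `get(self, slug)` would receive its argument.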

Generating the front page is totally easy:

class MainPageHandler(web.RequestHandler):
    def get(self):
        entries = db.Query(Entry).order('-published').fetch(limit=5)
        self.render("main.html", entries=entries)

Generating the Atom feed is equally easy:

class FeedHandler(web.RequestHandler):
    def get(self):
        entries = db.Query(Entry).order('-published').fetch(limit=10)
        self.set_header("Content-Type", "application/atom+xml")
        self.render("atom.xml", entries=entries)

I wanted to use slugs in my URLs to entries to make them friendlier, so I had to do a query to look up entries for entry URLs:

class EntryHandler(web.RequestHandler):
    def get(self, slug):
        entry = db.Query(Entry).filter("slug =", slug).get()
        if not entry:
            raise web.HTTPError(404)
        self.render("entry.html", entry=entry)

I also needed security for adding/editing blog entries. App Engine lets you use Google's account system, which is nice for small apps like this. Likewise, it knows which users are "admins" for the app, so I decided to use this built-in role to handle security for the blog: only admins can add/edit entries. First, I wrote a decorator that will automatically add admin security to any RequestHandler method (redirecting to the login page if the user is not logged in):

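import functools

from google.appengine.api import users

def administrator(method):
    """Restrict a RequestHandler method to admin users of the app."""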
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        user = users.get_current_user()
        if not user:
            if self.request.method == "GET":
                self.redirect(users.create_login_url(self.request.uri))
                return
            raise web.HTTPError(403)
        elif not users.is_current_user_admin():
            raise web.HTTPError(403)
        else:
            return method(self, *args, **kwargs)
    return wrapper
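
The same decorator pattern can be tried outside App Engine. The sketch below is a stripped-down stand-in, not the code above: `Forbidden`, `require_admin`, and `FakeHandler` are hypothetical names, and an `is_admin` attribute replaces the real `users` API lookup so the example runs with nothing but the Python 3 standard library.

```python
import functools


class Forbidden(Exception):
    """Stand-in for the framework's HTTP 403 error."""


def require_admin(method):
    """Run the wrapped method only when the handler's user is an admin."""
    @functools.wraps(method)  # keep the wrapped method's name and docstring
    def wrapper(self, *args, **kwargs):
        if not getattr(self, "is_admin", False):
            raise Forbidden(method.__name__)
        return method(self, *args, **kwargs)
    return wrapper


class FakeHandler:
    """Hypothetical handler; is_admin replaces the real users API lookup."""

    def __init__(self, is_admin):
        self.is_admin = is_admin

    @require_admin
    def post(self, title):
        """Pretend to save an entry."""
        return "saved " + title
```

Because of `functools.wraps`, `FakeHandler.post.__name__` is still `"post"` rather than `"wrapper"`, which keeps tracebacks and introspection readable when many methods share one decorator.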

My edit handler looks like this:

class NewEntryHandler(web.RequestHandler):
    @administrator
    def get(self):
        self.render("new.html")

    @administrator
    def post(self):
        entry = Entry(
            author=users.get_current_user(),
            title=self.get_argument("title"),
            slug=self.get_argument("slug"),
            body=self.get_argument("body"),
        )
        entry.put()  # put() saves this instance; it takes no entity argument
        self.redirect("/entry/" + entry.slug)  # matches the /entry/<slug> route

I don't think this blog will ever get millions of page views, but it is pretty cool that it could in theory :) I didn't have to configure anything. I didn't need to make an account system to make an administrative section of the site. And the entire blog is less than 100 lines of code. I deployed by running a script, and I was done. No machines, no "apt-get install", no "sudo /etc/init.d/whatever restart", nothing.

I am impressed. The App Engine team has done a fantastic job, and I think they have already changed the way I do hobby projects.

The next logical question is: would I run a real business on infrastructure that is so different than everyone else's? If I change my mind about App Engine, what are my options? I am hoping a number of open source projects spring up as alternatives to lower the switching costs over the next year. I will be very interested to see how many startups take the leap and run on App Engine entirely in the meantime.
