Events!

Lots of venues publish events listings on their own sites, but few publish full RSS/Atom feeds and even fewer publish iCalendar feeds that aggregators can subscribe to. Sites like Time Out have large, searchable directories of events, but they don't have everything and are difficult to customise.

In an attempt to remedy this, I've started writing scrapers for venues that I'm interested in, and a framework to run them in. It's a similar set-up to CiteULike's scrapers: they can be in any scripting language that'll run on the command line (mine are in PHP but there are good scraping libraries available for Ruby and Python); they just need to output their results in a standard format.
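To make the idea concrete, here is a minimal sketch of how a framework like this might run command-line scrapers and collect their output. It is written in Python purely for illustration and is not the code on GitHub; the function names `run_scraper` and `run_all`, the 60-second timeout and the stand-in demo command are all assumptions.

```python
import subprocess
import sys


def run_scraper(command, timeout=60):
    """Run one scraper as a child process and return whatever it prints to stdout."""
    result = subprocess.run(
        command,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the scraper exits non-zero
    )
    return result.stdout


def run_all(scrapers, timeout=60):
    """Run each named scraper; collect output by name and keep failures separate."""
    outputs, failures = {}, {}
    for name, command in scrapers.items():
        try:
            outputs[name] = run_scraper(command, timeout)
        except (subprocess.SubprocessError, OSError) as error:
            # One broken or missing scraper should not stop the others.
            failures[name] = str(error)
    return outputs, failures


if __name__ == "__main__":
    # Stand-in scraper for the demo. A real entry might look like
    # ["php", "scrapers/some-venue.php"] (hypothetical path).
    demo = {"demo-venue": [sys.executable, "-c", "print('BEGIN:VCALENDAR')"]}
    outputs, failures = run_all(demo)
    print(outputs, failures)
```

Because each scraper is just a command, the framework doesn't need to know which language it was written in; it only cares about what arrives on standard output.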

At the moment the standard format is iCalendar, but I think it'll probably make more sense to use an XML, RDF or JSON serialisation, depending on how complex the data ends up being, and to add iCalendar as an output option once the results have been processed.
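As a rough sketch of that second approach, assuming a JSON intermediate format, the conversion to iCalendar could look something like the following. The field names (`uid`, `summary`, `start`, `end`, `location`), the `PRODID` value and the function names are all made up for illustration, and the sketch skips the 75-octet line folding that RFC 5545 asks for.

```python
import json
from datetime import datetime, timezone


def escape_text(value):
    """Escape a TEXT value as RFC 5545 requires (backslash, semicolon, comma, newline)."""
    return (
        value.replace("\\", "\\\\")
        .replace(";", "\\;")
        .replace(",", "\\,")
        .replace("\n", "\\n")
    )


def format_datetime(iso_string):
    """Convert an ISO 8601 string to iCalendar form; naive times stay 'floating'."""
    moment = datetime.fromisoformat(iso_string)
    if moment.tzinfo is None:
        return moment.strftime("%Y%m%dT%H%M%S")
    return moment.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def events_to_icalendar(events, stamp=None):
    """Serialise a list of event dicts (hypothetical schema) as one VCALENDAR."""
    stamp = (stamp or datetime.now(timezone.utc)).astimezone(timezone.utc)
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//example//events-sketch//EN"]
    for event in events:
        lines += [
            "BEGIN:VEVENT",
            "UID:" + event["uid"],
            "DTSTAMP:" + stamp.strftime("%Y%m%dT%H%M%SZ"),
            "DTSTART:" + format_datetime(event["start"]),
            "SUMMARY:" + escape_text(event["summary"]),
        ]
        if event.get("end"):
            lines.append("DTEND:" + format_datetime(event["end"]))
        if event.get("location"):
            lines.append("LOCATION:" + escape_text(event["location"]))
        lines.append("END:VEVENT")
    lines.append("END:VCALENDAR")
    # iCalendar content lines are terminated with CRLF.
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    sample = json.loads(
        '[{"uid": "1@example.org", "summary": "Gig, with support",'
        ' "start": "2009-06-12T19:30:00", "location": "Some Venue"}]'
    )
    print(events_to_icalendar(sample))
```

Keeping the scrapers' output in a simple intermediate format means iCalendar becomes just one of several possible views of the same data.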

One aim is to make a site that shows events on today at just the venues I'm interested in, but I'm sure there'll be other uses once the data is available. Hopefully it'll illustrate to venues the value of publishing decent structured events data too.
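The "what's on today" view could be little more than a filter over the collected events. A minimal sketch, assuming each event is a dict with hypothetical `venue`, `title` and `start` fields and that all start times are naive local ISO 8601 strings:

```python
from datetime import date, datetime


def events_on(events, day, venues):
    """Return events at the chosen venues that start on the given day, earliest first."""
    wanted = set(venues)
    matches = [
        event
        for event in events
        if event["venue"] in wanted
        and datetime.fromisoformat(event["start"]).date() == day
    ]
    return sorted(matches, key=lambda event: datetime.fromisoformat(event["start"]))


if __name__ == "__main__":
    # Made-up sample data, just to show the shape of the input.
    sample = [
        {"venue": "Venue A", "title": "Late show", "start": "2009-06-12T21:00:00"},
        {"venue": "Venue B", "title": "Elsewhere", "start": "2009-06-12T20:00:00"},
    ]
    for event in events_on(sample, date(2009, 6, 12), ["Venue A"]):
        print(event["start"], event["title"])
```

For the live site, `day` would simply be `date.today()`.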

The code's on GitHub; if you have scrapers to contribute, send me an email or a pull request.

Comments

Oh, you could give XML_GRDDL a shot on any webpages that you know might be publishing hCalendar (hCal).

Take a peek @ http://pear.php.net/package/XML_GRDDL/docs/latest/__filesource/fsource_XML_GRDDL__XML_GRDDL-0.1.1docsflickr-linkedin.php.html for instance

Neat! We need something for the DWC ;-)
Do you know about Songkick - http://www.songkick.com ? They have a huge repository of upcoming events.

Yes, Songkick are good, but they're one of those aggregators that don't have everything. I'd like to write scrapers for each of the original sources rather than rely on a third-party if possible.

I'd be surprised if many sources publish hCalendar, but GRDDL's certainly worth a try if they do.

I've written some scrapers too, but I'm trying to avoid that. A useful resource is FuseCal. If it can parse an event page, you can piggyback on it and harvest iCalendar data from it.
