Category Archives: Geograph

Anything to do with the Geograph British Isles website

Distributing the Geograph Archive with BitTorrent?

Geograph attracts the occasional bit of unfriendly leeching, where someone mirrors the entire site at high speed. Considering we’ve got close to 250,000 pages, that can be quite a bandwidth hit for a non-profit entity.

We’ve policed this manually up until now, but we’ll certainly need to add some automated software protection to separate legitimate browsers from robots (maybe using Kitten Auth!!).

Trouble is, we are archiving reusable content, so we really want people to pull down the whole thing and do something interesting with it.

We could set up something like a bandwidth-limited rsync server to do this, but it’s still eating our precious bandwidth. Which has got me thinking that maybe BitTorrent is the way to go.

In effect, we’d create a fairly large torrent file listing every JPEG image file. We can initially seed it from our off-site backups, and then encourage as many people as possible to download and seed it further.

Not only would we conserve server bandwidth, we’d also have created a highly distributed secondary backup!

I wonder whether a torrent of many thousands of files is pushing it a bit. Reading the BitTorrent specification, it does look possible, although the .torrent file might be rather large, since it will list every image file. What we may have to do is split it up into several “volumes”, with a torrent for each chunk of, say, 50,000 images.
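A rough sketch of the “volumes” idea in Python, purely for illustration. The file names and the `split_into_volumes` helper are made up, and the 50,000 figure is just the example size mentioned above; each resulting list would then be handed to whatever tool builds the .torrent.

```python
from itertools import islice


def split_into_volumes(paths, volume_size=50_000):
    """Yield successive lists of at most volume_size paths."""
    iterator = iter(paths)
    while True:
        volume = list(islice(iterator, volume_size))
        if not volume:
            return
        yield volume


# Hypothetical image paths; the real archive layout will differ.
images = [f"photos/{n:06d}.jpg" for n in range(1, 120_001)]

volumes = list(split_into_volumes(images))
for number, volume in enumerate(volumes, start=1):
    print(f"volume {number}: {len(volume)} files, {volume[0]} .. {volume[-1]}")
```

Keeping the volumes in a stable order (by image id, say) would mean that older volumes never change as new images arrive, so only the newest torrent would need regenerating.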

Why not just create a tarball of the images? Well, that would make for a small .torrent, but I don’t think it would encourage people to seed it after downloading. Once unpacked, the tarball itself is essentially useless: many people would simply extract it and discard the tarball, preventing further seeding. If the torrent gives you immediately useful files, then hopefully we would see more people seeding it.

Could be a few months before I get around to giving this a try, but it sounds good, yes?

(Edit – implemented at long last – see this post for details.)

Some goals for April

I’m going to set myself a few goals for April, as there’s a variety of things I’ve been putting off. I’m hoping that by “going public” it will spur me to get them done!

Complete tagging system for Geograph

Geograph has a single “category” for each image and I’ve been working on-and-off since January to replace this with a decent tagging system. It’ll make such a difference to the usability of the archive and open up new possibilities, so it can’t happen soon enough. I’ll write about the design decisions made too…

Learn Python

This “Why Python?” article by Eric Raymond got me thinking I should give Python another crack. The next time I find myself turning to Perl I’ll see what Python has to offer. I like Perl for speed of development and the CPAN library, but I find it really time-consuming to write decent OO code in Perl, so I rarely bother. Let’s see if I can learn to love Python too!

Release source code to pastebin.com

It’s been a year since the last release, and I get a steady stream of emails asking for the source. During April I will tidy the source, add a few common feature requests and release it under the GPL.

I think that’s enough. Let’s see how it goes…

RESTful images

Thought I’d play around and make a WordPress plugin to show a selected Geograph image, and form the beginnings of a REST style API for Geograph.

First thing was to provide a way to obtain picture metadata – this was pretty straightforward – requesting a URL like this

Returns an XML result like this

<geograph>
<status state="ok"/>
<title>From Bygrave to Baldock</title>
<gridref>TL2435</gridref>
<user profile="http://www.geograph.org.uk/profile.php?u=2">Paul Dixon</user>
<img src="http://www.geograph.org.uk/photos/04/02/040212_f4e3079a.jpg" width="480" height="640"/>
</geograph>
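For anyone wanting to consume a result like this, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. It parses the sample response above rather than calling the live service, and the `parse_photo` helper and the dictionary keys are my own naming, not part of any Geograph API.

```python
import xml.etree.ElementTree as ET

# The sample response shown above, embedded so the sketch runs offline.
SAMPLE = """<geograph>
<status state="ok"/>
<title>From Bygrave to Baldock</title>
<gridref>TL2435</gridref>
<user profile="http://www.geograph.org.uk/profile.php?u=2">Paul Dixon</user>
<img src="http://www.geograph.org.uk/photos/04/02/040212_f4e3079a.jpg" width="480" height="640"/>
</geograph>"""


def parse_photo(xml_text):
    """Turn a <geograph> response into a plain dictionary."""
    root = ET.fromstring(xml_text)
    status = root.find("status")
    if status is None or status.get("state") != "ok":
        raise ValueError("response did not report state=ok")
    user = root.find("user")
    img = root.find("img")
    return {
        "title": root.findtext("title"),
        "gridref": root.findtext("gridref"),
        "user": user.text,
        "profile": user.get("profile"),
        "src": img.get("src"),
        "width": int(img.get("width")),
        "height": int(img.get("height")),
    }


photo = parse_photo(SAMPLE)
print(photo["title"], photo["gridref"], photo["width"], photo["height"])
```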

I knocked up a plugin which maintains a list of “interesting” picture ids. If it has been displaying one for more than 24 hours, it requests the metadata for the next image in the queue, pulls in the image, resizes it to suit this layout, and generates a cached HTML fragment for display.
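The rotation part of that logic might look something like the sketch below. The actual plugin is a WordPress (PHP) plugin; this is a Python illustration of the idea only, with an invented `FeaturedImage` class and made-up picture ids. Fetching the metadata, resizing, and writing the cached HTML fragment are left out and would happen whenever the current id changes.

```python
from collections import deque

ROTATE_AFTER = 24 * 60 * 60  # seconds an image stays on display


class FeaturedImage:
    """Cycle through a queue of picture ids, one per 24 hours."""

    def __init__(self, picture_ids):
        self.queue = deque(picture_ids)
        self.shown_since = None  # timestamp when the current id went up

    def current(self, now):
        """Return the id to display at time `now` (seconds)."""
        if self.shown_since is None:
            self.shown_since = now
        elif now - self.shown_since > ROTATE_AFTER:
            # Move the displayed id to the back and show the next one.
            self.queue.rotate(-1)
            self.shown_since = now
        return self.queue[0]


# Hypothetical picture ids and timestamps, for illustration.
featured = FeaturedImage([101, 202, 303])
first = featured.current(now=0)
same_day = featured.current(now=3_600)
next_day = featured.current(now=90_000)
print(first, same_day, next_day)
```

Passing `now` in explicitly, rather than reading the clock inside the method, keeps the rotation easy to test.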

I could do with adding some extra metadata to the result, as well as making new REST-style APIs for user profiles, grid squares etc. We already have an API for bulk data retrieval which needs a key, but these simpler APIs might enable more little gizmos like this one to be produced.