Monthly Archives: April 2006

Distributing the Geograph Archive with BitTorrent?

Geograph attracts the occasional bit of unfriendly leeching, where someone mirrors the entire site at high speed. Considering we’ve got close to 250,000 pages, that can be quite a bandwidth hit for a non-profit entity.

We’ve policed this manually up until now, but we’ll certainly need to add some automated software protection to separate legitimate browsers from robots (maybe using Kitten Auth!!).

Trouble is, we are archiving reusable content, so we really want people to pull down the whole thing and do something interesting with it.

We could set up something like a bandwidth-limited rsync server to do this, but it’s still eating our precious bandwidth. Which has got me thinking that maybe BitTorrent is the way to go.

In effect, we’d create a fairly large torrent file listing every JPEG image file. We can initially seed it from our off-site backups, and then encourage as many people as possible to download and seed it further.

Not only would we conserve server bandwidth, we’d also have created a highly distributed secondary backup!

I wonder whether a torrent of many thousands of files is pushing it a bit. Reading the BitTorrent specification, it does look possible, although the .torrent file might be rather large, since it will list every image file. What we may have to do is split it up into several “volumes”, with a separate torrent for each chunk of 50,000 images, for example.
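To put a rough number on "rather large", here is a back-of-envelope sketch. In the BitTorrent metainfo format, a multi-file torrent carries one bencoded entry (length plus path) per file, and one 20-byte SHA-1 hash per piece. The average image size, the 256 KiB piece length and the file naming scheme below are made-up assumptions for illustration, not real Geograph figures, and the small fixed overhead (announce URL, torrent name and so on) is ignored.

```php
<?php
// Bencoded "files" entry for one file with a single path component,
// e.g. d6:lengthi80000e4:pathl10:000001.jpgee
function bencoded_file_entry($length, $path) {
    return 'd6:lengthi' . $length . 'e4:pathl' . strlen($path) . ':' . $path . 'ee';
}

// Approximate .torrent size: per-file entries plus 20 bytes of SHA-1 per piece.
// Assumes every file has roughly the same size and path length.
function estimate_torrent_bytes($fileCount, $avgFileBytes, $pieceBytes, $samplePath) {
    $entryBytes = strlen(bencoded_file_entry($avgFileBytes, $samplePath));
    $pieces     = (int) ceil(($fileCount * $avgFileBytes) / $pieceBytes);
    return $fileCount * $entryBytes + $pieces * 20;
}

// Hypothetical volume: 50,000 images of about 80,000 bytes each,
// 256 KiB pieces, file names like "000001.jpg".
$bytes = estimate_torrent_bytes(50000, 80000, 262144, '000001.jpg');
echo "Estimated .torrent size: $bytes bytes\n";
```

Under those assumptions a 50,000-image volume comes out at a couple of megabytes of .torrent, most of it the file list rather than the piece hashes, so longer paths would push it up further.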

Why not just create a tarball of the images? Well, that would make for a small .torrent, but I don’t think it would encourage people to seed it after downloading. A tarball is essentially useless once unpacked: many people will simply extract it and discard the tarball, preventing further seeding. If the torrent gives you immediately useful files, then hopefully we would see more people seeding it.

Could be a few months before I get around to giving this a try, but it sounds good, yes?

(Edit – implemented at long last – see this post for details.)

Lazy wrapping with PHP5’s __call method

The overloading features in PHP5 allow you to intercept calls to unknown methods at runtime. I recently used this feature to rapidly develop a class wrapper for Xapian, and it’s interesting to see how the technique works and what it achieves.

In this article, I’ll explain how you can use __call to wrap a flat, function-based PHP5 library into a set of classes with minimal effort. I’ll use Xapian as an example – it has a SWIG-generated binding which flattens the library’s C++ class interface into a set of functions.

Using a wrapper we can turn code like this…

$enquire = new_Enquire($database);
Enquire_set_query($enquire, $query);
$matches = Enquire_get_mset($enquire, 0, 10);

…into code like this…

$enquire = new XapianEnquire($database);
$enquire->set_query($query);
$matches = $enquire->get_mset(0, 10);

This lets you use the library in much the same way as people working in other languages, which aids communication. It also gives you all the usual benefits of OO software design, such as deriving from these classes to provide specialised functionality.

So far so good, but wait until you see how lightweight the implementation of that XapianEnquire wrapper class is…

class XapianEnquire extends XapianWrapper { }

That’s not a typo. That’s all you need. Instantiate one of those and you can call all the methods you would expect from the Xapian documentation. Try and call a method which doesn’t exist and it will throw an exception.

Voodoo, surely?
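Not really. As a rough sketch of the idea (not necessarily how the real XapianWrapper is written), a base class can work out the function prefix from its own class name and let __call forward everything else to the flat functions. The three stub functions here are stand-ins for the SWIG bindings so that the example runs without Xapian installed, and the rule of stripping "Xapian" from the class name is an assumption.

```php
<?php
// Stand-in stubs for the flat, SWIG-style functions (not the real bindings).
function new_Enquire($database) {
    $handle = new stdClass();
    $handle->database = $database;
    $handle->query = null;
    return $handle;
}
function Enquire_set_query($handle, $query) {
    $handle->query = $query;
}
function Enquire_get_mset($handle, $first, $max) {
    return array($handle->database, $handle->query, $first, $max);
}

abstract class XapianWrapper {
    protected $handle;

    // Assumed naming rule: "XapianEnquire" maps to the prefix "Enquire".
    protected function prefix() {
        return substr(get_class($this), strlen('Xapian'));
    }

    // new XapianEnquire($db) becomes new_Enquire($db)
    public function __construct() {
        $args = func_get_args();
        $this->handle = call_user_func_array('new_' . $this->prefix(), $args);
    }

    // $obj->set_query($q) becomes Enquire_set_query($handle, $q)
    public function __call($method, $args) {
        $function = $this->prefix() . '_' . $method;
        if (!function_exists($function)) {
            throw new BadMethodCallException("Unknown method: $method");
        }
        array_unshift($args, $this->handle);
        return call_user_func_array($function, $args);
    }
}

class XapianEnquire extends XapianWrapper { }

$enquire = new XapianEnquire('demo-db');
$enquire->set_query('bridge');
$matches = $enquire->get_mset(0, 10);
```

Because the lookup happens at call time, the empty subclass needs no method list of its own. The trade-off is that a mistyped method name only shows up as an exception at runtime.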