Geograph attracts the occasional bit of unfriendly leeching, where someone mirrors the entire site at high speed. Considering we’ve got close to 250,000 pages, that can be quite a bandwidth hit for a non-profit entity.
We’ve policed this manually up until now, but we’ll certainly need to add some automated software protection to separate legitimate browsers from robots (maybe using Kitten Auth!!).
Trouble is, we are archiving reusable content, so we really want people to pull down the whole thing and do something interesting with it.
We could set up something like a bandwidth-limited rsync server to do this, but it would still be eating our precious bandwidth. Which has got me thinking that maybe BitTorrent is the way to go.
In effect, we’d create a fairly large torrent file listing every JPEG image file. We can initially seed it from our off-site backups, and then encourage as many people as possible to download and seed it further.
Not only would we conserve server bandwidth, we’d also have created a highly distributed secondary backup!
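Roughly speaking, building such a multi-file torrent means hashing every file into fixed-size pieces and bencoding the result. Here’s a minimal Python sketch of that, nothing more: the piece size, directory path, tracker URL and the build_torrent name are all placeholders I’ve made up, and the bencoder only handles the few types a .torrent metainfo actually needs.

```python
import hashlib
import os

PIECE_LENGTH = 2 ** 20  # 1 MiB pieces (placeholder choice)


def bencode(value):
    """Minimal bencoder covering the types a .torrent metainfo needs."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # dictionary keys must appear in sorted order
        return (b"d"
                + b"".join(bencode(k) + bencode(value[k]) for k in sorted(value))
                + b"e")
    raise TypeError("cannot bencode %r" % type(value))


def build_torrent(root_dir, tracker_url, out_file):
    """Hash every file under root_dir into one multi-file .torrent."""
    files, sha1s, buffer = [], [], b""

    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()                      # deterministic ordering
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            files.append({
                "length": os.path.getsize(full_path),
                "path": os.path.relpath(full_path, root_dir).split(os.sep),
            })
            with open(full_path, "rb") as fh:
                while True:
                    chunk = fh.read(PIECE_LENGTH)
                    if not chunk:
                        break
                    buffer += chunk
                    while len(buffer) >= PIECE_LENGTH:
                        sha1s.append(hashlib.sha1(buffer[:PIECE_LENGTH]).digest())
                        buffer = buffer[PIECE_LENGTH:]
    if buffer:                               # final partial piece
        sha1s.append(hashlib.sha1(buffer).digest())

    metainfo = {
        "announce": tracker_url,
        "info": {
            "name": os.path.basename(os.path.abspath(root_dir)),
            "piece length": PIECE_LENGTH,
            "pieces": b"".join(sha1s),       # 20 bytes of SHA-1 per piece
            "files": files,
        },
    }
    with open(out_file, "wb") as fh:
        fh.write(bencode(metainfo))

# Hypothetical call, paths and tracker made up:
# build_torrent("/backup/geograph-images", "http://tracker.example.org/announce",
#               "geograph-images.torrent")
```

Two parts of this grow with the archive: the “pieces” string (20 bytes of SHA-1 per piece) and the “files” list (one entry per image), which is where the bulk of a quarter-million-image .torrent would come from.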
I wonder whether a torrent of many thousands of files is pushing it a bit. Reading the BitTorrent specification, it does look possible, although the .torrent file might be rather large, since it has to list every image file. What we may have to do is split it up into several “volumes”, with a torrent for each chunk of 50,000 images, for example.
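The splitting itself is mostly just grouping the file list. Another rough sketch, assuming the JPEGs all sit under one directory tree and have unique filenames (both assumptions on my part), with the hypothetical build_torrent() from the sketch above then run once per volume directory:

```python
import os

VOLUME_SIZE = 50000  # images per "volume" torrent


def make_volumes(image_root, volume_root):
    """Hard-link each chunk of VOLUME_SIZE JPEGs into its own volume directory,
    so something like build_torrent() can be run once per volume."""
    images = sorted(
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(image_root)
        for name in names
        if name.lower().endswith((".jpg", ".jpeg"))
    )
    for volume_no, start in enumerate(range(0, len(images), VOLUME_SIZE), start=1):
        vol_dir = os.path.join(volume_root, "geograph-vol%02d" % volume_no)
        os.makedirs(vol_dir, exist_ok=True)
        for src in images[start:start + VOLUME_SIZE]:
            # hard links avoid duplicating the whole archive on disk;
            # assumes filenames are unique across the archive
            dst = os.path.join(vol_dir, os.path.basename(src))
            if not os.path.exists(dst):
                os.link(src, dst)
```

Because the staging directories are hard links, they take essentially no extra disk space, and each volume becomes its own independently seedable torrent.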
Why not just create a tarball of the images? Well, that would make for a small .torrent, but I don’t think it would encourage people to seed it after downloading. A tarball isn’t much use in itself: many people would simply uncompress it and discard the original, preventing further seeding. If the torrent gives you immediately useful files, then hopefully we would see more people seeding it.
Could be a few months before I get around to giving this a try, but it sounds good, yes?
(Edit – implemented at long last – see this post for details.)