Author Archives: Paul Dixon

Digg and LiveJournal Scalability and Performance

Interesting article on Digg and PHP’s scalability and performance. The article mentions 8 slave database servers for just 3 webservers, which seemed an odd configuration.

I posted a question, but the author didn't elaborate much beyond linking to an interesting set of slides about LiveJournal, which walk through increasingly elaborate scaling scenarios. (Edit: those slides are from 2004; a later 2005 presentation has more information.)

One interesting technique is lumping sets of users together into clusters, so that you can have a master DB for each cluster. Beyond that, they've built a master-master setup, where you perform your writes to twin masters. Looking at these techniques, it becomes a little clearer how you might have 3 webservers with 8 db servers.
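To make the idea concrete, here's a minimal sketch of routing each user's writes to the masters of their cluster. The hostnames and the user-to-cluster lookup table are placeholders of my own, not LiveJournal's actual implementation:

```python
# Sketch of user clustering with master-master writes.
# Hostnames and the user->cluster mapping are placeholders.

CLUSTERS = {
    1: ["db1a.example.com", "db1b.example.com"],  # twin masters for cluster 1
    2: ["db2a.example.com", "db2b.example.com"],  # twin masters for cluster 2
}

# In practice this mapping would live in a small global database;
# a dict stands in for it here.
USER_CLUSTER = {1001: 1, 1002: 2, 1003: 1}


def master_for_write(user_id):
    """Return the master that should take writes for this user.

    Splitting users between the twin masters (here simply by odd/even
    user id) means each user's rows are only ever written through one
    master, so the pair can replicate from each other without
    conflicting updates.
    """
    masters = CLUSTERS[USER_CLUSTER[user_id]]
    return masters[user_id % len(masters)]


if __name__ == "__main__":
    for uid in (1001, 1002, 1003):
        print(uid, "->", master_for_write(uid))
```

With reads spread across per-cluster slaves and writes funnelled through the cluster masters, a 3-webserver / 8-database split starts to look a lot less strange.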

The Zen of CSS Design

The CSS Zen Garden site is a fantastic demonstration of what CSS can do. A few days ago I found myself with a few hours to kill while waiting for my car to be repaired, and while browsing a bookshop I came across a book about the site – The Zen of CSS Design.

The book takes a selection of Zen Garden designs from a variety of designers and walks through the process each one took, covering the neat CSS tricks used to achieve the end result.

It's a pleasure to read, with a nice layout and feel to it, and chock-full of useful follow-up links on the techniques demonstrated. Highly recommended!

Yet another of my never-ending side projects for Geograph has been reworking the UI, and I'm hoping to ensure the XHTML has enough "hooks" to enable other designers to easily come up with alternative stylesheets.

Anyway, here's my effort so far.

Geograph Design

Web-2.0-a-licious, no?

Distributing the Geograph Archive with BitTorrent?

Geograph attracts the occasional bit of unfriendly leeching, where someone mirrors the entire site at high speed. Considering we’ve got close to 250,000 pages, that can be quite a bandwidth hit for a non-profit entity.

We’ve policed this manually up until now, but we’ll certainly need to add some automated software protection to separate legitimate browsers from robots (maybe using Kitten Auth!!).
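For what it's worth, here's a rough sketch of the kind of throttle I have in mind: a per-IP sliding-window request counter. The limits and the in-memory store are assumptions of mine; on a multi-server setup the counters would need to live somewhere shared, such as memcached.

```python
# Sketch of a per-IP sliding-window rate limiter.
# Thresholds are placeholder guesses, not tuned values.

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 120            # assumed limit; real browsers rarely exceed this

_hits = defaultdict(deque)    # ip -> timestamps of recent requests


def allow_request(ip, now=None):
    """Return False once an IP exceeds MAX_REQUESTS in the last WINDOW_SECONDS."""
    now = time.time() if now is None else now
    recent = _hits[ip]
    recent.append(now)
    # Drop timestamps that have fallen out of the window.
    while recent and recent[0] < now - WINDOW_SECONDS:
        recent.popleft()
    return len(recent) <= MAX_REQUESTS


if __name__ == "__main__":
    # A script hammering the site gets cut off after MAX_REQUESTS hits.
    print(all(allow_request("10.0.0.1") for _ in range(MAX_REQUESTS)))   # True
    print(allow_request("10.0.0.1"))                                     # False
```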

Trouble is, we are archiving reusable content, so we really want people to pull down the whole thing and do something interesting with it.

We could set up something like a bandwidth-limited rsync server to do this, but it’s still eating our precious bandwidth. Which has got me thinking that maybe BitTorrent is the way to go.

In effect, we’d create a fairly large torrent file listing every JPEG image file. We can initially seed it from our off-site backups, and then encourage as many people as possible to download and seed it further.
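Roughly, building such a torrent comes down to hashing the image data into fixed-size pieces and bencoding the metainfo. Here's a minimal sketch; the bencoding and piece-hashing follow the spec, but the tracker URL, piece size, and directory layout are placeholders rather than anything we'd actually deploy:

```python
# Sketch of building a multi-file .torrent metainfo file.
import hashlib
import os


def bencode(value):
    """Minimal bencoder for the types used in a .torrent (int, str/bytes, list, dict)."""
    if isinstance(value, int):
        return b"i" + str(value).encode() + b"e"
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return str(len(value)).encode() + b":" + value
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Keys must be byte strings in sorted order.
        items = sorted((k.encode() if isinstance(k, str) else k, v)
                       for k, v in value.items())
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value)}")


def piece_hashes(paths, piece_length):
    """SHA-1 of each piece of the files concatenated in the listed order."""
    hashes, buf = [], b""
    for path in paths:
        with open(path, "rb") as fh:
            while True:
                chunk = fh.read(piece_length)
                if not chunk:
                    break
                buf += chunk
                while len(buf) >= piece_length:
                    hashes.append(hashlib.sha1(buf[:piece_length]).digest())
                    buf = buf[piece_length:]
    if buf:
        hashes.append(hashlib.sha1(buf).digest())
    return b"".join(hashes)


def build_torrent(root, paths, tracker, piece_length=256 * 1024):
    """Return the bencoded metainfo for a multi-file torrent rooted at `root`."""
    files = [{"length": os.path.getsize(p),
              "path": os.path.relpath(p, root).split(os.sep)} for p in paths]
    info = {"name": os.path.basename(root),
            "piece length": piece_length,
            "pieces": piece_hashes(paths, piece_length),
            "files": files}
    return bencode({"announce": tracker, "info": info})
```

The "files" list is exactly why the .torrent grows with the archive: every image contributes its path and length to the metainfo.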

Not only would we conserve server bandwidth, we'd also have created a highly distributed secondary backup!

I wonder whether a torrent of many thousands of files is pushing it a bit. Reading the BitTorrent specification, it does look possible, although the .torrent file might be rather large, since it has to list every image file. We may have to split it into several "volumes", with a torrent for each chunk of 50,000 images, for example.
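A quick back-of-the-envelope check suggests each 50,000-image volume would produce a .torrent in the low megabytes, which seems tolerable. The averages below are assumptions for illustration, not measured Geograph figures:

```python
# Rough size estimate for per-volume .torrent files (assumed averages).

VOLUME_SIZE = 50_000          # images per torrent "volume"
AVG_IMAGE_BYTES = 40 * 1024   # assumed average JPEG size
AVG_ENTRY_BYTES = 30          # assumed bencoded path + length per file entry
PIECE_LENGTH = 256 * 1024     # one 20-byte SHA-1 hash is stored per piece


def chunk(seq, size):
    """Yield consecutive volumes of at most `size` items."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def torrent_size_estimate(num_files):
    payload = num_files * AVG_IMAGE_BYTES
    pieces = -(-payload // PIECE_LENGTH)          # ceiling division
    return num_files * AVG_ENTRY_BYTES + pieces * 20


all_images = [f"photos/{i:06d}.jpg" for i in range(250_000)]  # placeholder names
for n, volume in enumerate(chunk(all_images, VOLUME_SIZE), start=1):
    size_mib = torrent_size_estimate(len(volume)) / (1024 * 1024)
    print(f"volume {n}: {len(volume)} files, ~{size_mib:.1f} MiB .torrent")
```

Most of that is the file list itself; the piece hashes only add a couple of hundred kilobytes per volume under these assumptions.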

Why not just create a tarball of the images? Well, that would make for a small .torrent, but I don't think it would encourage people to seed after downloading. A downloaded tarball has little ongoing value: most people would simply uncompress it and then discard it, preventing further seeding. If the torrent gives you immediately useful files, hopefully more people would keep seeding it.

Could be a few months before I get around to giving this a try, but it sounds good, yes?

(Edit – implemented at long last – see this post for details.)