Author Archives: Paul Dixon

HertsLUG March 2007

[Image: Restoring a Virtual Machine Snapshot - amazing photoshop skills]

I’ve been a bit lax in attending the Hertfordshire Linux User Group meetings of late. I finally made it last night and gave a talk on virtual machines, which I’d been promising since December. I mainly concentrated on the free VMware Server, since I use it for day-to-day development, but others mentioned their experiences with QEMU and the intriguing EasyVMX, which allows online creation of virtual machines for use in VMware Player.

The image above is a hastily constructed graphic illustrating how a destroyed virtual machine can be quickly replaced from a backup snapshot. I demonstrated this by running “rm / -Rf” on a virtual machine, which was fun.

The talk seemed to go down well anyway, and I hope to become more of a regular attendee!

Geograph’s Second Birthday

Geograph is two years old today! We recently filled 50% of all the grid squares, and have over 350,000 images submitted by 3750 photographers, all available for reuse under the Creative Commons licence. Press coverage and the Yahoo “Find of the Year” award have driven up usage, and it won’t be long before we routinely average 1000 new photos a day.

It’s been a pretty good year for the project, with the Ordnance Survey sponsorship deal, new servers coming online and increased publicity.

My spare time has been quite stretched recently, with communications and server maintenance mopping up a lot of it, but I hope to have the tagging features complete “real soon now”. Beyond that, we’ll be working with Ordnance Survey over the rest of the year to further develop the education potential of the site, and hopefully rolling out an improved site design too.

(Edit: Geograph was Radio 2’s site of the day!)

Geograph brought down by sky2 network driver failure

Yesterday’s outage of the Geograph website happened because the network driver failed on all three webservers, leaving their network interfaces unusable. Although there are many references to similar failures, I thought it would be useful to write about it, if only to give a little more Google-juice to the problem.

Geograph’s three webservers are running Ubuntu 6.06 LTS, regularly updated. The eth0 NIC is a Marvell Technology Group Ltd. 88E8050 Gigabit Ethernet Controller (rev 17), driven by the sky2 driver.

Each of those NICs failed at some point on Sunday, but the servers themselves kept on trucking, eventually writing entries like this to syslog:

Feb 24 16:30:30 scone kernel: [35337220.416000] NETDEV WATCHDOG: eth0: transmit timed out
Feb 24 16:30:30 scone kernel: [35337220.416000] sky2 eth0: tx timeout
Feb 24 16:30:30 scone kernel: [35337220.420000] sky2 eth0: transmit ring 112 .. 89 report=112 done=112
Feb 24 16:30:30 scone kernel: [35337220.420000] sky2 hardware hung? flushing

There are many reports of similar failures. One suggested fix is replacement of the sky2 driver with sk98lin, but as our remote KVM is also down, we’re limited to actions we can reliably take over a network connection (in the short term at least).

So, for some short-term protection against recurrence, I’ve written a simple watchdog script called by cron every 5 minutes. It performs some network connectivity tests, and if they all fail, increments a counter. If the script is called and the tests have failed for the 4th successive time, it will attempt to reload the sky2 module, and if that doesn’t work, trigger an immediate reboot. This should mean that a server will enter “radio silence” for around 15 minutes and then recover. That’s a tolerable delay for a cluster of three servers.
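For anyone wanting to try the same approach, here is a rough sketch of that kind of watchdog. It is not the script running on the Geograph servers: it is written in present-day Python, assumes a Linux `ping`, and the counter file path, ping targets and the `--run` flag are all made-up placeholders.

```python
#!/usr/bin/env python3
"""Sketch of a cron-driven network watchdog. All settings are assumptions."""
import subprocess
import sys
import time
from pathlib import Path

COUNTER_FILE = Path("/var/run/net-watchdog.count")  # hypothetical location
PING_TARGETS = ["192.0.2.1", "192.0.2.2"]  # placeholders; use hosts you trust
MAX_FAILURES = 4  # act on the 4th successive failed run
MODULE = "sky2"


def host_reachable(host):
    """Return True if one ping to host is answered within 5 seconds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "5", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def network_ok():
    """The network counts as down only if every connectivity test fails."""
    return any(host_reachable(host) for host in PING_TARGETS)


def next_state(failures, ok, max_failures=MAX_FAILURES):
    """Pure decision step: return (new_failure_count, action).

    action is "none" or "recover". The count resets after a recovery
    attempt so that a failed recovery is not retried on every run.
    """
    if ok:
        return 0, "none"
    failures += 1
    if failures >= max_failures:
        return 0, "recover"
    return failures, "none"


def read_count(path=COUNTER_FILE):
    """Read the stored failure count, treating a missing or bad file as 0."""
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return 0


def recover():
    """Reload the driver module; reboot if the network is still down."""
    subprocess.run(["modprobe", "-r", MODULE])
    subprocess.run(["modprobe", MODULE])
    # Depending on the distribution, the interface may also need
    # bringing back up at this point.
    time.sleep(10)
    if not network_ok():
        subprocess.run(["shutdown", "-r", "now"])


def main():
    count, action = next_state(read_count(), network_ok())
    COUNTER_FILE.write_text(str(count))
    if action == "recover":
        recover()


# Requires an explicit flag so that running the file by accident
# can never reload a driver or reboot the machine.
if __name__ == "__main__" and "--run" in sys.argv:
    main()
```

It would need to run as root from cron every 5 minutes with the `--run` flag. Keeping the decision in `next_state`, separate from the pinging and rebooting, means the counting logic can be checked on a desktop machine without touching a live server.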

Fortunately, we’ve found that the second NIC on each machine is an Intel Corporation 82541GI/PI Gigabit Ethernet controller, driven by the e1000 driver. By all accounts, this should be much more stable. So, longer term, we’ll be switching the cabling over to the second NIC.
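If you want to check which driver each interface on a Linux box is using, sysfs will tell you. The snippet below is a small sketch of that lookup rather than anything we ran at the time. The function name `nic_drivers` is my own invention, and it relies on the usual `/sys/class/net/<interface>/device/driver` symlink being present.

```python
import os
from pathlib import Path


def nic_drivers(sys_net="/sys/class/net"):
    """Map each network interface to the kernel driver bound to it.

    Interfaces with no backing device (for example lo) are skipped.
    """
    drivers = {}
    for iface in sorted(Path(sys_net).iterdir()):
        link = iface / "device" / "driver"
        if link.is_symlink():
            # The symlink points at the driver's directory, whose
            # final path component is the driver name.
            drivers[iface.name] = os.path.basename(os.readlink(link))
    return drivers


if __name__ == "__main__":
    for name, driver in nic_drivers().items():
        print(f"{name}: {driver}")
```

Any interface reported as using sky2 would be one to move away from.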

So, the moral of the story is, don’t build a server which uses the sky2 driver!