Monthly Archives: February 2007

Geograph brought down by sky2 network driver failure

Yesterday’s outage of the Geograph website happened because the network interfaces on all three webservers became unusable after a network driver failure. Although there are many references to similar failures, I thought it would be useful to write about it, if only to give a little more Google-juice to the problem.

Geograph’s three webservers are running Ubuntu 6.06 LTS, regularly updated. The eth0 NIC is a Marvell Technology Group Ltd. 88E8050 Gigabit Ethernet Controller (rev 17), driven by the sky2 driver.

Each of those NICs failed at some point on Sunday, but the servers themselves kept on trucking, eventually writing entries like this to syslog:

Feb 24 16:30:30 scone kernel: [35337220.416000] NETDEV WATCHDOG: eth0: transmit timed out
Feb 24 16:30:30 scone kernel: [35337220.416000] sky2 eth0: tx timeout
Feb 24 16:30:30 scone kernel: [35337220.420000] sky2 eth0: transmit ring 112 .. 89 report=112 done=112
Feb 24 16:30:30 scone kernel: [35337220.420000] sky2 hardware hung? flushing

There are many reports of similar failures. One suggested fix is replacement of the sky2 driver with sk98lin, but as our remote KVM is also down, we’re limited to actions we can reliably take over a network connection (in the short term at least).

So, for some short-term protection against recurrence, I’ve written a simple watchdog script called by cron every 5 minutes. It performs some network connectivity tests, and if they all fail, increments a counter. If the tests fail for the 4th successive run, the script will attempt to reload the sky2 module, and if that doesn’t work, trigger an immediate reboot. This should mean that a server will enter “radio silence” for around 15 minutes and recover. That’s a tolerable delay for a cluster of three servers.
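The watchdog logic described above can be sketched roughly as follows. This is not the actual script, just a minimal Python illustration of the same idea. The counter file path, the test addresses, the settle delay and all function names are assumptions made for the example, and it assumes the interface comes back up by itself once the module is reloaded.

```python
import subprocess
import time
from pathlib import Path

# Hypothetical values for illustration only; none come from the real script.
COUNTER_FILE = Path("/var/run/net-watchdog.count")
TEST_HOSTS = ["192.0.2.1", "192.0.2.2"]  # placeholder documentation addresses
FAILURE_LIMIT = 4      # act on the 4th successive failed run
SETTLE_SECONDS = 30    # assumed wait after reloading the module


def host_reachable(host):
    """Return True if a single ping to host succeeds within 5 seconds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "5", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def next_state(failures, network_ok, limit=FAILURE_LIMIT):
    """Pure decision step: return (new_failure_count, action)."""
    if network_ok:
        return 0, "none"        # any success resets the counter
    failures += 1
    if failures >= limit:
        return 0, "recover"     # limit reached: try to recover
    return failures, "wait"     # not yet: just remember the failure


def read_count(path):
    """Read the stored failure count, treating a missing or bad file as 0."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return 0


def recover(hosts):
    """Reload sky2, then reboot if connectivity has not come back."""
    subprocess.run(["modprobe", "-r", "sky2"])
    subprocess.run(["modprobe", "sky2"])
    time.sleep(SETTLE_SECONDS)
    if not any(host_reachable(h) for h in hosts):
        subprocess.run(["shutdown", "-r", "now"])


def run_once(counter_file=COUNTER_FILE, hosts=TEST_HOSTS):
    """One watchdog pass, intended to be run as root from cron every 5 minutes."""
    # Only count a failure when every connectivity test fails.
    network_ok = any(host_reachable(h) for h in hosts)
    failures, action = next_state(read_count(counter_file), network_ok)
    counter_file.write_text(str(failures))
    if action == "recover":
        recover(hosts)
```

Keeping the decision in next_state separate from the pinging and rebooting means the counting rule can be checked without touching the network. With a 5 minute cron interval, three failed runs only bump the counter and the fourth triggers recovery, which is where the figure of roughly 15 minutes between the first failed check and the recovery attempt comes from.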

Fortunately, we’ve found that the second NIC on each machine is an Intel Corporation 82541GI/PI Gigabit Ethernet controller, driven by the e1000 driver. By all accounts, this should be much more stable. So longer term we’ll be switching the cabling over to the second NIC.

So, the moral of the story is, don’t build a server which uses the sky2 driver!

Geograph Down!

Geograph has three redundant webservers – from around 6am they’ve all been down. I’m working to restore them now.

Edit: 7:50am – servers not responding to power cycles, this could be an extended outage. Made the load balancer redirect to a page explaining what is going on. Seems a trip to Heathrow is in order….

Edit: 8:22am – data centre staff are attending shortly, hopefully they can resolve this without a road trip…

Edit: 9:05am – looks like we’re back. Phew!

Big thank you due to Mark at Fubra for his assistance on what was already a frantic Sunday morning for him!

Zend PDT – PHP plugin for Eclipse

I’ve been using PHPEclipse for PHP development for 9 months or so and finding it a real time saver. Sadly it seems that further development has stalled, possibly because of Zend’s announcement in late 2005 that they too would build their own Eclipse plugin.

That has finally borne some fruit with the recent release of a reasonably functional version of Zend PDT, or PHP Development Tools.

Snappy name.

Still, even though a final release isn’t slated until September, this version isn’t too shabby and appears to have a comparable feature set to PHPEclipse already. I haven’t tried the debugger yet, but editing features seem good, with folding of comment blocks and methods, help with PHPDoc tags, a nice class inspector pane and reasonable parsing of PHP code for problems (it doesn’t seem to spot as many problems as PHPEclipse, but maybe I’ve missed a setting somewhere).

So far, it’s shaping up to be a pretty good IDE, particularly if you are already using Eclipse. Remote debugging of PHP, in combination with the Subclipse SVN plugin and the Web Tools Platform plugin, gives you a pretty capable development environment.

Methinks I’ll be switching permanently soon!