Yesterday’s outage of the Geograph website happened because the network interfaces on all three webservers became unusable after a network driver failure. Although there are many references to similar failures, I thought it would be useful to write about it, if only to give a little more Google-juice to the problem.
Geograph’s three webservers are running Ubuntu 6.06 LTS, regularly updated. The eth0 NIC is a Marvell Technology Group Ltd. 88E8050 Gigabit Ethernet Controller (rev 17), driven by the sky2 driver.
Each of those NICs failed at some point on Sunday, but the servers themselves kept on trucking, eventually writing entries like this to syslog:
Feb 24 16:30:30 scone kernel: [35337220.416000] NETDEV WATCHDOG: eth0: transmit timed out
Feb 24 16:30:30 scone kernel: [35337220.416000] sky2 eth0: tx timeout
Feb 24 16:30:30 scone kernel: [35337220.420000] sky2 eth0: transmit ring 112 .. 89 report=112 done=112
Feb 24 16:30:30 scone kernel: [35337220.420000] sky2 hardware hung? flushing
There are many reports of similar failures. One suggested fix is replacement of the sky2 driver with sk98lin, but as our remote KVM is also down, we’re limited to actions we can reliably take over a network connection (in the short term at least).
So, for some short-term protection against recurrence, I’ve written a simple watchdog script, called by cron every 5 minutes. It performs some network connectivity tests and, if they all fail, increments a counter. When the tests fail for the 4th successive run, the script attempts to reload the sky2 module and, if that doesn’t work, triggers an immediate reboot. This should mean that a server will enter “radio silence” for around 15 minutes and then recover. That’s a tolerable delay for a cluster of three servers.
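For anyone wanting to build something similar, here is a rough sketch of that logic in Python. It is not the script we actually run: the state file path, the test addresses, the 30-second settle time and every function name are assumptions made up for illustration, and it assumes the counter resets as soon as any test succeeds. It also assumes a Python 3 interpreter and the standard Linux ping, modprobe and shutdown commands.

```python
#!/usr/bin/env python3
"""Sketch of a cron-driven NIC watchdog. All names here are hypothetical."""
import subprocess
import sys
import time

# Assumptions for illustration, not values from the real script:
STATE_FILE = "/var/run/nic-watchdog.count"  # where the failure counter lives
TEST_HOSTS = ["192.0.2.1", "192.0.2.2"]     # placeholder addresses to ping
MAX_FAILURES = 4                            # act on the 4th successive failure


def host_reachable(host):
    """Return True if a single ping to host gets a reply within 2 seconds."""
    result = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def network_up(hosts, probe=host_reachable):
    """The network counts as down only if every connectivity test fails."""
    return any(probe(h) for h in hosts)


def read_count(path):
    """Read the failure counter, treating a missing or bad file as zero."""
    try:
        with open(path) as f:
            return int(f.read().strip() or 0)
    except (OSError, ValueError):
        return 0


def write_count(path, count):
    with open(path, "w") as f:
        f.write(str(count))


def next_state(count, up, max_failures=MAX_FAILURES):
    """Return (new_count, should_recover) for one watchdog run."""
    if up:
        return 0, False  # assumed: any success resets the counter
    count += 1
    return count, count >= max_failures


def reload_driver(module="sky2"):
    """Unload and reload the driver. The interface may also need bringing
    back up afterwards (ifdown/ifup, for example); that is omitted here."""
    subprocess.run(["modprobe", "-r", module])
    subprocess.run(["modprobe", module])


def reboot_now():
    subprocess.run(["shutdown", "-r", "now"])


def run_once(state_file=STATE_FILE, hosts=TEST_HOSTS, probe=host_reachable,
             reload=reload_driver, reboot=reboot_now, settle_seconds=30):
    """One cron invocation. Returns 'ok', 'wait', 'reload' or 'reboot'."""
    up = network_up(hosts, probe)
    count, recover = next_state(read_count(state_file), up)
    action = "ok" if up else "wait"
    if recover:
        reload()
        time.sleep(settle_seconds)  # give the NIC a moment to come back
        count = 0                   # start counting afresh either way
        if network_up(hosts, probe):
            action = "reload"
        else:
            write_count(state_file, count)
            reboot()
            return "reboot"
    write_count(state_file, count)
    return action


# Only act when asked explicitly, so importing this file is harmless.
if __name__ == "__main__" and sys.argv[1:] == ["--run"]:
    run_once()
```

Run as root from cron every 5 minutes with the --run argument, something like this should give the behaviour described above: three quiet failures, then a driver reload on the fourth, then a reboot if the reload doesn’t bring the network back. Keeping the decision in next_state, separate from the commands that actually touch the system, makes it possible to test the logic without pulling the network out from under yourself.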
Fortunately, we’ve found that the second NIC on the machine is an Intel Corporation 82541GI/PI Gigabit Ethernet Controller, driven by the e1000 driver. By all accounts, this should be much more stable. So, longer term, we’ll be switching the cabling over to the second NIC.
So, the moral of the story is, don’t build a server which uses the sky2 driver!