
Geograph brought down by sky2 network driver failure

Yesterday’s outage of the Geograph website happened because the network interfaces on all three webservers became unusable after their network driver failed. There are already many references to similar failures, but I thought it would be useful to write about it, if only to give a little more Google-juice to the problem.

Geograph’s three webservers are running Ubuntu 6.06 LTS, regularly updated. The eth0 NIC is a Marvell Technology Group Ltd. 88E8050 Gigabit Ethernet Controller (rev 17), driven by the sky2 driver.

Each of those NICs failed at some point on Sunday, but the servers themselves kept on trucking, eventually writing entries like this to syslog:

Feb 24 16:30:30 scone kernel: [35337220.416000] NETDEV WATCHDOG: eth0: transmit timed out
Feb 24 16:30:30 scone kernel: [35337220.416000] sky2 eth0: tx timeout
Feb 24 16:30:30 scone kernel: [35337220.420000] sky2 eth0: transmit ring 112 .. 89 report=112 done=112
Feb 24 16:30:30 scone kernel: [35337220.420000] sky2 hardware hung? flushing

There are many reports of similar failures. One suggested fix is replacement of the sky2 driver with sk98lin, but as our remote KVM is also down, we’re limited to actions we can reliably take over a network connection (in the short term at least).

So, for some short-term protection against recurrence, I’ve written a simple watchdog script called by cron every 5 minutes. It performs some network connectivity tests, and if they all fail, increments a counter. When the script fails for the 4th successive time, it will attempt to reload the sky2 module, and if that doesn’t work, trigger an immediate reboot. This should mean that a server will enter “radio silence” for around 15 to 20 minutes and then recover. That’s a tolerable delay for a cluster of three servers.
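The watchdog logic described above could look something like the sketch below. It is not the script running on the Geograph servers: the state file path, the ping targets (documentation-range placeholder addresses), the `run` argument and the use of `ifdown`/`ifup` around the module reload are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of a cron-driven network watchdog (illustrative, not the production script).
# Assumed names: STATE_FILE, TARGETS, MAX_FAILURES and the "run" argument.

STATE_FILE="${STATE_FILE:-/var/run/net-watchdog.count}"
TARGETS="${TARGETS:-192.0.2.1 192.0.2.2}"   # placeholder hosts to ping
MAX_FAILURES="${MAX_FAILURES:-4}"           # 4 runs at 5-minute intervals

# Succeeds as soon as any one target answers a single ping.
network_ok() {
    for host in $TARGETS; do
        ping -c 1 -w 5 "$host" >/dev/null 2>&1 && return 0
    done
    return 1
}

# Reload the driver; if the network is still dead afterwards, reboot.
# Bringing eth0 down and up around the reload is an assumption.
recover() {
    ifdown eth0
    modprobe -r sky2
    modprobe sky2
    ifup eth0
    sleep 10
    network_ok || reboot
}

# One watchdog pass: reset the counter on success, count failures otherwise.
watchdog_run() {
    if network_ok; then
        echo 0 > "$STATE_FILE"
        return 0
    fi
    count=$(cat "$STATE_FILE" 2>/dev/null) || count=0
    count=$(( ${count:-0} + 1 ))
    echo "$count" > "$STATE_FILE"
    if [ "$count" -ge "$MAX_FAILURES" ]; then
        echo 0 > "$STATE_FILE"
        recover
    fi
    return 0
}

# Only act when invoked with "run", so the file can be sourced safely.
if [ "${1:-}" = "run" ]; then
    watchdog_run
fi
```

Saved under a hypothetical path such as /usr/local/sbin/net-watchdog, it could be scheduled from /etc/crontab with a line like `*/5 * * * * root /usr/local/sbin/net-watchdog run`. Requiring every target to fail before counting a failure keeps one unreachable host from triggering a reboot.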

Fortunately, we’ve found that the second NIC on these machines is an Intel Corporation 82541GI/PI Gigabit Ethernet Controller, driven by the e1000 driver. By all accounts, this should be much more stable, so longer term we’ll be switching the cabling over to the second NIC.
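To check which driver sits behind each interface before moving any cables, something like the following sketch can help. It follows the driver symlink that the kernel exposes under /sys/class/net. The function names `nic_driver` and `list_nic_drivers` and the `SYSFS_NET` override are made up for this example.

```shell
#!/bin/sh
# Report the kernel driver bound to a network interface by following
# /sys/class/net/<iface>/device/driver. SYSFS_NET is an assumed override
# that makes the functions easy to try against a fake directory tree.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print the driver name for one interface; fail if it has no driver link.
nic_driver() {
    link="$SYSFS_NET/$1/device/driver"
    [ -L "$link" ] || return 1      # virtual interfaces such as lo have no driver link
    basename "$(readlink "$link")"
}

# Print "<iface> <driver>" for every interface that has a driver.
list_nic_drivers() {
    for path in "$SYSFS_NET"/*; do
        iface=$(basename "$path")
        driver=$(nic_driver "$iface") && echo "$iface $driver"
    done
    return 0
}
```

On the webservers described here, `nic_driver eth0` would be expected to report sky2, and the interface on the Intel controller e1000. `ethtool -i eth0` gives the same information where ethtool is installed.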

So, the moral of the story is, don’t build a server which uses the sky2 driver!

Geograph servers coming to life

Geograph's new servers

Well, here they are, racked up and ready to roll. This setup will breathe some fresh life into Geograph, which is struggling to cope with its popularity at the moment. Here are a few factoids about the setup. If anyone wants to know more, just ask and I’ll write about it.

  • The 2U unit is called “Jam” and has six 400GB SATA drives in a RAID5 array with a hot spare, providing 1.6TB of storage (one drive is the spare and one drive’s worth of capacity goes to parity, leaving 4 × 400GB). It also has dual 3GHz Xeon CPUs and 4GB RAM. This machine provides photo storage and the database.
  • There are three 1U units called “Toast”, “Scone” and “Crumpet”, again with dual 3GHz Xeon CPUs and 4GB RAM. These are the main webservers.
  • There is a 1U unit called “Tea” with a single 3GHz Pentium 4 and 4GB RAM. This is primarily a load balancer.
  • We also have a remote power switch and a remote KVM switch, allowing us to perform most maintenance remotely.
  • All the servers run Ubuntu 6.06 LTS, with the exception of “Jam” where we had problems booting from the RAID array after installation. Jam runs Debian Sarge instead.
  • We are moving away from SourceForge for the code and bug tracking, using our own installation of Trac and Subversion instead.
  • The load balancing will be carried out by HAProxy, which will allow us to carry out a very smooth changeover. We’ll simply proxy the old server right up until we’re ready to go live, at which point we’ll have a short period of downtime while we synchronise the databases.

Here are a few more pictures. Exciting stuff: click for full size goodness.

Front of servers
This is the front. Woooo!

Back of the servers
…and here is the back. Check out that neat cabling job!


As soon as we’re ready for the big switchover, we’ll announce it on the site… almost ready now!

Installing VMWare Server Beta on Ubuntu 6.06 Dapper Drake

Edit: There’s now a more thorough guide over at HowToForge

I started with a server install of Ubuntu. To install VMWare Server on top of that you need a desktop, a compiler, kernel headers and a few other odds and ends.

The easy way to get a desktop is to go the whole hog and give a home to a Gnome…

sudo apt-get install ubuntu-desktop

Reboot and experience Ubuntu in all its brown chocolatey glory. Next you need some build tools. None of this stuff comes by default, so go get it…

sudo apt-get install build-essential

The installer wants inetd or xinetd; I went with xinetd…

sudo apt-get install xinetd

Now get your kernel version with uname:

uname -a
Linux hostname 2.6.15-23-server ...

Use that info to fetch the kernel headers you need:

sudo apt-get install linux-headers-2.6.15-23-server
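The two steps above can be combined so the version never has to be typed by hand. This sketch assumes, as on Ubuntu, that the headers package is named linux-headers- followed by the output of `uname -r`. `headers_package` is just an illustrative helper name.

```shell
#!/bin/sh
# Build the headers package name for a kernel release.
# With no argument it uses the running kernel as reported by uname -r.
headers_package() {
    echo "linux-headers-${1:-$(uname -r)}"
}

# Show the install command for the running kernel without executing it.
echo "sudo apt-get install $(headers_package)"
```

The same idea as a one-liner is `sudo apt-get install linux-headers-$(uname -r)`, which keeps working after a kernel update changes the version string.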

I found it useful to go and symlink the headers to the place the VMWare installer expects to find them:

cd /usr/src/
sudo ln -s linux-headers-2.6.15-23-server linux
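Before moving on, a quick check that the pieces from the steps above are in place can save a failed installer run. This is only a sketch: the function names and the `SRC_LINK` variable are invented here, and it assumes the headers tree contains an include/linux directory.

```shell
#!/bin/sh
# Pre-flight check before running vmware-install.pl (illustrative sketch).
# SRC_LINK is an assumed variable naming the symlink created above.
SRC_LINK="${SRC_LINK:-/usr/src/linux}"

# Print the name of each required command that is not on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Succeeds if the symlink resolves to a directory containing kernel headers.
headers_linked() {
    [ -d "$SRC_LINK/include/linux" ]
}

# Run both checks and report the first problem found.
preflight() {
    missing=$(missing_tools gcc make)
    if [ -n "$missing" ]; then
        echo "missing tools:" $missing
        return 1
    fi
    if ! headers_linked; then
        echo "no kernel headers under $SRC_LINK"
        return 1
    fi
    echo "ready to run vmware-install.pl"
}
```

Calling `preflight` after the symlink step reports a missing compiler or a dangling /usr/src/linux link before the installer gets a chance to trip over it.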

Now you are ready to install VMWare Server. Grab the Linux tarball from the VMWare site, unpack it, and run the installer:

sudo ./vmware-install.pl

You should be able to accept the defaults for most questions unless you have specific virtual machine networking needs. Once installed, start the VMWare console from the Applications -> System Tools menu.

Easy!