My blog was down for a good six hours yesterday, and it could easily have been down for a lot longer. Sometimes I don't follow my own advice, and this outage could very easily have been avoided.
To put this story into context, you need to know where I moved my blog a few months back. It was originally hosted on uk2.net's servers, where I used to work. While that hosting was always reliable, they had no IPv6 roll-out plans, and I wanted to test IPv6 both on my site and in my core network. So I moved my blog into the lab at my current job, and it's been running there ever since.
While we religiously monitor all our production equipment, we don't monitor lab equipment. You can probably already see where this is going.
My lab server is an HP DL380 G5 with two 3GHz quad-core CPUs and 8GB of RAM. It has a Smart Array controller with two 72GB disks in RAID 1 and six 72GB disks in RAID 5. It runs VMware ESX 4.0 and is controlled by a separate, monitored vCenter server elsewhere in the network. We use a generational backup application that backs up VMs through the vCenter server.
So with all that out of the way, let's see what happened.
It all started a few days ago when my NOC told me that the backup application was reporting an error when attempting to back up the lab server. I checked that my blog was still running, and it was. However, we could not log onto the ESX server directly via the vSphere client, or even over SSH. We decided to restart the management tools on the ESX host, as that seemed to be the issue: the VMs were running, but we couldn't manage the box.
Now, I really should follow my own advice. At this point I should've logged onto my blog, dumped the database, and gzipped /var/www along with the dump. But I didn't. I stupidly assumed a simple restart of the management tools would fix the problem.
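For the record, the quick manual backup I should have taken is only a couple of commands. Here's a minimal sketch: the database name, output paths, and mysqldump options are illustrative, not my actual setup, and the guards are only there so the script degrades gracefully on a box without MySQL or /var/www.

```shell
#!/bin/sh
# Sketch of a quick "before you touch anything" blog backup.
# Database name "blogdb" and the /tmp paths are examples only.
set -eu
STAMP=$(date +%F)                      # e.g. 2010-09-18
DUMP="/tmp/blogdb-$STAMP.sql"
ARCHIVE="/tmp/blog-backup-$STAMP.tar.gz"

# 1. Dump the database (placeholder file if mysqldump isn't installed here)
if command -v mysqldump >/dev/null 2>&1; then
    mysqldump --single-transaction blogdb > "$DUMP"
else
    : > "$DUMP"
fi

# 2. Gzip the web root together with the dump
if [ -d /var/www ]; then
    tar czf "$ARCHIVE" /var/www "$DUMP"
else
    tar czf "$ARCHIVE" "$DUMP"
fi
echo "backup written to $ARCHIVE"
```

Two files to copy off-box, a couple of minutes of work, and the rest of this story would have been a non-event.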
I connected a monitor and keyboard to my ESX host and was greeted with the familiar screen telling me to manage it remotely via the IP. I opened a new console window with alt+f2 and attempted to log in. I typed my username and pressed enter, only for it to sit there forever without ever asking for my password. OK, let's try a reboot.
Bad move. On the reboot, nothing happened. The screen stayed blank, but the server's fans were spinning. Let's try unplugging the server for 10 minutes and plugging everything back in. This time the fans started up at full speed, there was still nothing on the monitor, and the fans never spun down.
I've seen this type of problem before, and it's usually a dodgy piece of kit or something not plugged in correctly. So one by one I started removing components: switch on, switch off, insert a component, remove another, repeat…
Eventually it turned out that the server would not start with two of the RAM modules installed. Fine. I set those aside and started the server with 6GB.
Now the server attempts to start, but on boot the array controller complains that the RAID 5 array is not healthy. I insert my SmartStart CD and open the Array Configuration Utility. There I can see the RAID 5 array rebuilding: it has decided that one of the disks in the array is new. Note that at this point I've not actually touched any disks. So I let the rebuild continue, and it steadily gets to 89%, then drops back down to 0% with the alert 'waiting for rebuild'.
I check the stock room and find another 10K serial ATA 72GB 2.5″ disk and stick that in. My array starts rebuilding again.
And then stops at 89% again…
OK, so the array won't rebuild, but in theory my data should be fine with one disk missing for now. So I pull the new disk back out.
I start the server, ignore the complaints, and VMware ESX begins to boot; ESX itself is installed on the unaffected RAID 1 array. It eventually boots, only for the EXACT same problems as before to start again. I can't log in via the console. I can't log in remotely. I can't do anything, really. Not only that, but I'm getting all kinds of SCSI and VMware errors on the console screen. I punch the VMware error into Google, only for zero results to come up. Fantastic.
It's at this point that I start to really worry. I noted above that the NOC said we were backing this server up, but what exactly was getting backed up? We had a look, and the only thing being backed up was esx.conf. That's right: one single configuration file and nothing else. It makes sense really, as this is a lab server.
So thinking that I’m now going to lose all the work I’ve put into the blog, I need to get a bit creative.
I wanted to see if I could boot into a live disc and 'see' the second array; that way I might at least be able to get my .vmdk files back. I tried Bart's PE disc (Windows-based) and, through diskpart, was able to see my 350GB RAID 5 volume. However, the partition type was unknown.
Note: I really should've had a better live CD to hand, one that can actually read VMware-formatted disks. Any suggestions? Please comment!
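One candidate I've since come across, though I haven't tried it on this box, so treat it as a hint rather than a recipe: the open-source vmfs-tools package ships a read-only FUSE driver that can mount a VMFS partition from an ordinary Linux live environment. The device and mount-point names below are examples only:

```shell
#!/bin/sh
# Sketch: mount a VMFS datastore read-only using vmfs-tools.
# /dev/sdb1 is an example device; identify the real one with fdisk -l first.
MNT=/mnt/vmfs
if command -v vmfs-fuse >/dev/null 2>&1; then
    mkdir -p "$MNT"
    vmfs-fuse /dev/sdb1 "$MNT"   # read-only FUSE mount of the datastore
    ls "$MNT"                    # VM folders containing the .vmdk files
else
    echo "vmfs-tools not installed (on Debian/Ubuntu: apt-get install vmfs-tools)"
fi
```

Read-only is exactly what you want in a recovery situation anyway: copy the .vmdk files off, and nothing you do can make the array's state any worse.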
OK, so now I'm getting desperate. It's also getting close to 18:00, and I really don't want to spend my entire Friday night trying to recover this.
So what else can read a VMware-formatted disk? Well, VMware, of course! My final goal was to reinstall ESX 4.0 from the original disc onto the first array, and hope and pray that it would actually manage to read the second array with all my VMs.
I couldn't find the original disc, so I downloaded the .iso from VMware. I had no blank DVDs, so I used UNetbootin to extract the iso onto a USB stick. This doesn't work: it gets halfway through installing, only to complain that the disc is not in the drive. I finally manage to locate the original DVD, only for the server to refuse to boot from it.
This was a stupid one to miss: someone had swapped out the DVD drive for a CD drive, and hence the drive could not read the DVD. Pop a DVD drive in, and it finally starts installing VMware.
VMware installed, reboot. I'm now finally able to log into the ESX box remotely! However, there are no VMs in the inventory. OK, but I do see my 350GB disk as an available datastore, so let's right-click and browse.
And I see all my folders :) Right-click my blog vm, add to inventory. Start that beast up.
mellowd.co.uk/ccie is FINALLY back online!!!
I immediately set up a .vmdk-level backup of my blog and ran it. I also did a MySQL dump of my database along with a file-level backup. The .vmdk backup is now set to run daily. This really should've been done from the start!
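For anyone setting up the same thing, the file-level half of this is trivial to automate with cron. An example crontab fragment, where the schedule and script path are hypothetical placeholders for whatever dump-and-tar script you use:

```shell
# Example /etc/cron.d/blog-backup entry: run the backup daily at 02:00 as root.
# /usr/local/bin/blog-backup.sh is a placeholder for your own script.
0 2 * * * root /usr/local/bin/blog-backup.sh
```

The .vmdk-level backup still has to come from whatever backs up your hypervisor, but a nightly dump like this means the worst case is losing a day of posts, not the whole blog.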
The server still refuses to rebuild the raid 5 array so the server is not in perfect health. I will be moving the VM to a properly monitored box shortly.
All in all I had dodgy RAM, a faulty RAID 5 array, and a borked VMware install. My guess is that the dodgy RAM was to blame for VMware getting so messed up.
Oddly enough, none of the alert lights on the front of the HP server were showing red.