Last week I worked on what I thought was an open-and-shut storage project for our VM host server: I had decided to replace the CERC RAID controller on our Dell PE1800 because the controller was not playing well with Linux. This was causing a big-time bottleneck, sometimes bringing disk I/O to a complete stop while the controller munched on data. The solution was to replace the controller with a High Point 2320 and swap out the connected disks for new ones, which would give us a good chunk of VM storage on a speedy RAID-50 array.
So last week, I powered down our 10 VMs and moved them to a different server for storage. After all the VMs had transferred, I pulled the CERC controller and its old disks, then installed the 2320 and 8 new HDDs: 6 for RAID-50, one hot spare, and one OS disk that would attach to the PowerEdge’s on-board SATA controller. The project took most of the night, and after 12 hours and a few mugs of coffee, I had our critical production VMs back online on their sleek new array. Disk performance was outstanding compared to what it had been on RAID-5, and as a result the load numbers on the server dropped drastically. I called the project a success.
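For anyone wondering how the six array disks break down: RAID-50 stripes across two or more RAID-5 groups, and each RAID-5 group needs at least three disks, so six disks means two groups of three, with one disk's worth of parity per group. Here is a rough sketch of the usable-capacity arithmetic. The function name is made up for illustration, and the 500 GB disk size is purely a placeholder, not the size of the drives in this build.

```python
def raid50_usable_gb(disks, groups, disk_gb):
    """Usable capacity of a RAID-50 array (RAID-0 stripe over RAID-5 groups).

    Each RAID-5 group gives up one disk's worth of space to parity.
    """
    if groups < 2:
        raise ValueError("RAID-50 needs at least two RAID-5 groups")
    if disks % groups != 0:
        raise ValueError("disks must divide evenly among the groups")
    per_group = disks // groups
    if per_group < 3:
        raise ValueError("each RAID-5 group needs at least three disks")
    return groups * (per_group - 1) * disk_gb


# Six array disks in two groups of three; 500 GB is a hypothetical size.
DISK_GB = 500
print(raid50_usable_gb(6, 2, DISK_GB))  # 2000

# For comparison, a single RAID-5 over the same six disks loses only one
# disk to parity: (6 - 1) * 500 = 2500.
```

So the array trades one extra disk of capacity, compared with a single six-disk RAID-5, for striping across two smaller parity groups.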
A couple of days later while in the server room, I noticed that the blue status light on the PowerEdge had started to flash orange. Thinking it might just be a loose panel somewhere, I inspected the server and found nothing. Some after-hours testing revealed that if I totally removed power from the server, I could get rid of the orange light. However, within a few minutes of running, the light would go back to flashing orange. Now I don’t know how you feel about your car’s "Check Engine" light being on constantly while you’re driving, but I don’t like it at all, and I equated this situation with just that. With no apparent sign of trouble beyond the flashing light, I made a note to run a hardware diagnostic on the server the following week.
Fast-forward a bit to this past Tuesday. I woke up to the sound of my phone ringing. I was planning on going in to work a little late, but my plans were shot down by what Todd told me over the phone. He was unable to access his QuickBooks server, and Exchange was unreachable as well. A quick look at my phone showed alerts from our monitoring service around 5:30 a.m.: Exchange had apparently gone down and never come back up.
Once I got to my desk and fired up a VMware console, I was greeted by a myriad of errors for each VM. Apparently the High Point RAID card was causing I/O issues and PCI errors, enough so that Linux had unmounted the array, instantly killing about 10 VMs. The only good news in this situation was that since this appeared to have happened around 5 a.m., the nightly backups from the previous night would be enough to make Exchange almost completely current. Holding my breath, I restarted the VM host and watched as each VM came back to life. After some looking around, each machine appeared to be in good form. All I could say was "wow." We dodge way too many bullets.
I had to take some time later in the day to get my head around this situation. The prospect of doing another "RAID Transplant" just 2 weeks after changing the drives and controller was very frustrating to think about, but I couldn’t just let it sit. I’m about to go on a 2-week honeymoon. I don’t want my boss to have to call me during that time to ask for help, and I certainly don’t want to come back early to do any fixes.
So, beginning tomorrow night, I’ll be taking our VM host down yet again to change out the RAID controller. The project will start with transferring all of our VMDKs to the backup server just to be safe. Then I’ll be replacing the High Point 2320 with an Adaptec 3805 and, if needed, transferring all the VMDKs back onto the new array. I’ll probably finish up by Sunday evening (I plan on taking a break to fish in the afternoon).
Usually I feel pretty good about fixes, and maybe it’s just the looming "deadline" I have coming up, but I feel like I’m about to throw a Hail Mary pass on this one. The blinking orange light and the errors only started after the High Point card was put in, but I’ve seen stranger coincidences. If we still wind up getting PCI errors after the card is swapped out, we’ll need a bigger, badder solution.