Archive for August, 2008

Filed Under (Uncategorized) by Dave Mast on August-7-2008

I could have sworn I posted this earlier in the week,  but I don’t see it in my blog anywhere.  So, here it is for real.

Last week we were in what I would call a “very dangerous position”: the RAID controller on our VM host began throwing PCI parity errors, which REALLY doesn’t go over well with our Linux host OS.  This past Saturday night, I was able to take the host machine down and make it right.  I shut our VMs down and copied them over to a separate disk array to keep them safe.  Once that was done, I went ahead and swapped the High Point RR2320 card out for an Adaptec 3805.  The Adaptec card got very good reviews from other Newegg customers, and the price/features balance was a deal maker.

After getting the build started on the new RAID 50 array (w00t for background builds), I reinstalled Fedora 8, formatted the new array, and started bringing our VMs back.  After installing VMware Server and the web UI, I took a deep breath and pressed the start button to begin bringing the VMs online.  I got more and more relieved as each VM came back to life, and I probably broke out in a cheer once we were 100% back online.  All that was left after that was to install NRPE so that our Nagios box could monitor the health of the VM host.
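One thing an NRPE check on the host could watch is whether the VM array is still mounted. Here is a minimal sketch in the usual Nagios plugin style (exit code 0 for OK, 2 for CRITICAL). The function names and the `/vmstore` mount point used below are assumptions for illustration, not anything from the actual setup.

```python
# Conventional Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def is_mounted(mounts_text, mount_point):
    """Return True if mount_point is a mount target in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line reads: device mount_point fstype options dump pass
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


def check_mount(mounts_text, mount_point):
    """Return (exit_code, status_line) for a Nagios-style check."""
    if is_mounted(mounts_text, mount_point):
        return OK, f"OK - {mount_point} is mounted"
    return CRITICAL, f"CRITICAL - {mount_point} is not mounted"
```

A small wrapper would read `/proc/mounts`, call `check_mount(text, "/vmstore")` (a hypothetical path), print the status line, and exit with the returned code so NRPE can pass the result back to Nagios.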

Some thoughts from this project…

- The timing couldn’t have been any better… since there was no church on Sunday, I was able to take 75% of our systems down with minimal impact on our users.

- I was REALLY wrestling with the thought of trying ESXi out on the host instead of Fedora… I imagine it would have worked.  However, the ability to monitor the host’s hardware with Dell OpenManage trumps that.  Although OM is not supported on Fedora, it will run if all the dependencies are satisfied.

- I’m glad we’ve got a coffee machine in the church…I wouldn’t have made it through the night otherwise.

- My initial plan was to install CentOS on this server, but I had problems with GRUB hanging at boot time after the installation.  I went back to Fedora because I just didn’t have time to mess around.

- I’m very thankful that my fiancée (or wife, depending on when you read this) understands what I do and accepts the fact that I’m passionate about this stuff.

- Our server room’s AC works very well… almost too well.  Again, hot coffee was a plus.



Filed Under (Uncategorized) by Dave Mast on August-6-2008

I love being able to take a break from ministry to rest up and get rejuvenated.  This year is going to be even better due to the fact that I will be going on vacation with my wife. :-)  We’re planning on going up to Maine for a week.  Nice walks, cool weather, and probably more lobster than we can handle…. I can’t wait!

This year is quite a bit different because it’s the first year I’ve been out of the state and totally out of driving range of the office.  As you might expect, my boss is a little nervous about whether or not things stay running while I’m gone — I can’t say I blame him…this past week had me concerned as well.  Everything appears to be in good shape now though, and I feel much better about leaving than I did 5 days ago.  UPS shutdown scripts and backups are checking out good, and Todd’s got the documentation he needs to hopefully solve any situations that come up while I’m gone.

I’m ready to go on this vacation for quite a few reasons…first and foremost since Jess and I are getting married this weekend, this vacation is doubling as our honeymoon.  I can’t wait to be able to spend this time with her as we officially start our life together.  (I think the drive up there is going to be 1/3 the fun.)  I’ve never been to Maine, but I know we’re going to have a fantastic time.

Secondly… I’m ready to get away.  I would be lying if I said I haven’t been feeling a lot of stress lately or if I said my brain wasn’t ready to reduce itself to a pile of mush.  There’s been a whole lot on my mind the past few weeks.  I need to put all that aside for a while and unplug my brain as much as I can: no work-related computer use (except if Todd calls), no ActiveSync, and probably no Twitter or blogging either (as if THAT will make a dent in my posting frequency).  My goal at this point is to try as hard as I can to remain “dark” for at LEAST the first 11 days.  Is that possible…I don’t know.  Am I going to try…you bet.

Lastly, I’m ready to rest.  Maybe you can relate to this and maybe you can’t, but I have a VERY hard time shutting my brain down.  And I’ve not been sleeping well lately. At all.  Between wedding prep, moving out of my apartment, and late-night fixes on our infrastructure… man… my body has not handled it well at all.  Obviously this goes hand-in-hand with the stress that I’ve been feeling.  Either way, I’m ready to relax.

All this to say that I’m very excited to be starting on this journey this weekend.  I’ve prepped about as much as I know how, and gotten as much tied up as I can before leaving.  There’s still plenty of things to do, but all those things will still be here when I return.



Filed Under (Uncategorized) by Dave Mast on August-2-2008

I forget exactly how it was that I stumbled onto the Stuff Christians Like blog, but I thoroughly enjoy reading it.  Jon’s posts expose Christianese tendencies that just about all of us are familiar with.  Most of the time they’re funny, and I dig that.  Sometimes they take a serious turn, and I dig that as well.

I thought a recent post of Jon’s had some IT flavor to it, so I decided to reference it here: "Holy quotes at the end of emails"



Filed Under (Uncategorized) by Dave Mast on August-2-2008

Last week I worked on what I thought was an open-and-shut VM storage project for our VM host server: I had decided to replace the CERC RAID controller on our Dell PE1800 because the controller was not playing well with Linux.  This was causing a big-time bottleneck, sometimes bringing disk I/O to a complete stop while the controller munched on data.  The solution was to replace the controller with a High Point 2320 and the connected disks with new ones, which would give us a good chunk of VM storage on a speedy RAID-50 array.

So last week, I powered down our 10 VMs and moved them to a different server for storage. After all the VMs had transferred, I pulled the CERC controller and the old disks that were with it and installed the 2320 and 8 new HDDs: 6 for RAID-50, one hot spare, and one OS disk that would attach to the PowerEdge’s on-board SATA controller. The project took most of the night, and after 12 hours and a few mugs of coffee, I had our critical production VMs back online on their sleek new array.  Disk performance was outstanding compared to what it was on RAID-5 and as a result, the load numbers on the server dropped drastically. I called the project a success.
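For a sense of what 6 disks in RAID-50 buys: RAID-50 stripes across two or more RAID-5 groups, and each group needs at least three disks, so six disks can only be split into two groups of three. Each group gives up one disk's worth of space to parity, leaving four disks' worth of usable storage. The post doesn't state the disk size, so this sketch counts disks rather than gigabytes; the function name is just for illustration.

```python
def raid50_usable_disks(total_disks, groups=2):
    """Count the data-bearing disks in a RAID-50 array of equal RAID-5 groups."""
    if groups < 2:
        raise ValueError("RAID-50 stripes across at least two RAID-5 groups")
    if total_disks % groups != 0:
        raise ValueError("disks must divide evenly among the groups")
    per_group = total_disks // groups
    if per_group < 3:
        raise ValueError("each RAID-5 group needs at least three disks")
    # Each RAID-5 group loses one disk's worth of capacity to parity.
    return groups * (per_group - 1)
```

Calling `raid50_usable_disks(6)` gives 4. A single six-disk RAID-5 would leave five disks usable, so RAID-50 trades one disk of capacity for striping across two smaller parity groups.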

A couple days later while in the server room, I noticed that the blue status light on the PowerEdge had started to flash orange. Thinking it might just be a loose panel somewhere, I inspected the server and found nothing. Some after-hours testing revealed that if I totally removed power from the server, I could get rid of the orange light.  However, within a few minutes of running, the light would go back to flashing orange.  Now I don’t know how you feel about your car’s "Check Engine" light being on constantly while you’re driving, but I don’t like it at all, and I equated this situation with just that.  With no apparent sign of trouble beyond the flashing light, I made a note to run a hardware diagnostic on the server next week.

Fast-forward a bit to this past Tuesday.  I woke up to the sound of my phone ringing. I was planning on going in to work a little late, but my plans were shot down by what Todd told me over the phone.  He was unable to access his QuickBooks server, and Exchange was unreachable as well.  A quick look at my phone showed alerts from our monitoring service around 5:30am…Exchange had apparently gone down and never came back up.

Once I got to my desk and fired up a VMware console, I was greeted by a myriad of errors for each VM.  Apparently the High Point RAID card was causing some I/O issues and PCI errors, enough so that Linux had unmounted the array, instantly killing about 10 VMs.  The only good news in this situation was that, since this appeared to have happened at 5am, the nightly backups from the previous night would be enough to make Exchange almost completely current.  Holding my breath, I restarted the VM host, and watched as each VM came back to life.  After some looking around, each machine appeared to be in good form. All I could say was "wow." We dodge way too many bullets.

I had to take some time later in the day to get my head around this situation.  The prospect of doing another "RAID Transplant" just 2 weeks after changing drives/controller was very frustrating to think about, but I couldn’t just let it sit.  I’m about to go on a 2-week honeymoon.  I don’t want my boss to have to call me during that time to ask for help, and I certainly don’t want to come back early to do any fixes.

So, beginning tomorrow night, I’ll be taking our VM host down yet again to change the RAID controller out.  The project will start with transferring all of our VMDKs to the backup server just to be safe.  Then, I’ll be replacing the High Point 2320 with an Adaptec 3805 and transferring all the VMDKs back onto the new array (if needed).  I’ll probably finish up by Sunday evening (I plan on taking a break to fish in the afternoon).
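When the safety copy is the only fallback, it helps to confirm the copies match before touching the original array. Below is a rough sketch that hashes every `.vmdk` file in a source directory and compares it with its counterpart in the copy. The function names and directory arguments are hypothetical, and it assumes the VMs are powered off so the files aren't changing underneath it.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large disk images never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(source_dir, copy_dir, pattern="*.vmdk"):
    """Return relative paths of source files that are missing or differ in copy_dir."""
    mismatches = []
    for src in sorted(Path(source_dir).rglob(pattern)):
        rel = src.relative_to(source_dir)
        dst = Path(copy_dir) / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            mismatches.append(str(rel))
    return mismatches
```

An empty list from `find_mismatches` means every VMDK found under the source has an identical copy. Hashing large images takes a while, but it is cheap next to discovering a bad copy after the old array has been wiped.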

Usually I feel pretty good about fixes, and maybe it’s just the looming "deadline" I have coming up, but I feel like I’m about to throw a Hail Mary pass on this one.  The blinking orange light and the errors only started after the High Point card was put in, but I’ve seen stranger acts of coincidence.  If we still wind up getting PCI errors after the card is swapped out, we’ll need a bigger, badder solution.



