Archive for the ‘power’ Category
This past Saturday night I was able to get a script working on our servers that would take them all down gracefully in the event of a power outage. This was in response to a previous blunder on my part that had allowed every last server in our building to go down hard (although nothing was damaged in the event).
Fast forward to this Tuesday (earlier this week). Our area got rocked by a couple of huge storms. About 3/4 of the way through the first storm, the lights in the office flickered, dimmed, and finally went out. I didn’t think the UPS script would get tested this quickly.
When I flipped open the KVM on our server rack, I was pleased to see that a countdown was already running on each machine. The UPS software had run its script, and now each server was about 60 seconds away from shutting down automatically. When it was all said and done, every last server shut down on its own, with plenty of battery life to spare.
The only tweak I’ve ended up making to this process so far involved our physical domain controller (we have two, and the other one is a VM). It resides in one of the IDFs and it shut down too quickly in response to the power outage. As a result, after the VM-based DC went down, the remaining servers had no DC to talk to, and thus took longer to shut down cleanly. All in all though, the real-life test proved successful, and as a result, I have one more reason to sleep better at night.
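For anyone facing the same ordering problem, here is a rough sketch of one way to keep a domain controller up longer than everything else: give the DCs a longer shutdown countdown than the member servers. This is not the actual tweak described above, only an illustration. The function name, the server names, and the 60-second and 120-second values are all made up for the example.

```python
def plan_countdowns(servers, domain_controllers, base=60, dc_extra=120):
    """Return {server: countdown_seconds}.

    Domain controllers get extra time so the other servers can still
    reach a DC while they are shutting down. All values are illustrative.
    """
    # Compare names case-insensitively, since Windows host names are.
    dcs = {name.lower() for name in domain_controllers}
    return {
        server: base + (dc_extra if server.lower() in dcs else 0)
        for server in servers
    }
```

With hypothetical names, `plan_countdowns(["FS1", "DC1"], ["dc1"])` gives FS1 a 60-second countdown and DC1 a 180-second one.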
|
After tinkering around with PowerAlert a little more tonight (yeah, I know, it’s Saturday), I stumbled onto some interesting things about the program, and ultimately got it working the way I want.
First, as far as executing command scripts: You need to make sure that the file, even if it’s a CMD or BAT file, has its permissions set properly. PowerAlert attempts to run the scripts as SYSTEM (<LOCALMACHINE>\SYSTEM), so not only did I need to set permissions on the script files to reflect that, I also had to set permissions on psshutdown.exe as well.
After doing that, things worked like a charm. Once the UPS loses power, the monitoring server starts a 2-minute timer. At the end of those 2 minutes, PsShutdown starts, reads a text file containing the names of the servers to shut down, and sends the shutdown command to each one. Plus, if the power happens to kick back on within those 60 seconds, a second script runs that calls PsShutdown to cancel the previous shutdown commands.
Since our firewalls (we run pfSense) are running on PCs as well, one of the next steps will be to find a utility that can automatically SSH or telnet into each firewall and shut it down as well.
There are still a couple of minor tweaks that I want to make, but they will have to wait until the UPS battery is back up to 100%. For now though, I’m pretty happy that PowerAlert is working like it’s supposed to. AND, I feel a lot better knowing that the servers are going to take themselves down gracefully next time we lose power.
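The real scripts here are CMD/BAT files, but the logic is easy to sketch in Python. The example below only builds the PsShutdown command lines for the "shut down" and "cancel" cases from a server list. The list format (one name per line), the function names, and the 60-second countdown are assumptions for illustration. The flags used are `-k` (power off), `-f` (force applications closed), `-t` (countdown in seconds) and `-a` (abort a pending shutdown), as I understand PsShutdown's usage text, so check them against your own copy.

```python
from pathlib import Path

COUNTDOWN_SECONDS = 60  # assumed countdown; not taken from the real script


def read_servers(list_path):
    """Read server names from a text file, one per line.

    Blank lines and lines starting with '#' are skipped.
    """
    names = []
    for line in Path(list_path).read_text().splitlines():
        name = line.strip()
        if name and not name.startswith("#"):
            names.append(name)
    return names


def shutdown_command(server, countdown=COUNTDOWN_SECONDS):
    """Build the PsShutdown call that starts a countdown on one server."""
    return ["psshutdown.exe", "-k", "-f", "-t", str(countdown), rf"\\{server}"]


def abort_command(server):
    """Build the PsShutdown call that cancels a pending shutdown."""
    return ["psshutdown.exe", "-a", rf"\\{server}"]


def build_commands(list_path, abort=False):
    """Return one command per server: shutdown by default, abort if asked."""
    make = abort_command if abort else shutdown_command
    return [make(server) for server in read_servers(list_path)]
```

Each returned list can be handed to `subprocess.run` on the monitoring server. Building the commands separately from running them makes it possible to print and review them before trusting them with a real outage.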
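As a starting point for the firewall piece, something like the sketch below might do. It only builds the `ssh` command line for each firewall; nothing here has been tested against pfSense. The `root` login, the key-based authentication implied by `BatchMode=yes`, the host names, and the function names are all assumptions. The remote command is the FreeBSD `shutdown -p now`, which halts and powers off the machine.

```python
def firewall_poweroff_command(host, user="root"):
    """Build an ssh call that asks a FreeBSD-based firewall to power off.

    BatchMode=yes makes ssh fail rather than stop and prompt for a
    password, which matters when nobody is around to type one.
    """
    return [
        "ssh",
        "-o", "BatchMode=yes",
        f"{user}@{host}",
        "/sbin/shutdown -p now",
    ]


def firewall_commands(hosts, user="root"):
    """Return one power-off command per firewall host."""
    return [firewall_poweroff_command(host, user) for host in hosts]
```

Each command list could then be passed to `subprocess.run` from the same monitoring server that handles the Windows shutdowns.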
|
It seems like things I’ve put on the back burner since the move have started to move quickly towards the edge of the stove on their own…
During my drive home from Granger on Wednesday evening, my phone started receiving text messages stating that either our Exchange Server had gone down, or our internet connection had been totally hosed. Minutes passed, then an hour, and I didn’t receive any notification that things had come back up. Finally, about 90 minutes later, my Q started buzzing again to tell me that connectivity had been restored. Relieved that things appeared to be back to normal, I continued my trek back to Dover.
The next day I came in to work and found no signs that anything was wrong. Servers were working properly, my phone didn’t have any voicemail on it, and it seemed that our infrastructure HADN’T turned into a flaming ball of molten aluminum while I was gone. However, when I logged into our backup server to work on our CommVault configuration, I got hit with a prompt asking me to explain an unexpected shutdown that took place on Wednesday night.
As I logged onto each server and looked at the event logs and did some asking around, it became pretty apparent what had happened: The notices that I received on the way home from GCC were actually a result of our server rack’s UPS running out of juice after a lengthy power outage. Our servers, every last one, had gone down. Hard. Ouch.
How EVERY server managed to start back up with no errors is beyond me…WAY beyond me. But needless to say, the once-back-burner task of automating a shutdown process for our servers has come straight to the front.
I’ve spent the past 2 days dorking around with Tripp-Lite’s PowerAlert software. My plan is to have it execute a PsTools command called PsShutdown, which has the ability to shut down any Windows machine on your network remotely. So far though, I’ve yet to see any evidence of PowerAlert even trying to run the script. I’m going to mess with it a little more and then give Tripp-Lite’s tech support a call. In any event, I found myself thanking God yesterday that we were able to get off so easy from my mistake of doing IT “on the edge” like that. Things could have been much MUCH different on Thursday, in which case I might still be in the server room right now.
|
Well, after playing around with the RAID array throughout the wee hours of the morning, it’s pretty apparent that something went seriously wrong. A massive power flux? A dicey hard drive? I really don’t know at this point. S.M.A.R.T. status on all the drives shows that they’re running just fine. So far 2 ideas are floating around in my head:
- The system suffered a massive power fluctuation that totally ticked off the Mac, or the RAID unit, or both.
- There is a major compatibility problem with the RAID unit and the HighPoint Technology card that I had to use in place of the bundled controller card. The only thing I can think of is that there might be 2 different chipsets between the RAID unit and the controller that don’t like each other at all.
Either way, I’m glad this problem decided to rear its head NOW instead of later, when we’ve got the drive populated with irreplaceable data.
Speaking of which, it just so happens that most of the files that were on that RAID5 array are still sitting in other areas! THAT is letting me breathe so much easier right now. However, there were quite a few Final Cut project files that were only on that array, which is still a bummer. I’m in the process of looking through data recovery software to see if there’s anything decent that I can try.
In the meantime, I’ve got a UPS set to hook into that system immediately. Plus, if I can’t make the array work after another rebuild, I’m going to set a separate PC up there and connect it to the Mac Pro. Since the PC has a PCI slot on it, I can use the Norco’s bundled controller card. I’m not exactly thrilled about putting a PC up there JUST to act as a bridge between the editing system and the RAID, but I may find that I have no choice.
More updates as the plot unfolds.
|
Last night around 6:00 we had a small power outage in our building. This is the first time that we’ve lost power at all since the building was under construction, and it lasted for less than a second. I wasn’t worried at all about our servers or phone system because everything network-related (servers, switches, both firewalls, cable modem, etc.) is on a UPS. I knew our desktops would receive a hard reboot, but just this once… shouldn’t hurt them, right?
Well, I came in this morning to find that 3 of those desktops weren’t doing too well. All 3 of them were EXTREMELY sluggish once their respective users logged in. I took a look at the processes running, and saw that svchost.exe was using over 90% of the CPU. Very odd. I let the machines sit for a bit, and much to my relief they calmed down after a while. The event logs haven’t turned up anything definite.
After looking through MS’s support site, I’ve got a few things to check out, although I do have reservations about trying to replicate the incident. Was this incident related to the power loss? It would seem so, given the order of events. Intentionally cutting power on PCs is not my cup o’ tea though…perhaps we’ll try it on one of the cold spares. :-) If the incident WAS power-related, it really strengthens the case for setting a “UPS for every PC (or Mac)” policy. Under the right circumstances, dirty power could bring about a support nightmare for me and K. I’ve been hoping to avoid the UPS issue until next year, as we’re trying to run lean in 2007. Hopefully it can stay that way until 2008.
|