OT, Hardware: HP Smart Array Drive Issue


Hi. Anyone working with these things? I’ve got a drive in “predictive failure” in a RAID 5. Now here’s the thing: there was an issue yesterday when I got in, and I wound up power cycling the RAID enclosure. On first boot, the attached server had issues: it said the controller had a failure and a drive had failed, and it wouldn’t continue booting. When I gave it the three-finger salute, this time on the way up, during POST, it noted the controller issue… but the thing came up, looking like it did a couple of days ago.

Trying to prevent this from happening again, I’ve decided to replace the drive that’s in predictive failure. The array has a hot spare. I tried to remove the drive using hpacucli, but it refuses with “operation not permitted”, and there doesn’t *seem* to be a “mark as failed” command. *Do* I just yank the drive?
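
For the record, here’s roughly what I’ve been poking at it with (the controller slot and the 1I:1:3 drive ID below are examples from memory, not necessarily the real ones):

    # show the whole controller/array/drive layout
    hpacucli ctrl all show config

    # detail on the drive flagged with predictive failure
    hpacucli ctrl slot=0 pd 1I:1:3 show detail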

mark

4 thoughts on - OT, Hardware: HP Smart Array Drive Issue

  • Jason Warr wrote:

    Thanks for your quick reply, Jason. I’m used to LSI/MegaRAID/PERCs, where you have to fail the drive first. Oddity: I had the drive out for more than five minutes while getting it out of the sled, putting the new one in, oh, and dusting out the slot (gotta do that for all of them next maintenance window), but after I put in the replacement and used hpacucli to check, to my surprise it was rebuilding onto the replacement, *not* onto the spare.
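
    For anyone curious, this is more or less how I checked (the slot number is an example):

        # per-drive status; the rebuilding member shows up as such
        hpacucli ctrl slot=0 pd all show status

        # logical drive status, including rebuild progress
        hpacucli ctrl slot=0 ld all show status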

    mark

  • It has been a while since I have used a spare, but what might have happened is that the spare went back to being a spare when the real drive was replaced. It seems to me that is the default behavior, since a spare can be attached to more than one RAID group. That way it keeps your physical drive placement consistent.
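
    Something like this, if I remember the hpacucli syntax right (the slot and drive ID are made-up examples):

        # attach the same physical drive as a spare to two arrays
        hpacucli ctrl slot=0 array A add spares=2I:1:6
        hpacucli ctrl slot=0 array B add spares=2I:1:6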

  • HP’s RAID controllers appear to have some logic whereby, if the rebuild to the spare disk has not yet reached 50% when you insert the replacement, the controller will abandon the rebuild to the spare and rebuild to the replacement instead.

    I don’t have any documentation to prove it, but I have observed it numerous times.
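
    If you want to watch it happen, polling the logical drive status shows the rebuild percentage; something along these lines (the controller slot is an example):

        # poll rebuild progress every 30 seconds
        while true; do
            hpacucli ctrl slot=0 ld all show status
            sleep 30
        done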

    Thomas

  • Hi Mark, I’ve never had any problem just pulling and replacing drives on HP hardware with the hardware RAID controllers (even the icky cheap one that came out around the DL360/380 Gen 8 timeframe, that isn’t really hardware RAID and needs closed drivers in Linux).

    That said, I also *test it*, long before putting anything important on them…

    From past experience with HP stuff, it usually won’t move the data over to the hot spare (especially if it’s a “Global” hot spare and not specific to that array) until an actual failure occurs. “Predictive failure” isn’t considered a failure in HP’s world. I don’t think there is any setting to tell the controller to move to the hot spare if there’s a “predictive failure”.

    I’ve also had disks that triggered a “predictive failure” under heavy load that were simply popped out and back in, and the controller rebuilt them, and the drive never did it again for *years*. The error rate needed to trigger a “predictive failure” is pretty low.

    That last one is more a question of policy than anything. How much do you trust it? At one employer the game was to pop out and back in any drive that showed “predictive failure” on HP systems (Dell stuff we handled differently at the time; it was less prone to false alarms, so to speak), and if a drive did it again “soonish”, we’d call for the replacement disk. That’s how often the HP controllers did it: in a rather large farm of HP stuff, I popped and replaced an HP drive a week, whenever I happened by the data center.
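
    If you’d rather not catch those by walking past the blinking lights, a cron’d one-liner along these lines works (the status string is from memory, so test it first, and admin@example.com is obviously a placeholder):

        # alert if the controller has flagged any drive "Predictive Failure"
        hpacucli ctrl all show config | grep -i 'predictive failure' \
            && echo "drive in predictive failure" | mail -s "RAID warning" admin@example.com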

    As for the question of whether you should be able to do it safely or not… if a hardware RAID controller won’t let me yank a physical drive out and shove another one in and rebuild itself back to whatever level of redundancy was defined by me as “nominal” for that system, I don’t want it anyway. Look at it this way… if the disk had a catastrophic electronics failure while installed in the array, the array should handle it… yanking it out is technically nicer than some of the failure modes that can affect the busses on the backplane with shorted electronics. (GRIN)

    Just sharing my thoughts… your call. :-) YMMV. We had a service contract at that place and a new disk was always just a phone call away and no additional $, and even with that level of service, we always did the “re-seat it once” thing. We’d log it and if anyone else saw that same disk flashing the next time they were at the data center (we just looked at the logged ones before doing the “re-seat”), they’d make the phone call and the service company would drop a drive off a few hours later.


    Nate Duehr denverpilot@me.com