Hardware Raid Health?

I just had an IBM server in a remote location with a hardware RAID 1 lose both drives. With local machines I probably would have caught it from the drive light before the second one died… What is the state of the art in Linux software monitoring for this? Long ago, when that box was set up, I think the best I could have done was a Java GUI tool that IBM had for their servers – and that seemed like overkill for a simple monitor. Is there anything more lightweight that knows about the underlying drives in a hardware RAID set on IBMs – and also on recent HP servers?

6 thoughts on - Hardware Raid Health?

  • IBM used LSI-based controllers, I believe.

    For our monitoring, we wrote a little script that calls MegaCli64 every
    30 seconds and checks for changes. If anything of note changes (drive
    health, BBU/FBU issues, temperature issues, etc.) it sends us an email.
    A bare-bones sketch of the idea is at the end of this reply. It would
    be fairly easy to do the same for hpacucli, I would imagine.

    Unfortunately, though it’s all open source, it’s part of a package that
    monitors a pile of things (including IPMI sensors, APC UPSes, the Red
    Hat HA stack, etc.), so it wouldn’t be drop-in-and-go. That said, you
    could probably strip it down fairly easily if you wanted to use it.

    If you’re curious, I show how to set it up here. If you’re comfortable
    with Perl, it’ll be pretty easy to adapt, I suspect.

    https://alteeve.ca/w/AN!Cluster_Tutorial_2#Setting_Up_Alerts

    Cheers
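
    A bare-bones sketch of that polling loop, for the curious – this is
    not the actual script from the tutorial, and the MegaCli64 path and
    mail address are assumptions:

    #!/bin/bash
    # Sketch only: snapshot controller state and mail a diff when it changes.
    # Assumes MegaCli64 is at this path and a local MTA handles mail(1).
    MEGACLI=/opt/MegaRAID/MegaCli/MegaCli64
    STATE=/var/run/megacli.state
    ADMIN=admin@example.com   # hypothetical address

    while true; do
        # Logical drives, physical drives, and BBU status in one snapshot.
        "$MEGACLI" -LDInfo -Lall -aALL -NoLog            >  /tmp/megacli.now
        "$MEGACLI" -PDList -aALL -NoLog                  >> /tmp/megacli.now
        "$MEGACLI" -AdpBbuCmd -GetBbuStatus -aALL -NoLog >> /tmp/megacli.now

        # Mail the diff whenever anything changed since the last poll.
        if [ -f "$STATE" ] && ! diff -q "$STATE" /tmp/megacli.now >/dev/null; then
            diff "$STATE" /tmp/megacli.now \
                | mail -s "RAID state changed on $(hostname)" "$ADMIN"
        fi
        mv /tmp/megacli.now "$STATE"
        sleep 30
    done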

  • We use MegaCLI, but it carries the risk of hanging the box (observed only once).

    Just changed out a drive last night because of it.

    -Jason

  • Can you share any detail on this? Controller/drive model? MegaCli version? How exactly did it lock up?

    I use it extensively so this worries me. :)

  • If megacli64 works for this RAID controller, then you can do what I
    did: I tweaked some Python scripts I found online and use these two
    scripts. They live in /root/bin, as they are only for root’s use.

    Here’s the typical output of the first script (truncated here)…

    [root@server1 bin]# lsi-raidinfo
    -- Controllers
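
    The scripts themselves aren’t shown here, but the gist of what they
    pull out of megacli64 can be sketched in shell (an illustration only,
    assuming MegaCli64 is on root’s PATH – this is not one of the two
    scripts):

    # Print one line per physical drive: its slot and firmware state.
    MegaCli64 -PDList -aALL -NoLog \
        | awk '/Slot Number/    { slot = $NF }
               /Firmware state/ { sub(/^[^:]*: /, ""); printf "slot %s: %s\n", slot, $0 }'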

  • Locked up the OS, not the array. I power-cycled after the array
    finished syncing the new drive six hours later.

    On a Dell PE2970
    Product Name : PERC 6/i Integrated
    FW Package Build: 6.2.0-0013

    Mfg. Data
    ===============
    Mfg. Date : 06/24/08
    Rework Date : 06/24/08
    Revision No :
    Battery FRU : N/A

    Image Versions in Flash:
    ===============
    FW Version : 1.22.02-0612
    BIOS Version : 2.04.00
    WebBIOS Version : 1.1-46-e_15-Rel
    Ctrl-R Version : 1.02-015B
    Preboot CLI Version: 01.00-023:#%00006
    Boot Block Version : 1.00.00.01-0011

    MegaCLI SAS RAID Management Tool Ver 8.05.71 Apr 30, 2013

    $ while MegaCli64 -PDRbld -ShowProg -PhysDrv [32:1] -aALL; do sleep 1; done

    The sleep 1 was abusive!
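
    If you do want to watch a rebuild that way, a gentler variant of the
    same loop – unchanged apart from quoting the [E:S] argument against
    shell globbing and backing off to one poll per minute – would be:

    while MegaCli64 -PDRbld -ShowProg -PhysDrv '[32:1]' -aALL; do
        sleep 60
    done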

  • They can probably go anywhere, since a normal user won’t have the permissions to open the proper devices anyway.

    I use slightly modified versions of these scripts with Nagios; a
    minimal sketch of such a check follows at the end of this reply. I
    haven’t had a drive fail yet (so one is sure to fail in the next day
    or two), but the scripts worked when the chiller in the room failed
    and the temperature spiked – they notified me that the internal
    temperatures of the ROC and the drives were all too high.

    There is a GUI for the MegaRAID controllers available. I seldom use
    it, so I can’t give too much information about it.

    If the OP’s servers use a different controller, there may still be
    scripts like these; just let us know what the hardware is. (I know
    they exist for 3ware, and I think they may for Areca.)

    –keith
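
    For a Nagios-style check along the lines Keith describes, a minimal
    sketch (not his scripts; the MegaCli64 path and the grep patterns are
    assumptions against stock MegaCLI output):

    #!/bin/bash
    # Exit codes follow the Nagios plugin convention: 0=OK, 2=CRITICAL.
    MEGACLI=/opt/MegaRAID/MegaCli/MegaCli64

    # Count degraded logical drives and failed physical drives.
    degraded=$("$MEGACLI" -LDInfo -Lall -aALL -NoLog | grep -c 'State *: Degraded')
    failed=$("$MEGACLI" -PDList -aALL -NoLog | grep -c 'Firmware state: Failed')

    if [ "$degraded" -gt 0 ] || [ "$failed" -gt 0 ]; then
        echo "CRITICAL - $degraded degraded array(s), $failed failed drive(s)"
        exit 2
    fi
    echo "OK - arrays optimal, no failed drives"
    exit 0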