Hardware Raid Health?

I just had an IBM server in a remote location with a hardware RAID 1 lose both drives. With local machines I probably would have caught it from the drive light before the second one died… What is the state of the art in Linux software monitoring for this? Long ago, when that box was set up, I think the best I could have done was a Java GUI tool that IBM had for their servers – and that seemed like overkill for a simple monitor. Is there anything more lightweight that knows about the underlying drives in a hardware RAID set on IBMs – and also on recent HP servers?

6 thoughts on - Hardware Raid Health?

  • IBM used LSI-based controllers, I believe.

    For our monitoring, we wrote a little script that calls MegaCli64 every
    30 seconds and checks for changes. If anything of note changes (drive
    health, BBU/FBU issues, temperature issues, etc.) it sends us an email.
    A bare-bones sketch of the idea is at the end of this reply. It would
    be fairly easy to do the same for hpacucli, I would imagine.

    Unfortunately, though it’s all open source, it’s part of a package that
    monitors a pile of things (including IPMI sensors, APC UPSes, the Red
    Hat HA stack, etc.), so it wouldn’t be drop-in-and-go. That said, you
    could probably strip it down fairly easily if you wanted to use it.

    If you’re curious, I show how to set it up here. If you’re comfortable
    with Perl, it’ll be pretty easy to adapt, I suspect.

    https://alteeve.ca/w/AN!Cluster_Tutorial_2#Setting_Up_Alerts

    Cheers
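
    A bare-bones sketch of that polling loop, for the curious – this is
    not the actual script from the tutorial, and the MegaCli64 path and
    mail address are assumptions:

    #!/bin/bash
    # Sketch only: snapshot controller state and mail a diff when it changes.
    # Assumes MegaCli64 is at this path and a local MTA handles mail(1).
    MEGACLI=/opt/MegaRAID/MegaCli/MegaCli64
    STATE=/var/run/megacli.state
    ADMIN=admin@example.com   # hypothetical address

    while true; do
        # Logical drives, physical drives, and BBU status in one snapshot.
        "$MEGACLI" -LDInfo -Lall -aALL -NoLog            >  /tmp/megacli.now
        "$MEGACLI" -PDList -aALL -NoLog                  >> /tmp/megacli.now
        "$MEGACLI" -AdpBbuCmd -GetBbuStatus -aALL -NoLog >> /tmp/megacli.now

        # Mail the diff whenever anything changed since the last poll.
        if [ -f "$STATE" ] && ! diff -q "$STATE" /tmp/megacli.now >/dev/null; then
            diff "$STATE" /tmp/megacli.now \
                | mail -s "RAID state changed on $(hostname)" "$ADMIN"
        fi
        mv /tmp/megacli.now "$STATE"
        sleep 30
    done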

  • We use MegaCLI, but it carries the risk of hanging the box (observed only once).

    Just changed out a drive last night because of it.

    -Jason

  • Can you share any detail on this? Controller/drive model? MegaCli version? How exactly did it lock up?

    I use it extensively so this worries me. :)

  • If megacli64 works for this RAID controller, then you can do what I
    did: I tweaked some Python scripts I found online and use these two
    scripts. They live in /root/bin, as they are only for root’s use.

    Here’s the typical output of the first script (truncated here)…

    [root@server1 bin]# lsi-raidinfo
    -- Controllers
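
    The scripts themselves aren’t shown here, but the gist of what they
    pull out of megacli64 can be sketched in shell (an illustration only,
    assuming MegaCli64 is on root’s PATH – this is not one of the two
    scripts):

    # Print one line per physical drive: its slot and firmware state.
    MegaCli64 -PDList -aALL -NoLog \
        | awk '/Slot Number/    { slot = $NF }
               /Firmware state/ { sub(/^[^:]*: /, ""); printf "slot %s: %s\n", slot, $0 }'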

  • Locked up the OS, not the array. I power-cycled after the array
    finished syncing the new drive six hours later.

    On a Dell PE2970
    Product Name : PERC 6/i Integrated
    FW Package Build: 6.2.0-0013

    Mfg. Data
    ===============
    Mfg. Date : 06/24/08
    Rework Date : 06/24/08
    Revision No :
    Battery FRU : N/A

    Image Versions in Flash:
    ===============
    FW Version : 1.22.02-0612
    BIOS Version : 2.04.00
    WebBIOS Version : 1.1-46-e_15-Rel
    Ctrl-R Version : 1.02-015B
    Preboot CLI Version: 01.00-023:#%00006
    Boot Block Version : 1.00.00.01-0011

    MegaCLI SAS RAID Management Tool Ver 8.05.71 Apr 30, 2013

    $ while MegaCli64 -PDRbld -ShowProg -PhysDrv [32:1] -aALL; do sleep 1; done

    The sleep 1 was abusive!
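
    If you do want to watch a rebuild that way, a gentler variant of the
    same loop – unchanged apart from quoting the [E:S] argument against
    shell globbing and backing off to one poll per minute – would be:

    while MegaCli64 -PDRbld -ShowProg -PhysDrv '[32:1]' -aALL; do
        sleep 60
    done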

  • They can probably go anywhere, since a normal user won’t have the permissions to open the proper devices anyway.

    I use slightly modified versions of these scripts with Nagios; a
    minimal sketch of such a check follows at the end of this reply. I
    haven’t had a drive fail yet (so one is sure to fail in the next day
    or two), but the scripts worked when the chiller in the room failed
    and the temperature spiked – they notified me that the internal
    temperatures of the ROC and the drives were all too high.

    There is a GUI for the MegaRAID controllers available. I seldom use
    it, so I can’t give too much information about it.

    If the OP’s servers use a different controller, there may still be
    scripts like these; just let us know what the hardware is. (I know
    they exist for 3ware, and I think they may for Areca.)

    –keith
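
    For a Nagios-style check along the lines Keith describes, a minimal
    sketch (not his scripts; the MegaCli64 path and the grep patterns are
    assumptions against stock MegaCLI output):

    #!/bin/bash
    # Exit codes follow the Nagios plugin convention: 0=OK, 2=CRITICAL.
    MEGACLI=/opt/MegaRAID/MegaCli/MegaCli64

    # Count degraded logical drives and failed physical drives.
    degraded=$("$MEGACLI" -LDInfo -Lall -aALL -NoLog | grep -c 'State *: Degraded')
    failed=$("$MEGACLI" -PDList -aALL -NoLog | grep -c 'Firmware state: Failed')

    if [ "$degraded" -gt 0 ] || [ "$failed" -gt 0 ]; then
        echo "CRITICAL - $degraded degraded array(s), $failed failed drive(s)"
        exit 2
    fi
    echo "OK - arrays optimal, no failed drives"
    exit 0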