LSI MegaRAID Experience…

Hey,

has anyone using an LSI MegaRAID experienced “disappearing drives”…?
We installed 6 new CentOS 6 servers, each with a Supermicro SMC2108 (LSI MegaRAID) controller and 3 PX-128M5Pro SSDs (RAID1 + hot spare).
2 weeks later (with almost no activity on them, since they are not in production, apart from the installation), megacli sees (based on the slot numbers):
– on one server: only the 2nd disk of the RAID1 + the ex-hot-spare brought online. No sign of the first disk, no error message; it just “disappeared”…
– on another server: apparently it lost 1 of the 2 RAID1 disks, and then the second disk of the RAID1…
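
(For reference, a minimal sketch of the megacli commands used to check what the controller currently reports – this assumes the usual MegaCli64 binary; -aALL just queries every adapter:)

    # list every physical drive the controller currently reports
    MegaCli64 -PDList -aALL

    # show each logical drive plus the physical drives behind it
    MegaCli64 -LDPDInfo -aALL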

10 thoughts on - LSI MegaRAID Experience…

  • I have about 20 servers running CentOS 6 with LSI RAID controllers, all using MegaCli64, and I have not had problems. I did have a problem with a white-box using an SM chassis, though: on occasion one node would fail to see a couple of drives on boot. In that case, a cold power-off fixed the problem. I only have two SM chassis, so my sample set is small.

    Given this, though, I’d look at the SM backplane before the LSI controller itself.

  • John Doe wrote:

    You’re saying that if you use megacli, it doesn’t show the physical drive?

    As someone else said, I’d look at the SM box: I have *nothing* but very bad experience with SM’s quality control (try sending 4? 5? 6? boxes out of 20 back for repair from Penguin, a vendor that’s all SM).

    mark “that doesn’t count the couple or so sent back *twice*”

  • We run 3 CentOS6+MR9286-8e systems and we’ve always been able to see the drives via the CLI. The only time a system didn’t report drives was when one of the cards had a slightly bent metal bracket, such that screwing down the bracket lifted the back of the card a bit. When the box was moved, it jolted that card enough that on boot the motherboard no longer saw the card at all, and of course not the drives hanging off it either… straightening the bracket fixed that problem. We’ve seen this two or three times out of the 8 or so cards we’ve had (the original configuration used more cards than the current one; four cards are so much cheaper than 8! =P)

    Now, we’ve tested the systems heavily, “killing” drives, changing things around, etc, and never had a drive vanish unless it was pulled out of the connector, so, no, megacli does not hide dead drives.

    Thanks!
    Miranda

  • I helped deploy a couple of petabytes of storage behind LSI MegaRAID SAS
    9260-8i’s, which is the same card, and never had any storage disappear.

  • The thing that bothers me is that the ctrl sees all the drives at first, then later does not see some of them anymore; it just “forgets” about them like they never existed. I would have expected to still see them, but in a failed state… Here, megacli just lists info for the remaining drive(s), so I miss all the “post mortem” info like the SMART status or the error counts, if they had any… Am I missing an option to add to megacli to show the failed ones too, maybe (a couple of candidates are sketched below)?
    Having used HP RAID ctrls, I am used to seeing all drives, even failed ones.

    Anyway, I’ll have to check the drives, backplane and cabling…

    Thx for all the answers, JD
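
    (A sketch of megacli options that may surface the dropped drives – I’m not certain the controller still tracks them, so these are things to try, not a known fix; the log file name is just a placeholder:)

        # drives the controller has marked as missing from a configured array
        MegaCli64 -PdGetMissing -aALL

        # dump the adapter event log for post-mortem info on dropped drives
        MegaCli64 -AdpEventLog -GetEvents -f lsi-events.log -aALL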

  • From: Drew Weaver

    Not sure about TLER on those Plextors… This is what megacli says:
    ————————————–

  • John Doe wrote:

    TLER would only show up on something that looks at a *very* low level on the physical drive. What I know is that you can see it with smartctl (a quick usage sketch follows this reply) – from the man page:

        scterc[,READTIME,WRITETIME] – [ATA only] prints values and
        descriptions of the SCT Error Recovery Control settings. These
        are equivalent to TLER (as used by Western Digital), CCTL (as
        used by Samsung and Hitachi) and ERC (as used by Seagate).
        READTIME and WRITETIME arguments (deciseconds) set the
        specified values. Values of 0 disable the feature; other values
        less than 65 are probably not supported. For RAID
        configurations, this is typically set to 70,70 deciseconds.

    Note that knowing this was the result of a *lot* of research a couple or so years ago. One *good* thing *seems* to be WD’s new Red line, which is targeted toward NAS, they say… because they’ve put TLER back to something appropriate, like 7 sec or so, where it was 2 *minutes* for their “desktop” drives; they disallowed changing it in firmware around ’09, and the other OEMs followed suit. What makes Red good, if they work, is that they’re only about one-third more than the low-cost drives, where the “server-grade” drives are 2-3 *times* the cost (look at the price of Seagate Constellations, for example).

    mark
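
    (A sketch of checking/setting this with smartctl through the MegaRAID passthrough – the device path and megaraid device number are placeholders for your setup:)

        # show the current SCT ERC (TLER) read/write timeouts
        smartctl -l scterc -d megaraid,0 /dev/sda

        # set both timeouts to 7 seconds (70 deciseconds), if the drive allows it
        smartctl -l scterc,70,70 -d megaraid,0 /dev/sda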
