New Controller Card Issues

Running CentOS 5 (long story, to be updated some day) on a five-year-old Dell PE R415. Whoever spec’d the order got the cheapest embedded controller, and push came to shove: these things gag on any drive larger than 2TB. So we bought some PERC H200s for it and its two mates. This morning I brought the system down, installed the card, and moved the SATA cables over.

This did not end well.

The new card seemed to see the drives and loaded the initrd… and then the kernel panicked every time it went to switch root. I went into the card’s firmware and set the first drive to boot. No change.

The upshot was that I actually had to pull the card: once it was in and configured, it insisted on being 0 in the boot order and would not let me remove it from the boot order, even though I’d moved the SATA cable back to the on-board connector.

So: my manager and I suspect that the initrd simply doesn’t have the driver for the card. I’m going to rebuild the initrd (once I figure out how to do that without the card in the box). The reason I’m posting is to ask y’all whether anyone out there has some *other* thoughts as to what the problem might be, beyond the initrd.

Thanks in advance.

mark

14 thoughts on - New Controller Card Issues

  • That’s the most likely explanation. Use “mkinitrd --with=mpt2sas” (I think that’s the driver you want), along the lines of the sketch below.
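
    A minimal sketch, assuming a stock CentOS 5 mkinitrd and the usual
    /boot layout (adjust the kernel version and image name to match the
    target system):

      # back up the current initrd first
      cp /boot/initrd-$(uname -r).img /boot/initrd-$(uname -r).img.bak
      # rebuild it, forcing the mpt2sas module to be included
      mkinitrd --with=mpt2sas -f /boot/initrd-$(uname -r).img $(uname -r)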

  • I ran into this a couple of years ago with some older 3Ware cards. A
    firmware update fixed it.

  • With 3ware cards, depending on the card model:

    1. the card supports drives > 2TB;
    2. the card as initially released does not support these drives, but there is a firmware update after which it will support drives > 2TB;
    3. the card does not support drives > 2TB even with the latest available firmware. (These are really old cards; even though someone may say hardware doesn’t live that long, I do have them in production as well, and to the credit of 3ware I must say I’ve never seen one die – excluding abuse/misuse, of course.)

    Just my $0.02

    Valeri

    ++++++++++++++++++++++++++++++++++++++++
    Valeri Galtsev
    Sr System Administrator
    Department of Astronomy and Astrophysics
    Kavli Institute for Cosmological Physics
    University of Chicago
    Phone: 773-702-4247
    ++++++++++++++++++++++++++++++++++++++++

  • Well, the 3Ware cards have reached their end of life for me. I just went through a horrible build that ultimately hinged on an incompatibility between a 3Ware card and a new SuperMicro X10-DRI motherboard. Boots would just hang. Once I installed a newer LSI MegaRAID card, all my problems went away.

    I have used and depended on 3Ware for a decade. I worked with both the
    3Ware vendor, which is Avago, and SuperMicro; there is simply no fix. I
    suggest everyone stay away from 3Ware from now on.

  • My experience, over a long, long time, has been that 3ware cards are less reliable than software RAID. It took me a while to convince my previous employer to stop using them. Inability to migrate disk sets across controller families, data corruption, and boot failures due to bad battery daughter cards eventually proved my point. I have never seen a failure due to software RAID, but I’ve seen quite a lot of 3ware failures in the last 15 years.

  • Of the two vendors, 3ware and Supermicro (if it comes down to a
    hardware-defining choice), 3ware wins hands down for me. I gave up on
    Supermicro some time ago; I prefer Tyan system boards instead. Several
    Supermicro boards died on me: all of them were AMD boards, dual- or
    single-socket workstation-class boards. They died of old age, somewhere
    in their fourth year, first developing memory errors (and these are not
    due to the memory controller, which is on the CPU substrate – I can
    attest to that), and a few months later they would die altogether. I
    ascribed it to poor system-board electronics design, and poor design of
    even one class of their products is a decision maker – at least for me.
    So, as far as system boards are concerned, I’m happy with Tyan, whose
    boards I’ve been using forever and who has been in the server-board
    business forever (which was true even when Supermicro first emerged).

    As far as RAID cards are concerned… this is a bit more difficult. I use
    both LSI and 3ware; 3ware outnumbers LSI in my server room by a factor
    of at least 5. LSI is good, solid hardware that has been around forever
    (pretty much as 3ware has). For me, the big advantage of 3ware is its
    transparent interface, by which I mean the web interface. There is a
    command-line interface for both, and the 3ware command-line interface
    may be less confusing for me, but the transparent web interface
    available in the 3ware case is the real winner in my opinion. When
    dealing with (very good, which both are) RAID hardware, the screw-ups
    happen mostly due to operator error (OK, those of you who never screw
    up even with a confusing interface are geniuses in my book – which I
    definitely am not).

    Sorry for the long comment. I do feel 3ware deserves more respect than one might otherwise take away from this thread.

    Valeri


  • Kirk Bocek wrote:
    a) The old one and the new one are both LSI; it’s just that the
    original was *cheap*, bottom of the line.
    b) I’ve been told by Dell support that the old one is out of production
    and there will *not* be any firmware updates to correct it, which is
    why we bought the PERC H200s.

    One more datum: running on the old controller, with the new one removed (the only way I could get it up), lsmod showed me mptsas references. Searching for the PERC H200, I found a driver on Dell’s site… and part of its name was mpt2sas. So I’ve rebuilt the initrd, forcing it to include that (quick check sketched below).
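
    A quick sanity check, as a sketch (the module name and initrd path
    follow from the guesses above; adjust to your kernel version – on
    CentOS 5 the initrd is normally a gzipped cpio archive):

      # which LSI/mpt driver is the running kernel actually using?
      lsmod | grep -i mpt
      # does the rebuilt initrd really contain the mpt2sas module?
      zcat /boot/initrd-$(uname -r).img | cpio -t | grep mpt2sas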

    Now, talking to my manager, we’re thinking of firing up another box for the home directories, and then I can have some hours to resolve the problem, rather than having three people, including the other admin, sitting around unable to log on….

    mark

  • I strongly disagree with that.

    I have a large number of 3ware-based RAIDs in my server room. During
    the last 13 years or so I have never had any failures or data losses on
    these RAIDs – at least not due to hardware (knocking on wood; I guess I
    should start calling myself lucky).

    Occasionally some researchers (who come mostly from places where they
    self-manage machines) bring stories of disasters/data losses. Every one
    of them, when I looked into the details, turned out to be purely due to
    a poorly configured hardware RAID in the first place. So I end up
    telling them: before telling others that 3ware RAID cards are bad and
    let you down, check that what you set up does not contain obvious
    blunders. Let me list the major ones I have heard of:

    1. Bad choice of drives for the RAID. Any “green” (spin-down to
    conserve energy) drives are not suitable for RAID. Even drives that are
    not “spin-down” but are simply poor quality have, when they work in
    parallel (say 16 in a single RAID unit), a much larger chance of more
    than one failing simultaneously. If you went as far as buying a
    hardware RAID card, spend some 10% more on good drives (and buy them
    from a good source); don’t just follow the price grabber.

    2. Bad configuration of the RAID itself. You need to run “verification”
    of the RAID every so often; my RAIDs are verified once a week. At
    minimum this forces the drives to scan the whole surface regularly and
    to discover and re-allocate bad blocks. If you don’t do it for over a
    year, you have a fair chance of a RAID failure when several drives fail
    on a particular stripe (because of accumulated, never-discovered bad
    blocks)… and then you lose your RAID with its data. This is purely a
    configuration mistake. (See the sketch after this list.)

    3. Smaller, yet still a blunder: having a card without battery-backed
    cache memory and running the RAID with the write cache enabled (in
    which case the RAID device is much faster). If in this configuration
    you yank the power, you lose the contents of the cache, and the RAID
    quite likely will be screwed up big time.
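
    A rough sketch of the weekly verification I mean, assuming a 3ware
    9000-series card with tw_cli installed (the controller/unit numbers
    /c0 and /u0, and the tw_cli path, are placeholders for whatever your
    system actually reports):

      # show the controller, its units and their status
      tw_cli /c0 show
      # kick off a verify pass on unit 0
      tw_cli /c0/u0 start verify
      # or schedule it weekly from cron, e.g. Sundays at 03:00
      # 0 3 * * 0 /usr/sbin/tw_cli /c0/u0 start verify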

    Of course, there are some restrictions; in particular, you can’t always
    attach the drives to a different card model and have the RAID keep
    functioning. 3ware cards usually detect that and export the RAID
    read-only, so you can copy the data elsewhere and then re-create the
    RAID so it is compatible with the internals of the new card.

    I do not want to start a “software” vs. “hardware” RAID war here, but I
    really have to mention this: software RAID is implemented in the
    kernel. That means your system has to be running for software RAID to
    do its job. If you panic the kernel, software RAID stops in the middle
    of whatever it was doing and hadn’t finished yet.

    Hardware RAID, on the contrary, does not need the system running. It is
    implemented in the card’s embedded system, and all it needs is power.
    That embedded system is rather rudimentary and performs only a single
    simple function: it chops up the data flow and calculates RAID stripes.
    I’ve never heard of this embedded system panicking (mainly a result of
    its simplicity).

    So, even though I’m strongly in favor of hardware RAID, I still
    consider the choice just a matter of taste. And I would be much happier
    if the software RAID people had the same attitude as well ;-)

    Just my $0.02

    Valeri


  • OK. And I’ll tell you that none of the failures that I’ve seen in the last 15 years were a result of user error or poor configuration.

    None of our failures were due to drives.

    The same is true of software RAID, and periodic checking is enabled by default on RHEL/CentOS.
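
    For reference, a sketch of what that looks like with md RAID, assuming
    an array named md0 (on recent RHEL/CentOS releases the mdadm package
    ships a weekly raid-check job that does this automatically):

      # trigger a scrub/check of the array by hand
      echo check > /sys/block/md0/md/sync_action
      # watch progress
      cat /proc/mdstat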

    I don’t know if that was ever allowed, but the last time I looked at a
    3ware configuration, you could not enable write caching without a BBU.

    Yes, and that is one of the reasons I advocate software RAID. If I have
    a large data set on a 3ware controller, and that controller dies, then
    I can only move those disks to another system if I have a compatible
    controller. With software RAID, I don’t have to hope the controller is
    still in production. I don’t have to wait for delivery of a
    replacement. I don’t have to keep spare, expensive equipment on hand. I
    don’t have to do a prolonged restore from backup. Any system running
    Linux will read the RAID set.
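
    As a sketch of what that migration looks like with md RAID (device
    names here are assumptions; run it against whatever the new box
    enumerates):

      # see which disks/partitions carry md metadata
      mdadm --examine --scan
      # assemble every array found in the scan
      mdadm --assemble --scan
      cat /proc/mdstat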

    I’m not saying that hardware RAID is entirely bad. Software RAID does
    have some advantages over hardware RAID, but my point has been that
    3ware cards, specifically, are less reliable than software RAID, in my
    experience.

  • If I understand you correctly, you are saying that 3ware RAID cards
    are prone to hardware failures – as opposed to software RAID, which is
    not (as it does not include hardware, so it never has hardware
    failures), right? No, that is a joke of course. But it’s the one you
    asked for ;-)

    Now, seriously: of the more than a couple of dozen cards I have used
    over the last 13 years or so, not a single one died on me. Still, I do
    not consider myself merely lucky. I also do not have especially
    excellent conditions in the server room – just ordinary ones.
    Sometimes, during air-conditioning maintenance, the temperature in the
    server room is higher than in our offices (once we had 96F for a couple
    of hours; of course that was a unique case. BTW, during those two hours
    none of the AMD-based boxes got sick, but a few Intel ones did). So I
    have no special conditions for our equipment. Boxes are behind APC
    UPSes, that’s true, but that is nothing special. I should also mention
    that all the 3ware cards I used were bought new, never used by someone
    else. A couple of dozen is a decently representative sample
    statistically, so I probably can’t just call myself lucky. I’m still
    convinced that 3ware RAID cards _ARE_ reliable.

    Again, I don’t know your statistics: i.e. how many you used, how many
    of them died (failed as hardware), how surge-free the power is, and how
    well the people who installed your cards into your boxes followed
    static-discharge precautions (yes, even though these cards are robust,
    one may fry them slightly with a static discharge, and they may fail –
    like many fried ICs, they don’t always fail right away, but some time
    later, say in a year, still as a result of the static discharge). To be
    fair, I’m often not wearing an anti-static bracelet, but I do touch
    metal parts of the box, the anti-static bag the card ships in, etc. –
    which is sufficient IMHO.

    So: still not considering myself lucky, yet never having had a 3ware
    card die on me (out of over a couple of dozen, during more than 13
    years, with most of the cards in service for some 8-10 years since they
    were originally installed). In my book, 3ware RAID cards are reliable
    hardware.

    Valeri


  • Software RAID uses whatever controller the disks are connected to. Often, that’s the AHCI controller on the motherboard. And those controllers are almost always more reliable than 3ware. Seriously. No joke.

    Hundreds of systems, both with 3ware and with software RAID.

    Not many, but some: a handful of data-corruption cases (maybe 5? I
    don’t have logs); one BBU failure that resulted in a system that
    wouldn’t boot until the BBU was removed (and a couple of hours of
    downtime while 3ware techs worked that ticket); and a couple of times
    when we really wanted to move an array of disks and couldn’t.

    Vs zero reliability issues with software RAID.

    We always used trusted UPS hardware.

    Our systems were built by a professional VAR. Employees always wore ground straps. There were other anti-static measures in place as well.
    I had been to the facility.

    And I’ll throw you a curveball:

    I’ve argued that software RAID is more reliable than 3ware cards,
    specifically. In the world of ZFS and btrfs, you absolutely should not
    use hardware RAID. As you mentioned earlier, hardware RAID volumes
    should be scanned regularly. However, scanning mostly tells you whether
    the disks have bad sectors; it can also detect data (parity)
    mismatches, but it can’t repair them, because hardware RAID cards have
    no information telling them which copy of the data is correct. ZFS and
    btrfs, on the other hand, checksum their data and metadata: they can
    tell which blocks are damaged in the case of corruption such as bit
    flips, and which copy should be used to repair the data.
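
    A sketch of the self-healing scrub being described, assuming a ZFS pool
    named tank and a btrfs filesystem mounted at /mnt/data (both names are
    placeholders):

      # ZFS: read everything, verify checksums, repair from redundancy
      zpool scrub tank
      zpool status tank
      # btrfs: same idea
      btrfs scrub start /mnt/data
      btrfs scrub status /mnt/data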

    Now, btrfs and ZFS are completely different from software RAID. I’m not comparing the two. But hardware RAID simply has too many deficiencies to justify its continued use. It should go away.

  • Both are now owned by Avago. (Not sure when that happened; last I
    looked, LSI was its own company.)

    I find the 3ware CLI a little clunky but easy to understand. I find the LSI CLIs (both MegaCLI and storcli) incredibly confusing, and the GUI
    interface is not intuitive (and I think doesn’t expose all the information about the controller; the Nagios LSI plugin found errors that I could find no trace of in the GUI).

    Agreed; I’ll share some of my experiences in another post.

    –keith

  • I have had a couple dozen hardware RAID controllers over the years. I
    have not had the success you’ve had, but I’ve had very few hard
    failures. I have had one data-loss event, where a bad BBU was causing
    problems with an old controller (of course I had backups, as should
    everyone). I’ve had two other controllers just die, but replacing them
    was easy, and the new controllers recognized the arrays immediately.
    This includes moving two different disk arrays from two different 9650s
    to two different 9750s, so whoever wrote that arrays are not compatible
    across different models is at least partly incorrect.

    I do also have an LSI controller, which has been fine, but it’s only
    one controller, so that’s not enough data points to draw any
    conclusions. I also have an md RAID array (on a very old 3ware
    controller which doesn’t support RAID 6), and it’s also been fine. It
    hasn’t suffered through any major catastrophes, though I do think it’s
    had one or two fatal kernel panics, and once or twice a hard reset.
    It’s still fine even with a small number of really crappy “green”
    drives still in the array (I learned that lesson the hard way – don’t
    use green drives with a hardware RAID controller!).

    –keith