RAID 6 – Opinions

I’m setting up this huge RAID 6 box. I’ve always thought of hot spares, but I’m reading things that are comparing RAID 5 with a hot spare to RAID
6, implying that the latter doesn’t need one. I *certainly* have enough drives to spare in this RAID box: 42 of ’em, so two questions: should I
assign one or more hot spares, and, if so, how many?

mark

27 thoughts on - RAID 6 – Opinions

  • From: Joseph Spenner

    Also, if you lose a disk, the RAID6 can lose a second disk at any time without a problem. The RAID5 cannot until the hot spare has fully replaced the dead disk (which can take a while). And I believe the RAID6 algorithm might be (a little) more demanding/slower than RAID5. Also check RAID50 and 60 if your controller permits it…

    JD

  • I was building a home NAS over the holidays and had the same question
    (well, not hot spare, but 5 vs. 6). A good friend of mine pointed me to the following article:

    http://www.zdnet.com/blog/storage/why-raid-5-stops-working-in-2009/162

    I was using 6x 3 TB drives, so I decided to opt for RAID 6. About a month ago, a drive cacked out and I was *very* relieved to know that I
    was covered until I replaced the disk and it finished rebuilding.

    If you have 42 disks, I’d not even think twice and I would use RAID
    level 6. In fact, with such a large number, I’d almost be tempted to break it into two separate RAID level 6 arrays and use something like LVM to pool their space, just to hedge my bets.
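
    If you went that route, the LVM side is simple enough. A rough sketch (the /dev/md* names and the choice of ext4 are just examples, not necessarily what you’d have):

    pvcreate /dev/md0 /dev/md1              # the two RAID6 arrays
    vgcreate vg_storage /dev/md0 /dev/md1
    lvcreate -n lv_data -l 100%FREE vg_storage
    mkfs.ext4 /dev/vg_storage/lv_data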

  • John’s First Rule of RAID: when a drive fails 2-3 years downstream, replacements will be unavailable. If you had bought cold spares and stored them, odds are too high they will be lost when you need them.

    John’s Second Rule of RAID: no single raid should be much over 10-12
    disks, or the rebuild times become truly hellacious.

    John’s Third Rule of RAID: allow 5-10% hot spares.

    So, with 42 disks, 10% would be ~4 spares, which leaves 38; 5% would be
    2 spares, allowing 40 disks.

    40 divided by 4 == 10, so you could format that as four 10-disk raid6’s, stripe those (aka raid6+0 or raid60), and use 2 hot spares. Alternately, 3*13 == 39, leaving three hot spares, so three 13-disk
    raid6’s striped, with 3 hot spares, is another option.
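
    If this box were using Linux md rather than a hardware RAID controller, that first layout would look roughly like this (untested sketch; the device names are made up):

    # four 10-disk RAID6 arrays, then striped together as RAID60
    mdadm --create /dev/md0 --level=6 --raid-devices=10 /dev/sd[b-k]
    mdadm --create /dev/md1 --level=6 --raid-devices=10 /dev/sd[l-u]
    # ... two more the same way, then:
    mdadm --create /dev/md10 --level=0 --raid-devices=4 /dev/md[0-3]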

    I did some testing of very large raids using LSI Logic 9261-8i MegaRAID
    SAS2 cards driving 36 3TB SATA3 disks. With 3 x 11 disk RAID6 (and 3
    hot spares), a failed disk took about 12 hours to restripe with the rebuilding set to medium priority, and the raid essentially idle.

    If you’re using XFS on this very large file system (which I *would*
    recommend), do be sure to use a LOT of RAM, like 48GB… while regular operations might not need it, XFS’s repair process (xfs_repair) is fairly memory intensive on a very large volume with millions of files.
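
    For what it’s worth, xfs_repair can be told roughly how much memory it is allowed to use, which may help on a box like that; a hedged example (the device name is made up, and the value is in megabytes):

    xfs_repair -m 32768 /dev/vg_storage/lv_data   # let repair use up to ~32GB of RAM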

  • As another poster mentioned, I’d even break this up into multiple RAID6
    arrays. One big honking 42 drive array, if they’re large disks, will take forever to rebuild after a failure.

    With this many drives, I’d designate at least one as a global spare anyway. Yes, you lose some capacity, but you have even more cushion if, say, you’re out of town for a week, a drive fails, and your backup person is sick.

    One possible configuration is to create three RAID6 arrays with 11 drives each (or one or two with 12 instead), and group them using LVM. You could also simply create one RAID6 with the capacity you need for the next few months, then create new arrays and add them to your volume group as you need them. This has the added bonus that you look like a genius for deploying new capacity so quickly. :)

    Recently I acquired a half-empty storage array, so that I can add larger drives as they become available instead of being tied to drive sizes of today.
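
    The grow-as-you-go approach is only a few commands per new array; a rough sketch (the device, VG, and LV names here are just placeholders):

    pvcreate /dev/md1                               # the newly created RAID6
    vgextend vg_storage /dev/md1                    # add it to the existing volume group
    lvextend -l +100%FREE /dev/vg_storage/lv_data
    resize2fs /dev/vg_storage/lv_data               # grow an ext4 filesystem online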

    I seem to remember reading on the linux RAID mailing list that, at least for linux md RAID6 (which the OP may not be using), performance on a RAID6
    with one missing drive is slightly worse than optimal RAID5. I could be wrong however, and perhaps a hardware RAID controller doesn’t have this deficiency.

    –keith

  • John R Pierce wrote:

    Ok, listening to all of this, I’ve also been in touch with a tech from the vendor*, who had a couple of suggestions: first, two RAID sets with two global hot spares.

    I’ve just spoken with my manager, and we’re going with that. The tech’s other suggestion was three volume sets on top of the two RAID
    sets, so we’ll have what look like three drives/LUNs of about 13+TB each.

    All your comments were very much appreciated, and gave me a lot more confidence in this setup. We will be using ext4, btw – I don’t get to try out XFS on this $$$$$$ baby.

    mark

    * Unpaid plug: we bought this from AC&NC. Their own price was cheaper than either of the two resellers I spoke to (three quotes required), they seem pretty hungry (but have been around a while, given the number of old boxes we have), and they respond *very* quickly to support issues.

  • m.roth@5-cent.us wrote:


    Followup comment: I created the two RAID sets, then started to create the volume sets… and realized I didn’t know if it was *possible*, much less desirable, to have a volume set that spanned two RAID sets. Talked it over with my manager, and I redid it as three RAID sets, one volume set each.

    Maybe the initialization will be done tomorrow….

    mark

  • I would test how long a drive rebuild takes on a 20-disk RAID6. I
    suspect it will be very long, like over 24 hours, even assuming a fast controller and sufficient channel bandwidth.
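
    On Linux md, at least, you can time this directly; a rough sketch (the md and sd names are just examples):

    mdadm /dev/md0 --fail /dev/sdq --remove /dev/sdq   # simulate a failure
    mdadm --zero-superblock /dev/sdq                   # force a full rebuild, not a bitmap-assisted resync
    mdadm /dev/md0 --add /dev/sdq                      # re-add it and let the rebuild run
    cat /proc/mdstat                                   # shows rebuild progress and an ETA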

  • Sure. Throw all the RAIDs into a single LVM volume group, and then stripe 3 logical volumes across that volume group.
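
    Roughly like this, assuming three RAID LUNs; every name and size below is just a placeholder:

    pvcreate /dev/md0 /dev/md1 /dev/md2
    vgcreate vg_big /dev/md0 /dev/md1 /dev/md2
    lvcreate -n lv1 -i 3 -I 256 -L 13T vg_big   # -i 3 stripes each LV across all three PVs
    lvcreate -n lv2 -i 3 -I 256 -L 13T vg_big
    lvcreate -n lv3 -i 3 -I 256 -L 13T vg_big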

  • —– Original Message —–

    Just for reference, I have a 24 x 2TB SATA III box using CentOS 6.4 Linux MD RAID6, with two of those 24 disks as hot spares. The drives are in a Supermicro external SAS/SATA box connected to another Supermicro 1U computer with an i3-2125 CPU @ 3.30GHz and 16GB RAM. The connection is via a 6Gbit mini-SAS cable to an LSI 9200 HBA. Before I deployed it into production I tested how long it would take to rebuild the RAID from one of the hot spares, and it took a little over 9 hours.

    I have two 15TB LVs on it formatted ext4, with the rest used for LVM snapshot space if needed. Using dd to write a large file to one of the partitions I see about 480MB/s. If I rsync from one partition to another I get just under 200MB/s.

    dd if=/dev/zero of=/backup/5GB.img count=5000 bs=1M
    5000+0 records in
    5000+0 records out
    5242880000 bytes (5.2 GB) copied, 10.8293 s, 484 MB/s

    David.

  • I did a similar test on a 3ware controller. Apparently those cards have a feature that allows the controller to remember which sectors on the disks it has written, so that on a rebuild it only reexamines those sectors. This greatly reduces rebuild time on a mostly empty array, but it means that a good test would almost fill the array, then attempt a rebuild. I definitely saw a difference in rebuild times as I filled the array. (In 3ware/LSI world this is sometimes called “rapid RAID
    recovery”.)

    In checking my archives, it looks like a rebuild on an almost full 50TB
    array (24 disks) took about 16 hours. That’s still pretty respectable. I didn’t repeat the experiment, unfortunately.

    I don’t know if your LSI controller has a similar feature, but it’s worth investigating.

    –keith

  • Yeah, until a disk fails on a 40-disk array and the chassis LEDs on the backplane don’t light up to indicate which disk it is, and your operations monkey pulls the wrong one and crashes the whole raid.

    have fun with that!

    If you can figure out how to get the drive backplane status LEDs to work on Linux with a ‘dumb’ controller plugged into a drive backplane, PLEASE
    WRITE IT UP ON A WIKI SOMEWHERE!!! Everything I’ve seen leaves this gnarly task as an exercise for the reader. With a card like a 9261-8i, it just works automatically.
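
    The closest I’ve seen to a general answer is the ledmon package, which only helps when the HBA and backplane actually speak SGPIO or SES-2; no promises it works on any given enclosure, and the device name below is just an example:

    yum install ledmon
    ledctl locate=/dev/sdq        # blink the locate LED on the suspect drive’s slot
    ledctl locate_off=/dev/sdq    # turn it off again after the swap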

    Also, hardware raid controllers WITH battery-backed (or flash-backed)
    cache can greatly speed up small-block write operations like directory entry creates, database writes, etc.

  • Besides performance, the longer your rebuild takes, the more vulnerable you are to an additional disk failure taking out your array. We’ve lost arrays that way in the past, pre-RAID6: we lost two disks within a 6-hour period, and there went the array, since the rebuild wasn’t complete. RAID6 means you can handle 2 disk failures, but the third one will drop your array, if I’m remembering correctly. And the larger the number of disks, the higher the chance that you’ll have disk failures…

    Thanks!
    Miranda

  • Yes, and yes. But different configurations of other RAID levels will give you different levels of protection–not “better” or “worse”, because that needs to be evaluated in context.

    For example, as has been noted, RAID6 can lose up to two drives, and the third lost drive loses the array [0]. A 12-drive RAID10, with six two-drive RAID1 components, can lose up to six drives, but only the right six drives–losing both drives of one RAID1 loses the entire array. On the other side of things, rebuilding a 12-drive RAID6 will take much longer than rebuilding one RAID1 component of a RAID10. And as one more example, a 12-drive RAID50, with three four-drive RAID5
    components, can lose up to three drives, one from each component, but two drives from one RAID5 loses the array. Rebuild times will be longer than RAID10 but shorter than RAID6. (There are also performance questions, which I know little about.)

    RAID6 is certainly the most efficient way, space-wise, to allocate drives such that you can lose up to two drives before losing the array. So if maximizing storage space is the primary concern, ahead of performance, RAID6 is likely the best choice. But, as is often repeated here, on the md RAID list, and elsewhere, ***RAID IS NOT A BACKUP
    SOLUTION!!!*** If you care about your data you need to back it up elsewhere. Do *not* rely solely on RAID to keep your data safe! All sorts of bad things can happen: a flaky controller can cause filesystem problems, and a badly defective controller can completely destroy the array. RAID allows you to tolerate some failure, but it can’t save your data from catastrophe.

    –keith

    [0] “loses the array” here means that it won’t be mountable without some sort of expensive drive recovery process.

  • You simply match up the Linux /dev/sdX designation with the drive’s serial number using smartctl. When I first bring the array online I have a script that greps the drive serial numbers out of smartctl and creates a neat text file with the mappings. When either smartd or md complains about a drive, I remove the drive from the RAID using mdadm and then pull the drive based on the mapping file. Drive 0 in those SuperMicro SAS/SATA arrays is always the lowest drive letter, and it goes up from there. If a drive is replaced I just update the text file accordingly.

    You can also print out the drive serial numbers and put them on the front of the removable drive cages. It is not as elegant as a blinking LED but it works just as well. I have been doing it like this for 6-plus years now with a few dozen SuperMicro arrays. I have never pulled a wrong drive.
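
    The mapping script itself is only a few lines; a rough sketch of the idea (untested, and the output path is arbitrary):

    for d in /dev/sd[a-z] /dev/sd[a-z][a-z]; do
        [ -b "$d" ] || continue
        printf '%s %s\n' "$d" "$(smartctl -i "$d" | awk -F': *' '/Serial Number/ {print $2}')"
    done > /root/drive-serial-map.txt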

    David.

  • We use several boxes of this kind (but with 45 trays), and our experience was that the optimum volume size was 12 HDDs (3 x 12 + 9),
    which reduces the 45 disks to a usable capacity of 37 disks’ worth (a 12-disk volume is 40 TB in size… in the event of a broken HDD it takes 1 day to recover; with more than 12 disks I don’t (want to) know how long it would take), and we don’t use hot spares.

    HTH, Adrian

  • [snip]

    I think that there is at least one potential problem, and possibly more, with your method.

    1) It only takes forgetting to update the mapping file once to screw things up for yourself. Some people are the type who will never forget to do that. I’m (unfortunately) not. (Actually, I guess it takes twice, since if you have only one slot out of date, you could use the serial numbers to map all but that one drive, and that’s the suspect drive. I wouldn’t want to trust that process.)

    2) Drive assignments can be dynamic. If you pull the tray in port 0, which was sda (for example), you’re not necessarily guaranteed that the replacement drive will be sda. It might be assigned the next available sdX. I have seen this in certain failure situations. (As an aside, how does the kernel handle more than 26 hard drive devices? sdaa? sdA?)

    1a and 2a) Printing serial numbers and taping them to the tray is much less error-prone, but also more time consuming. If you have a label printer that certainly makes things easier.

    3) If you have someone else pulling drives for you, they may not have access to the mapping file, and/or may not be willing or under contract to print a new tray label and replace it. It’s way less error-prone to tell an “operations monkey” to pull the blinky drive than to hope you read the mapping file correctly and relayed the correct location to the monkey. (The ops monkey may not have login rights on your server, so you also can’t rely on him being able to look at the mapping file himself.) If you’re the only person who will ever pull drives, this isn’t such a huge problem.

    That’s not to say that your methods can’t work–obviously they can if you haven’t had any mistakes in many years. But the combination of a BBU-backed write cache and an identify blink makes a dedicated hardware RAID controller a big win for me. (I do also use md RAID, even on hardware RAID controllers, where flexibility and portability are more important than performance.)

    –keith

  • sdaa, sdab, sdac, … sdba, sdbb, sdbc…. etc etc.

    And yes, if you have 40+ disks as JBOD, it’s a bloody mess, especially if Linux udev starts getting creative.

    Many of the systems I design get deployed in remote DCs and are installed, managed, and operated by local personnel, where I have no clue as to the level of their skills, so it’s in my best interest to make the procedures as simple and failsafe as possible. When faced with a wall of 20 storage servers, each with 48 disks, good luck finding that
    20-digit alphanumeric serial number “3KT190V20000754280ED” … uh HUH, that’s assuming all 960 disks got just the right sticker put on the caddies. ‘Replace the drives with the red blinking lights’ is much simpler than ‘figure out what /dev/sdac is on server 12’.

  • We build a storage unit that anyone using CentOS can build. It is based on the 3ware 9750-16 controller. It has 16 x 2TB SATA 6Gb/s disks. We always set it up as a 15-disk RAID 6 array and a hot spare. We have seen multiple instances where the A/C has gone off but the customer’s UPS kept the systems running for an hour or two with no cooling.

  • Hi, Seth,

    Seth Bardash wrote:

    Interesting. We’re still playing with sizing the RAID sets and volumes. The prime consideration for this is that the filesystem utils still have problems with > 16TB (and they appear to have been saying that fixing this is a priority for at least a year or two), so we wanted to get close to that.

    This afternoon’s discussion has me with two 17-drive RAID sets and a 6-drive set; on those, one volume set each, which gives us 30TB usable on the two large RAID sets and 8TB usable on the small one, all RAID 6.

    That’s not a problem for us – we’ve got two *huge* units in the server room, er, computer lab that this RAID box is in.

    THANK YOU. I’ll forward this email to my manager and the other admin –
    it’s really helpful to know this, even if it isn’t your box. What came with this are QLogic FC HBAs for the server.

    mark

  • I’ve had no issues with 64bit CentOS 6.2+ on 81TB (74TiB) volumes using GPT and XFS(*), with or without LVM.

    (*) NFS has an issue with large XFS volumes if you export directories below the root and those directories have large inode numbers. The workarounds are to either precreate all the directories that are to be exported before filling up the disk, or to specify arbitrary IDs less than 2^32 on the NFS export using fsid=nnn in the /etc/exports entry for those paths (these ID values have to be unique on that host). This is a stupid bug in NFS itself that gets triggered by XFS’s use of 64-bit inode numbers.
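
    The fsid workaround looks something like this in /etc/exports (the paths and client spec here are made up for illustration):

    /export/big/projects   192.168.0.0/24(rw,fsid=101)
    /export/big/scratch    192.168.0.0/24(rw,fsid=102)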

  • hi,

    That is why I put a label on every drive tray that is visible without pulling the disk. That label carries the serial number, so that the monkey can double-check the disk serial before pulling it. In fact, I
    was the silly monkey once, so I am careful now :-)

    best regards

  • It’s great that the Supermicro controllers can do this, but I know from experience that in the general case, with multiple controllers and on CentOS 6, this will not work. Just a quick caveat on that…

  • This is what I love about real RAID controllers or real storage array systems, like NetApp, EMC, and others. Not only does the faulted drive light up amber, but the shelf/DAE also lights up amber.

    I told an EMC VP a week or so ago that ‘anybody can throw a bunch of drives together, but that’s not what really makes an array work.’ The software that alerts you and does the automatic hotsparing (even across RAID groups (using EMC terminology)) is where the real value is. A
    bunch of big drives all lopped together can be a pain to troubleshoot indeed.

    I’ve done arrays with a bunch of COTS drives; and I’ve done EMC. Capex is easier to justify than opex in a grant-funded situation, and that’s why in 2007 we bought our first EMC Clariions (44TB worth, not a lot by today’s standards), since the grant would fund the capex but not the opex, and I’ve not regretted it once since. One of those Clariion CX3-10c’s has been continuously available since placed into service in
    2007, even through OS (EMC FLARE) upgrades/updates and a couple of drive faults.
