Booting Software RAID


I installed CentOS 6.x 64 bit with the minimal ISO and used two disks in RAID 1 array.

Filesystem Size Used Avail Use% Mounted on
/dev/md2 97G 918M 91G 1% /
tmpfs 16G 0 16G 0% /dev/shm
/dev/md1 485M 54M 407M 12% /boot
/dev/md3 3.4T 198M 3.2T 1% /vz

Personalities : [raid1]
md1 : active raid1 sda1[0] sdb1[1]
      511936 blocks super 1.0 [2/2] [UU]

md3 : active raid1 sda4[0] sdb4[1]
      3672901440 blocks super 1.1 [2/2] [UU]
      bitmap: 0/28 pages [0KB], 65536KB chunk

md2 : active raid1 sdb3[1] sda3[0]
      102334336 blocks super 1.1 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md0 : active raid1 sdb2[1] sda2[0]
      131006336 blocks super 1.1 [2/2] [UU]

My question is: if sda fails, will it still boot from sdb? Did the install process write the boot sector to both disks or just sda? How do I check, and if it's not on sdb, how do I copy it there?


  • Matt wrote:

    Tests I did some years ago indicated that the install process does not write grub boot information onto sdb, only sda. This was on Fedora 3
    or CentOS 4.

    I don’t know if it has changed since then, but I always put the following in the %post section of my kickstart files:

    # install grub on the second disk too
    grub --batch <

    Cheers, Tony
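
    For reference, a sketch of what such a %post snippet typically contains (the device mapping below is an assumption; adjust it for the actual second disk):

    %post
    # hypothetical sketch: put GRUB on the second disk too, mapping /dev/sdb
    # as hd0 so the boot sector written there points at itself if sdb ever
    # becomes the first BIOS disk
    grub --batch <<EOF
    device (hd0) /dev/sdb
    root (hd0,0)
    setup (hd0)
    quit
    EOF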

  • I’ve found that the grub boot loader is only installed on the first disk. When I do use software raid, I have made a habit of manually installing grub on the other disks (using grub-install). In most cases I dedicate a RAID1 array to the host OS and have a separate array for storage.

    You can check to see that a boot loader is present with `file`.

    ~]# file -s /dev/sda
    /dev/sda: x86 boot sector; partition 1: ID=0xfd, active, starthead 1, startsector 63, 224847 sectors; partition 2: ID=0xfd, starthead 0, startsector 224910, 4016250 sectors; partition 3: ID=0xfd, starthead 0, startsector 4241160, 66878595 sectors, code offset 0x48

    There are other ways to verify the boot loader is present, but that’s the one I remember off the top of my head.

    Use grub-install to install grub to the MBR of the other disk.
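
    For example (a sketch, assuming the second disk is /dev/sdb and /boot is already assembled and mounted):

    grub-install /dev/sdb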

  • # file -s /dev/sda
    /dev/sda: x86 boot sector; GRand Unified Bootloader, stage1 version
    0x3, boot drive 0x80, 1st sector stage2 0x849fc, GRUB version 0.94;
    partition 1: ID=0xee, starthead 0, startsector 1, 4294967295 sectors, extended partition table (last)\011, code offset 0x48

    # file -s /dev/sdb
    /dev/sdb: x86 boot sector; GRand Unified Bootloader, stage1 version
    0x3, boot drive 0x80, 1st sector stage2 0x849fc, GRUB version 0.94;
    partition 1: ID=0xee, starthead 0, startsector 1, 4294967295 sectors, extended partition table (last)\011, code offset 0x48

    I am guessing this is saying it’s present on both? I did nothing to copy it there, so the CentOS 6.x install process must have done it.

  • I think the 6.x installers try to do it for you on both drives – but whether it actually works or not may depend on the type of failure and what the BIOS does to the disk mapping as a result. In any case it is a good idea to know how to recover from a rescue-mode boot of the install ISO.
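
    A sketch of that recovery, assuming the rescue environment mounts the installed system under /mnt/sysimage and the surviving disk comes up as /dev/sda:

    # boot the install ISO, choose the rescue option, let it mount the system, then:
    chroot /mnt/sysimage
    grub-install /dev/sda
    exit
    reboot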

  • and, even if you have the boot loader on both drives, there’s no guarantee your BIOS will boot from the 2nd one. Disks can partially fail in nasty ways that might allow the already-running system to stay up on the other half of the mirror, but when the drive is ‘tested’ during the power-up boot sequence, it could hang the system.

  • True, but forwarding root’s mail to the admin’s e-mail address will warn about the degraded mirror, so the physical intervention of choice can be made. I think a manual change in the BIOS is of little consequence if the system will boot off the surviving disk(s).

    And if the disks can be hot-swapped, then the only concern is that GRUB and /boot survive the crash.

  • Doesn’t grub need to know the BIOS disk id for subsequent stages of the boot and where to find the root filesystem? I think it matters whether or not the BIOS remaps your 2nd drive to the first id.

    And if you know how to do a rescue-mode boot and reinstall grub, you can fix that too.

  • It would appear so. But I’d recommend simply yanking out one drive, booting, and then swapping the drives to try booting again. You can resync the raid arrays trivially after the test. Then you know for sure. I’ve made this a matter of course for any server where the root fs is on RAID1.
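
    A sketch of the re-add step after such a test, using the OP’s partition-to-array mapping and assuming /dev/sdb was the disk that was pulled (a member still marked failed may need an mdadm --remove first):

    mdadm --manage /dev/md1 --add /dev/sdb1
    mdadm --manage /dev/md0 --add /dev/sdb2
    mdadm --manage /dev/md2 --add /dev/sdb3
    mdadm --manage /dev/md3 --add /dev/sdb4
    cat /proc/mdstat     # watch the resync progress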

    -Ben

  • GRUB boots the first partition on a given disk, and then the kernel mounts the file systems from the RAIDs. Once the /boot RAID is mounted, any changes are written to all disks.

  • What is the hd number you give in the setup command? And what if the BIOS doesn’t call the remaining live disk that number after a failure?

  • I myself use the InstallCD or LiveCD to create the RAID partitions I want, and then Anaconda offers to create GRUB on /dev/mdX and creates /boot and GRUB on all disks (I have had 3 and even 4 in a mirror).

    I am not talking about a seamless hardware boot, but a software one. If one HDD is totally dead, then /dev/sdb becomes /dev/sda and so on, but that does not matter since the RAID is assembled from the metadata on the partitions
    (I use separate RAID partitions: a 500MB RAID1, then the rest of the disk is a RAID10,f2 or RAID10,f3 partition), so the kernel line says:

    title CentOS (2.6.32-431.el6.CentOS.plus.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-431.el6.CentOS.plus.x86_64 ro root=/dev/mapper/vg_kancelarija-LV_C6_ROOT rd_LVM_LV=vg_kancelarija/LV_C6_ROOT rd_MD_UUID=8669557:34ca61eb:3c6e8827:d562f15d rd_LVM_LV=vg_kancelarija/LV_C6_SWAP rd_NO_LUKS rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=128M rhgb quiet nouveau.modeset=0 rdblacklist=nouveau
        initrd /initramfs-2.6.32-431.el6.CentOS.plus.x86_64.img

    and device.map:

    # this device map was generated by anaconda
    (hd0) /dev/sda
    (hd2) /dev/sdb
    (hd1) /dev/sdc

    (From system with 3 disks)

    I cannot claim for sure that this works at this point, because my last problem was a long time ago, but as far as I remember I just disconnected the failed disk and booted the system on a single disk until I got a new one as a replacement.
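
    A quick way to cross-check the rd_MD_UUID values in the kernel line above against what the arrays actually report (a sketch; the output formatting varies a bit between mdadm versions):

    mdadm --examine --scan
    mdadm --detail /dev/md0 | grep -i uuid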

  • Based on input from everyone here I am thinking of an alternate setup: a single small, inexpensive 64GB SSD used for /boot, / and swap, with /vz on a software RAID1 array on the two 4TB drives. I can likely just zip-tie the SSD in the 1U case somewhere since I have no more drive bays. Does this seem like a better layout?

  • Les Mikesell wrote:
    Yeah. A question – I’ve missed this discussion – you *do* have the drives partitioned GPT, right? MBR is *not* compatible with > 2TB.

    mark

  • As long as you have a duplicate SSD as backup and regularly back up /boot to that other SSD, it should be OK. Losing the SSD and /boot with new kernels would be a major problem for you; you would need to recreate them.
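
    A sketch of such a periodic copy (the mount point for the spare SSD’s boot partition is hypothetical):

    # run from cron; keeps the spare SSD's copy of /boot in sync
    rsync -a --delete /boot/ /mnt/spare-boot/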

    Ljubomir Ljubojevic
    (Love is in the Air)
    PL Computers Serbia, Europe

    Google is the Mother, Google is the Father, and traceroute is your trusty Spiderman… StarOS, Mikrotik and CentOS/RHEL/Linux consultant

  • I am thinking I will not have much I/O traffic on the SSD, hopefully extending its lifespan. If it dies I will just need to reinstall CentOS on a replacement SSD. My critical files will be in /vz, which I need backed up, and I think RAID1 will give me double the read speed on that array. Trying to balance cost/performance/redundancy here.

  • From: Matt

    Not sure if it is still the case but, if my memory is correct, some time ago RH was advising against using mdraid on SSDs because of the mdraid surface scans or resyncs. I cannot find the source anymore…

    JD

  • The SSD will NOT have RAID on it. He wrote so. And that can take his server offline until /boot is reinstalled.

  • For quite some time (since the 5.x era) I have used http://wiki.CentOS.org/HowTos/Install_On_Partitionable_RAID1
    (with 6.x I don’t even need the patch to mkinitrd).

    The MBR, or whatever it is, is written to /dev/md_d0, and that’s it: in the BIOS you set both HDDs to boot, and if the first one has a problem the second will boot, mail you that you have a degraded RAID, and start a resync after you replace the drive (and you can do it live).
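
    For reference, the degraded-array mail typically comes from the mdmonitor service reading a MAILADDR line in /etc/mdadm.conf; a sketch (the address is a placeholder):

    # /etc/mdadm.conf
    MAILADDR root@example.com

    # one-shot check that the notification path actually works:
    mdadm --monitor --scan --test --oneshot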

    HTH, Adrian

  • I have no idea .. it should .. my use cases at work are the boot drives (all under 500 GB)
    and home (but I have no HDD > 2 TB).

    Basically it is a RAID over a block device, so it does/should not matter what you write into it…

    HTH, Adrian

  • Adrian Sevcenco wrote:
    (all under 500 GB)
    As I noted in a previous post, it’s got to be GPT, not MBR – the latter doesn’t understand > 2TB, and won’t.

    On a related note, what we’ve started doing at work is partitioning our root drives four ways instead of three, as they’re now mostly 2TB drives that we’re putting in: /boot, swap, and /, at 1G, 2G, and 500G, with the rest of the drive as a separate partition. We like protecting /, while leaving more than enough space for logs that suddenly run away. At home, I’ll probably do less for /, perhaps 100G.

    mark

  • There are few reasons why /boot would need to be on a partition/array >= 2TB.

    GRUB doesn’t support software raid levels other than 1 (sort of). It accesses one of the “mirrored” partitions and not the raid array itself. So having large disks and only being able to use raid1 often isn’t optimal either.

    While 100MB was fine in the CentOS 5.x days, 512MB or so is enough with CentOS 6.x to store a few Linux kernels (and other items – initrd, initramfs, etc.).

    Once we have GRUB2, then it could be possible to boot from LVM. But partitioning /boot separate from rootfs is not a big deal.

  • Leon Fauster wrote:

    Perhaps… but fdisk can’t deal with drives > 2TB, and if I’m forced to use parted, I might as well do gpt. I don’t believe I’ve tried MBR, though I’m not sure if I’ve used 3TB for multiple partitions.

    mark

  • it also applies to a 3TB drive with a 3TB partition, such as you typically use with LVM.

    I don’t like using raw disk devices in most cases, as they aren’t labeled, so it’s impossible to figure out what’s on them at some later date.

    Just use the GPT partitioning tools and you’ll be fine. As long as it’s not the boot device, the BIOS doesn’t even need to know about GPT.

    parted /dev/sdb "mklabel gpt"
    parted -a none /dev/sdb "mkpart primary 512s -1s"

    that creates a single /dev/sdb1 partition using the whole drive starting at the 256kB boundary (which should be a nice round place on most RAID, SSD, etc… the defaults are awful; the start sector is at an odd location)

  • John R Pierce wrote:

    Not a fan of that – a lot of the new drives actually use 4k blocks, *not*
    512b, but serve them logically as 512. HOWEVER, you can see a real performance hit. My usual routine is:

    parted -a optimal
    mklabel gpt
    mkpart pri ext4 0.0GB 3001.0GB
    q

    and that aligns them for optimal speed. The 0.0GB will start at 1M – the old start at sector(?) 63 will result in non-optimal alignment, not starting on a cylinder boundary, or something. Note also that parted is user hostile, so you have to know the exact magical incantations, or you get “this is not aligned optimally” but no idea of what it thinks you should do. What I did, above, works.

    mark

  • If I am putting both 4TB drives in a single RAID1 array for /vz would there be any advantage to using LVM on it?

  • My (sometimes unpopular) advice is to set up the partitions on servers into two categories:

    1) OS
    2) Data

    OS partitions don’t really grow much. Most of our servers’ OS partitions total less than 10 GB of used space after years of 24×7 use. I recommend keeping things *very* *simple* here, avoid LVM. I use simple software RAID1 with bare partitions.

    Data partitions, by definition, would be much more flexible. As your service becomes more popular, you can get caught in a double bind that can be very hard to escape: On one hand, you need to add capacity without causing downtime because people are *using* your service extensively, but on the other hand you can’t easily handle a day or so to transfer TBs of data because people are *relying* on your service extensively. To handle these cases you need something that gives you the ability to add capacity without (much) downtime.

    LVM can be very useful here, because you can add/upgrade storage without taking the system offline, and although there *is* some downtime when you have to grow the filesystem (EG when using Ext* file systems) it’s pretty minimal.
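
    A sketch of that grow path, with hypothetical device, VG, and LV names (on CentOS 6, ext4 can usually be grown while mounted, so the downtime is minimal or zero):

    pvcreate /dev/sdd1                        # the newly added disk/partition (hypothetical)
    vgextend vg_data /dev/sdd1
    lvextend -L +1T /dev/vg_data/lv_storage
    resize2fs /dev/vg_data/lv_storage         # grow the ext4 filesystem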

    So I would strongly recommend using something to manage large amounts of data with minimal downtime if/when that becomes a likely scenario.

    Comparing LVM+XFS to ZFS, ZFS wins IMHO. You get all the benefits of LVM
    and the file system, along with the almost magical properties that you can get when you combine them into a single, integrated whole. Some of ZFS’ data integrity features (See RAIDZ) are in “you can do that?”
    territory. The main downside is the slightly higher risk that ZFS on Linux’s “non-native” status can cause problems, though in my case that’s no worry, since we’ll be testing any updates carefully prior to roll-out.

    In any event, realize that any solution like this (LVM + XFS/Ext, ZFS, or BTRFS) will have a significant learning curve. Give yourself *time*
    to understand exactly what you’re working with, and use that time carefully.

  • Absolutely. I have been doing this, without problems, for 5 years. Keeping the two distinct is best, in my opinion.

    /data/……………

  • That’s great advice.. I’ve *across the universe* also sectioned off the
    /home directory and /opt. Not to counter anything here, no sir eee, just to add.. to the sane request from the previous mention…

    It can make the difference sometimes with fast restores and there is a slight performance increase depending on the I/O activity and threads of the actual server and its role. Just saying… Don’t flame.. I’ve been there; plus tax and went down there and brought back the souvenir.. really :)

    At the end of the day, backups are just native commands, pick one: tar, cpio (yeah, still being used) etc. wrapped up in a script/program if you want to be a purist –

    Here’s something: I’ve done before-and-after performance testing with real-time data and user requests with just the ‘basic’ file partitioning and then partitioning the partition
    – really does wonders.. Of course your RAID solution comes into play here, too…. This is with CentOS (whatever Unix-type system). Apple slices up pretty good on their Mac OS – // think FreeBSD combined with NeXT and some other interesting concoction of lovelies… and….

    Oh, there is no counter or ‘ideal’ way to do this.. because why? EVERY
    infrastructure, culture, ‘way we do it around here..’
    dictators are very different — as always, your mileage may /vary/ == SO
    this isn’t a ‘how to’ but a nice, could do…

    Been there, got the,,, oh, I already addressed that. Have fun.. Better than digging a ditch. TASTE GREAT; LESS FILLING

    ~ so,

    /swap
    /OS – whatever you want to call it, I don’t call it OS in Unix/Linux, but that’s fine
    /opt
    /usersHOMEdir

    Pretty clean; simple.. Anyone says different, they’re justifying their job. Nothing to justify here.

    Good call though otherwise. I like it.

    Wizard of Hass!
    Left Coast

  • Paul,

    I forgot to mention that with the ‘unconventional’ slicing of the partitions, it does become unpopular in terms of ‘vendor’ support (if it applies.. ) and also expectations on code installs, etc., where environments are set based on ‘known knowns’ with Linux/UNIX layouts .. and the likes..

    Major chances of failures, etc. So, if insistent on slicing up /DISKS/
    and such (I still believe in it, but I look at install scripts before I do and ‘tweak’ accordingly – and am willing to maintain that), it’s a safe bet.. Most just click on ***.sh and blast away… Problems later..

    It’s funny how little folks actually prepare for an install/setup before going to Prod or the infamous
    ‘golive’ – but that’s how I make money, to come in and fix it.. SO PLEASE
    don’t read the manual ~ and use logic here. lol.

    Wizard of Hass!
    Shazam ~

  • Just ran into this: did a grep on what seemed to be a lightly loaded server, and the load average suddenly spiked unexpectedly. Turns out that it was performing over 130 write ops/second on a 7200 RPM drive! Partitioning data into separate partitions would have had no effect in this case.

  • Exactly. Why would this be an unpopular piece of advice?

    It might even be better to keep the OS by itself on one disk (with /boot, /
    and swap) and have the data on a separate disk.

    Please enlighten me!

  • Our default partitioning scheme for new hosts, whether virtualised or not, looks something like this:

    df
    Filesystem 1K-blocks Used Available Use% Mounted on
    /dev/mapper/vg_inet01b-lv_root 8063408 1811192 5842616 24% /
    tmpfs 1961368 0 1961368 0% /dev/shm
    /dev/vda1 495844 118294 351950 26% /boot
    /dev/mapper/vg_inet01b-lv_tmp 1007896 51800 904896 6% /tmp
    /dev/mapper/vg_inet01b-lv_log 1007896 45084 911612 5% /var/log
    /dev/mapper/vg_inet01b-lv_spool 8063408 150488 7503320 2% /var/spool

    The capacities assigned initially vary based on expected need and available disk. As everything is an lv, expanding volume sizes when required is not exceedingly burdensome. I used to keep / as a non-lv, but for the past few years, since CentOS-5 I think, I have made that an lv as well, and my experience to date has been positive.

    Anything expected to continually increase over time goes under /var as a new lv. For example: On systems with business applications that store transaction files we have a dedicated lv mounted at /var/data/appname (for web apps) or /var/spool/appname (for everything else). On a system hosting an RDBMS we generally give /var/lib or /var/lib/dbmsname its own lv, although on the dedicated dbms hosts we typically just mount all of /var as an lv.
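
    A sketch of adding one of those per-application volumes (the VG name is taken from the df output above; the size and "appname" are placeholders):

    lvcreate -L 20G -n lv_appname vg_inet01b
    mkfs.ext4 /dev/vg_inet01b/lv_appname
    mkdir -p /var/data/appname
    echo '/dev/vg_inet01b/lv_appname /var/data/appname ext4 defaults 1 2' >> /etc/fstab
    mount /var/data/appname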

    This is based on past experience, usually bad, where the root file-system became filled by unmanaged processes (lack of trimming stale files), or DOS
    attacks (log files, generally), or unexpectedly large transaction volumes
    (/var/spool). As all but one of our hosts have no local users besides administrative accounts, /home is left in root. On the remaining host that has local user accounts, /home is an lv as well.

  • ..this is nice, but way more complicated than it needs to be. we’re talking just slices /containers/ and data /files/; at the end of the so-called day everything is a file in *ix – even printers.. whatever.

    … restoring the below would be a challenge for someone not truly intimate with this scheme. it’s great until ‘something goes wrong.. and it will’

    this reminds me of the early Linux days when the man pages were useless to people I would hire.. no examples, just raw statements from propeller-heads. who cares if you geek out all night because you don’t have a date on Friday?

    .. in any event, you never need more than /four/ or /five/ partitions; anything more, you’re just showing off… really.

    folks, I explained the other day all it takes and you’re up and running.

    KEEP IT SIMPLE. this is not complicated stuff, unless you write code in FORTRAN or something sick like that, /enter: assembly.

    about the OS (as you’re calling it, must be a cross-over from Microsoft), that partition SHOULD NEVER
    fill up, if so, fire the SysAdmin because he’s drinking or using the non-drinking kind of Coke… in mass quantities…

    You could EASILY write a script (anyone do that here?) to monitor that /OS file system ~ and send alerts based on thresholds and triggers (notify the monitoring people before they even get notified.. it’s a lot of fun!) – and put it in the crontab – // cron // – get yourself some…
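
    (A minimal sketch of such a cron check; the 90% threshold and mailing to root are assumptions:)

    #!/bin/sh
    # warn when / crosses 90% used; drop it in /etc/cron.hourly or call it from crontab
    USE=$(df -P / | awk 'NR==2 { gsub("%", "", $5); print $5 }')
    [ "$USE" -ge 90 ] && echo "/ is ${USE}% full on $(hostname)" | mail -s "disk space warning" root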

    Better living through monitoring..

    If a file system EVER filled up on my watch, the whole team will be escorted out the door…. with their badges in my hand… geezus. I love these ‘guru’s’ making it complicated.

    The real talent and skill is in keeping it simple. YOU WILL have to restore this some day.. and I’ve said this before, too, YOU MAY and probably will be gone in a few years (if you’re not, you’re really not that good at this game.. as anyone that stays in an IT job more than FIVE years is not growing, except for old…. ) Really?

    Oh, I offer this as real-world, world-wide experience. Pick a country, I’ve done this stuff there.. Probably where you’re at now, even Canada, as they have electricity there now and lots of CLOUDS based on their weather patterns.

    PARTITIONS, one could do…. :

    /swap (you know the rules here; twice your memory and NOT any more, why??? do you breathe more than you need?
    /home (give your users a break; makes restore VERY EASY for them….. and they won’t hate you
    /opt (lots of code gets installed here if you have a real job and use real applications, and are not home alone on Friday night like I think some of these posters are.
    /var (if you want for bonus points
    /socalledDATA, which really could be put in opt.

    Wherever you’re installing *ix code… that file system should have 52
    to 60% of your store.

    I’m just saying…..

    That person that said the OS file system grows and crashes should edit their resume as YOU’RE FIRED!
    MONITOR.

    Oh, those LOG FILES, um, put them on the SAN — yeah, that’s it.

    I hope this helps the people that get it and don’t overly complicate things. It’s just Linux — it was mostly written by ‘one guy…’ in Scandinavia as a class-project — then they all jumped in.

    I remember that post from that infamous creator…..

    Get yourself some simplicity and contribute that… it works. AND it makes you look smarter.

    No one likes an arrogant geek.

    Wizard of Hass!
    Much more than a Linux man

  • You seem to imply something magic is going happen to performance with partitioning. There’s not really much magic in these boxes. You either move the disk head farther more frequently or you don’t. So if your test stays mostly constrained to a small slice of disk that you’ve partitioned you might think your performance is improved. But, that’s only true if the test exactly matches real-world use – that is, in normal operation, the same disk heads won’t frequently be moving to other locations to, for example, write logs.

  • But you shouldn’t have to write – and maintain – that script, because there are monitoring frameworks that are already well written to do it for you.

  • I think the somewhat unpopular part is to recommend *against* using LVM
    for the OS partitions, voting instead to KISS, and only use LVM / Btrfs
    / ZFS for the “data” part. Some people actually think LVM should be used everywhere.

    And for clarity’s sake, I’m not suggesting a literal partition /os and
    /data, simply that there are areas of the filesystem used to store operating system stuff (EG: /bin, /boot, /usr), and areas used to store data (EG: /home, /var, /tmp), etc.

    Keep the OS stuff as simple as possible, RAID1 against bare partitions, etc. because when things go south, the last thing you want is *another*
    thing to worry about.

    Keep the data part such that you can grow as needed without (much)
    downtime. EG: LVM+XFS, ZFS, etc. (And please verify your backups regularly)

    -Ben

  • The division is not at all clean, especially under /var. You’ve got stuff put there by a base OS install mingled with an unpredictable amount of logs and data.

  • Indeed. And it’s not unusual to discover a year after deployment that you need significantly more space in /usr or whatever. I generally use LVM for my boot disk.

  • I’m getting more and more inclined to make the whole systems disposable/replaceable and using VMs for the smaller things instead of micro-managing volume slices. If something is running out of space it probably really needs a partition from a new disk mounted there anyway.

  • Paul Heinlein wrote:

    Eight years ago, I wrote an article for SysAdmin suggesting straight partitions for /boot and root, and LVM for /home, /var, and /usr. These days, I might say RAID 1 for /boot and /, RAID or not for swap, and another raid partition for everything else: home, other data directories….

    At work, we’re going with no more than 500G for /, but I’m thinking a lot less: I just rebuilt my own system at home and gave / 150G, I think, and I have /var there (though I’d put web stuff elsewhere than on /).

    mark

  • Paul Heinlein wrote:

    That’s a *huge* amount of swap – we settled, years ago, on 2G, and I think upstream recommends that. Now, around here, our servers have
    *significantly* more than 2G of RAM, and if we see anything in swap, we know something’s wrong.

    mark

  • We don’t allocate Swap. Instead we have large RAM and put tmp directories etc. on a RAM disk.
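
    A sketch of the corresponding fstab entry for such a RAM-backed /tmp (the size cap is an assumption):

    tmpfs   /tmp   tmpfs   defaults,size=8G   0 0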

  • Oh, btw, one big reason I used to use lvm was that, when I wanted to upgrade to a new release, I’d make or wipe a partition or two and install in that partition… and then I could boot to new or old.

    mark

  • What type of network/filesystem connection between the VMs and storage do you create (NFS, native, separate partitions….)?