Replacing SW RAID-1 With SSD RAID-1


Hi,

I want to replace my hard drives based SW RAID-1 with SSD’s.

What would be the recommended procedure? Can I just remove one drive, replace with SSD and rebuild, then repeat with the other drive?

Thanks Frank

23 thoughts on - Replacing SW RAID-1 With SSD RAID-1

  • I suggest you “mdadm --fail” one drive, then “mdadm --remove” it. After replacing the drive you can “mdadm --add” it.

    If you boot from these drives you also have to take care of the boot loader. I
    guess this depends on how exactly the system is configured.
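
    For illustration, assuming the array is /dev/md0 and the drive being swapped out holds /dev/sda1 (adjust names to your layout), one iteration might look like:

    mdadm /dev/md0 --fail /dev/sda1
    mdadm /dev/md0 --remove /dev/sda1
    # power down, swap the HDD for the SSD, partition it the same way,
    # then add the new partition back and watch the rebuild:
    mdadm /dev/md0 --add /dev/sda1
    cat /proc/mdstat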

    Regards, Simon

  • Thanks, that’s what I had in mind. Of course, I will rebuild grub2 after each iteration.
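
    Something like this, just a sketch, assuming BIOS/MBR boot and that the replaced disk shows up as /dev/sda (an EFI setup is handled differently):

    grub2-install --recheck /dev/sda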

    Thanks Fra

  • You could also grow the array to add in the new devices before removing the old HDDs ensuring you retain at least 2 devices in the array at any one time. For example, in an existing raid of sda1 and sdb1, add in sdc1
    before removing sda1 and add sdd1 before removing sdb1, finally shrinking the array back to 2 devices:

    mdadm --grow /dev/md127 --level=1 --raid-devices=3 --add /dev/sdc1
    mdadm --fail /dev/md127 /dev/sda1
    mdadm --remove /dev/md127 /dev/sda1
    mdadm /dev/md127 --add /dev/sdd1
    mdadm --fail /dev/md127 /dev/sdb1
    mdadm --remove /dev/md127 /dev/sdb1
    mdadm --grow /dev/md127 --raid-devices=2

    then reinstall grub to sdc and sdd once everything has fully sync’d:

    blockdev --flushbufs /dev/sdc1
    blockdev --flushbufs /dev/sdd1
    grub2-install --recheck /dev/sdc
    grub2-install --recheck /dev/sdd
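
    To confirm the array has finished syncing before pulling the old drives, something like this can be used (md127 as in the example above):

    cat /proc/mdstat
    mdadm --wait /dev/md127
    mdadm --detail /dev/md127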

  • then grow, add, wait, fail, remove, shrink. That way you will never lose redundancy…

    # grow and add new disk
    mdadm --grow -n 3 /dev/mdX -a /dev/…

    # wait for rebuild of the array
    mdadm --wait /dev/mdX

    # fail old disk
    mdadm /dev/mdX --fail /dev/sdY

    # remove old disk
    mdadm /dev/mdX --remove /dev/sdY

    # add second new disk
    mdadm /dev/mdX --add /dev/…

    # wait
    mdadm --wait /dev/mdX

    # fail and remove second old disk
    mdadm /dev/mdX --fail /dev/sdZ
    mdadm /dev/mdX --remove /dev/sdZ

    # shrink
    mdadm --grow -n 2 /dev/mdX

    peter

  • You do have a recent backup available anyway, don’t you? That is: even without planning to replace disks. And testing such strategies/sequences using loopback devices is definitely a good idea to get used to the machinery…
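
    A rough sketch of such a dry run with loop devices (file and device names are only examples):

    truncate -s 512M disk1.img disk2.img disk3.img
    losetup -f --show disk1.img    # e.g. /dev/loop0
    losetup -f --show disk2.img    # e.g. /dev/loop1
    losetup -f --show disk3.img    # e.g. /dev/loop2
    mdadm --create /dev/md100 --level=1 --raid-devices=2 /dev/loop0 /dev/loop1
    # now rehearse the grow/add/fail/remove/shrink sequence using /dev/loop2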

    On a side note: I have had a fair number of drives die on me during a RAID rebuild, so I would try to avoid (if at all possible) deliberately reducing redundancy just for a drive swap. I have never had a problem (yet) caused by the RAID-1 kernel code itself. And: if you have to change a disk because it already has issues, it may be dangerous to do a backup first – especially a file-based backup – because the random access pattern may make things worse. Been there, done that…

    peter

  • Sure, and for large disks I go even further: don’t put the whole disk into one RAID device but build multiple segments, e.g. create 6 partitions of the same size on each disk and build six RAID1s out of them. That way, if there is an issue with one disk in one segment, you don’t lose redundancy for the whole big disk. You can even keep spare segments on separate disks to help in cases where you cannot quickly replace a broken disk. The whole handling is still very easy with LVM on top.
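
    A rough illustration of what I mean (device names, segment count and sizes are only examples):

    # identical partitions on both disks, each pair its own small RAID1
    mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
    # ... and so on for the remaining segments, then LVM on top:
    pvcreate /dev/md10 /dev/md11
    vgcreate vg_data /dev/md10 /dev/md11
    lvcreate -L 100G -n lv_home vg_data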

    Regards, Simon

  • Oh boy, what a mess this will create! I have inherited a machine which was set up by someone with software RAID like that. You need to replace one drive, and the other RAIDs in which that drive’s other partitions participate are affected too.

    Now imagine that at some moment you have several RAIDs, each of them no longer redundant, but in each it is a partition from a different drive that has been kicked out. Now you are stuck, unable to remove any of the failed drives: removing any one of them will trash one or another RAID (which is already not redundant). I guess the guy who left me with this setup listened to advice like the one you just gave. What a pain it is to deal with any drive failure on this machine!!

    It is known since forever: The most robust setup is the simplest one.

    One can do a lot of fancy things, splitting things on one layer, then joining them back on another (by introducing LVM)… But I want to repeat it again:

    The most robust setup is the simplest one.

    Valeri

  • Does it make sense to dd or ddrescue from the removed drive to the replacement? My md RAID set is on primary partitions, not raw drives, so I’m assuming the replacement drive needs at least the boot sector from the old drive to copy the partition data.
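
    (I could presumably copy just the partition table instead of dd’ing everything – something like this, assuming MBR partitions with the old drive as /dev/sda and the new one as /dev/sdb:)

    sfdisk -d /dev/sda > sda-partitions.dump
    sfdisk /dev/sdb < sda-partitions.dump
    # the md rebuild then takes care of the data; only the boot loader
    # would still need to be reinstalled (grub2-install) on the new drive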

  • I used to do something like this (though there isn’t enough detail above for me to be sure we are talking about the same thing). On older disks, having the RAID split over 4 disks with / /var /usr /home allowed for longer redundancy, because drive 1 could have a ‘failed’ /usr while drives 0,2,3,4 were ok, and the rest all worked in full mode because /, /var and /home were all good. This was because most of the data on /usr would be in a straight run on each disk.

    The problem is that a lot of modern disks do not guarantee that data for any partition will really be next to each other on the disk. Even before SSDs did this for wear leveling, a lot of disks did it because it was easier to let the full OS running on the ARM chip in the drive do all the ‘map this sector the user wants to this sector on the disk’ work, in whatever way makes sense for the type of magnetic media inside. There is also a lot of silent rewriting going on: the real capacity of a drive can be 10-20% bigger, with those spare sectors slowly used up as failures happen in other areas. When you start seeing errors, it means the drive has no safe sectors left and has probably written /usr all over the disk in order to keep going as long as it could; the rest of the partitions will start failing very quickly afterwards.

    Not all disks do this, but a good many of them do, from commercial SAS to commodity SATA… and a lot of the ‘Red’ and ‘Black’ NAS drives are doing this too.

    While I still use partition segments to spread things out, I no longer do so for failure handling. And if what I was doing isn’t what the original poster meant, I look forward to learning it.

  • I understand that, I also like keeping things simple (KISS).

    Now, in my own experience with these multi-terabyte drives today, in 95% of the cases where you get a problem it is a single block that cannot be read correctly. A single write to that sector makes the drive remap it and the problem is solved. That’s where a simple resync of the affected RAID segment is the fix. If a drive happens to produce such a condition once a year, there is absolutely no reason to replace it; just trigger the remapping of the bad sector and the drive will remember it in its internal bad-sector map. This happens all the time without the OS ever seeing an error, as long as the drive can still read and reconstruct the correct data.
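
    A minimal sketch of triggering such a resync on one affected segment (md3 is just an example name):

    echo check > /sys/block/md3/md/sync_action   # read-verify; rewrites from the good mirror on read errors
    cat /proc/mdstat                             # watch progress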

    In the 5% of cases where a drive really fails completely and needs replacement, you have to resync the 10 RAID segments, yes. I usually do that with a small script and it doesn’t take more than a few minutes.

    The good thing is that LVM has been so stable for so many years that I don’t think twice about this one extra layer. Why is a layered approach worse than a fully integrated solution like ZFS? The tools differ, but some complexity always remains.

    That’s how I see it, Simon

  • I don’t do it the same way on every system. But on large multi-TB systems with 4+ drives, segmented RAID has helped very often. There is one more thing: I always try to keep spare segments. When a problem shows up, the first thing is to pvmove the broken RAID’s data to wherever there is free space. One command and a few minutes later, the system is fully redundant again. LVM is really nice for such things, as you can move filesystems around as long as they share the same VG. I also use LVM to optimize storage by moving things to faster or slower disks after adding or replacing storage.
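
    Roughly, assuming the degraded RAID segment is the PV /dev/md12 and the VG (here called vg_data) has enough free space elsewhere:

    pvmove /dev/md12
    # once its extents are moved off, the segment can be dropped from the VG
    vgreduce vg_data /dev/md12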

    Regards, Simon

  • It is one story if you administer one home server. It is quite different if you administer a couple of hundred of them, like I do. Just the 2-3 machines set up in the disastrous manner I described each suck 10-20 times more of my time than any other machine – the ones I chose the hardware for and set up myself. When that is your situation, you are entitled to say what I said.

    Hence the attitude.

    Keep things simple, so they do not suck up your time – if you do it for a living.

    But if it is a hobby of yours – one that takes all your time and gives you pleasure just to fiddle with it – then it’s your time and your pleasure; do it the way that gets you more of it ;-)

    Valeri

  • zpool create newpool mirror sdb sdc mirror sdd sde mirror sdf sdg mirror sdh sdi spare sdj sdk
    zfs create -o mountpoint=/var/lib/pgsql-11 newpool/postgres11

    and done.
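
    To verify the four mirror pairs and the hot spares afterwards:

    zpool status newpool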

  • This *might* be a valid answer if zfs was supported on plain CentOS…
    (and if the question hadn’t involved an existing RAID ;-) ). Or did I
    miss something?

    peter

  • Your assumptions about my work environment are quite wrong.

    It was a hobby 35 years ago coding in assembler and designing PCBs for computer extensions.

    Simon

  • Great, then you are much mightier than I am at quickly managing something set up in a very sophisticated way. That is amazing: managing sophisticated things as fast as managing simple, straightforward things ;-)

    I also noticed one more sophistication of yours: you always strip off the name of the poster you reply to. ;-)

    Oh, great, we are of the same kind. I designed electronics and made PCBs both as a hobby and for a living, and I still do it as a hobby. I also did programming both as a hobby and for a living. The funniest part: for a single-board Z-80 based computer I wrote an assembler, a disassembler, and an emulator (which emulated what that Z-80 would do when running some program). I did it on a Wang 2200 (actually a replica of one), and I programmed it, believe it or not, in BASIC. That was the only language available to us on that machine – an ugly, simple interpreted language with all variables global…

    But now I’m a sysadmin. And – for me at least – the simplest possible setup is the one that will be most robust. And it will be the easiest and fastest to maintain (both for me and for someone else if they step in to do it instead of me).

    Valeri

  • Just one reason is that you lose visibility of lower-level elements from the top level.

    You gave the example of a bad block in a RAID. What current RHEL type systems can’t tell you when that happens is which file is affected.

    ZFS can not only tell you that; deleting or replacing the file will fix the array. That’s the bottom-most layer (the disk surface) telling the top-most layer (userspace) that there’s a problem, and userspace fixing it by telling the bottom-most layer to check again.
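
    For example, a scrub followed by a verbose status will list any files with unrecoverable errors (the pool name is only an example):

    zpool scrub tank
    zpool status -v tank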

    Because ZFS is CoW, this isn’t forcing the drive to rewrite that sector; a new set of sectors is brought into use and the old ones are released. The old sectors aren’t touched again until the filesystem reassigns them.

    Red Hat is attempting to fix all this with Stratis, but it’s looking to take years and years for them to get there. ZFS is ready today.

    In my experience, ZFS hides a lot of complexity, and it is exceedingly rare to need to take a peek behind the curtains.

    (And if you do, there’s the zdb command.)

  • I disagree.

    It is ready today only if you are willing to abandon Linux entirely and switch to BSD, or to run a Linux distro like Ubuntu that is possibly violating a license. Third-party repositories that use dkms can be dangerous for a storage service, and I’d prefer to keep compilers off my servers.

    I’m not willing to move away from CentOS and am ethically bound not to violate the GPL. I would say that ZFS will not be ready for Linux unless the project can fix its license.

    At least with Stratis there’s an attempt to work within the Linux world. I’m excited to see Fedora making btrfs the default root filesystem, too.

  • Same setup I’ve been using for at least 15 years. Just pick a standard partition size and keep using it (or multiples of it, e.g. 256GiB, then 512GiB, then 1024GiB), so as to keep the numbers down.

    Best regards.

  • Thanks for sharing! Interesting to hear that some people did the same or similar things as I did without knowing of each other.

    IIRC I initially started doing this when I got a server with different disk sizes and different paths to the disks. Think of some 18G disks, some 36G, some 73G and also some 146G. Now, if you have to make the storage redundant against disk failures and also against single path failures, you get creative about how to cut the larger disks into slices and spread the mirror pairs over the paths.

    It proved to be quite flexible in the end and still allowed extending the storage without any downtime. Needless to say, the expensive hardware RAID controllers were removed from the box and replaced by simple SCSI controllers – because the hardware just couldn’t do what was required here.

    Regards, Simon