How To Replace A RAID Drive With mdadm


Hi all

If we lose a drive in a RAID 10 array (mdadm software RAID), what are the steps needed to correctly do the following:
– identify which physical drive it is
– replace the drive
– add the new drive to the array and force it to re-sync

Thanks in advance

3 thoughts on - How To Replace A RAID Drive With mdadm

  • This is controller dependent. Some controllers support blinking the drive light to identify it; others do not. If yours does not, you need to jury-rig something (e.g., either physically label the drive slot/drive, or send some dummy I/O to the drive to get its light to blink).
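    For example, a burst of reads is usually enough to light up a drive’s activity LED. A minimal sketch, where /dev/sdX stands in for the suspect drive (run as root):

    dd if=/dev/sdX of=/dev/null bs=1M count=256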

    The md part is easy. If md hasn’t failed the drive already, then you need to do that first:

    mdadm /dev/mdN --fail /dev/sdXX

    Then remove it from the array:

    mdadm /dev/mdN --remove /dev/sdXX
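    To confirm the drive is out of the array before you pull anything, checking the array status should work (both are standard interfaces):

    mdadm --detail /dev/mdN
    cat /proc/mdstat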

    The physical part (removing the dead drive and installing the new one) is, again, hardware dependent.

    Once the kernel knows about your new drive, this should work (partition the drive beforehand if needed):

    mdadm /dev/mdN --add /dev/sdYY

    There may be extra parameters for replacing a failed RAID10 drive, but I suspect that md already knows what it needs, so just adding the drive should kick off a rebuild of the failed member.
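    Once the rebuild kicks off, you can watch its progress; a simple way (nothing RAID10-specific here):

    watch cat /proc/mdstat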

  • This can also be inverted, especially if you cannot send data to the drive anymore because it has died completely: create lots of disk I/O with a command like “grep -nri test /usr”, and all drives except the broken one should show activity.

    Another way is to write down the serial numbers of the disks and the slots you put them in, then use hdparm -I /dev/sdX to find which device reports which serial number. That way, once sdX dies, you can check the list to find which slot holds the disk behind the failed device.
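    A quick way to capture that mapping, sketched here with hdparm (smartctl -i would work similarly; run as root):

    for d in /dev/sd?; do
      echo -n "$d: "
      hdparm -I "$d" | grep 'Serial Number'
    done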

    Regards,
    Dennis

  • That’s certainly a good idea. If you have multiple arrays you’d need to send that IO to each array at roughly the same time, but with only one array it’s less difficult. I think the most challenging scenario would be if the array has multiple spares: if the array rebuilds before you can look at it, then you have to generate IO on the array and on the drive(s) that are still spares.

    If you have no active spares (either you started with none, or you had one and it’s been used to replace the dead drive), one way to generate IO is to start a check of the md array (e.g., echo check > /sys/block/mdN/md/sync_action). The drive that doesn’t blink is the dead one.
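    If you want to stop the check early once you’ve identified the drive, writing idle to the same file aborts it (standard md sysfs behavior):

    echo idle > /sys/block/mdN/md/sync_action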

    Physical labelling in this way (or some other way) is still the best solution, as long as you keep the list up to date (and don’t screw up the list, of course). But it’s definitely good to have multiple methods in your toolbox: for example, you might try the IO trick, then cross-check it against your physical labels. Better to take some extra time verifying which drive is dead than to pull the wrong one!

    --keith