How To Replace A RAID Drive With mdadm


Hi all

If we lose a drive in a RAID 10 array (mdadm software RAID), what are the steps needed to correctly do the following:
– identify which physical drive it is
– replace the drive
– add the new drive to the array and force it to re-sync

Thanks in advance

3 thoughts on - How To Replace A RAID Drive With mdadm

  • This is controller dependent. Some controllers support blinking the drive light to identify it; others do not. If yours does not, you need to jury-rig something (e.g., either physically label the drive slot/drive, or send some dummy I/O to the drive to get its light to blink).
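    For example, a burst of reads is usually enough to light up a drive’s activity LED. A minimal sketch, where /dev/sdX stands in for the suspect drive (run as root):

    dd if=/dev/sdX of=/dev/null bs=1M count=256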

    The md part is easy. If md hasn’t failed the drive already, then you need to do that first:

    mdadm /dev/mdN --fail /dev/sdXX

    Then remove it from the array:

    mdadm /dev/mdN --remove /dev/sdXX
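    To confirm the drive is out of the array before you pull anything, checking the array status should work (both are standard interfaces):

    mdadm --detail /dev/mdN
    cat /proc/mdstat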

    The physical part (removing the dead drive and installing the new one) is, again, hardware dependent.

    Once the kernel knows about your new drive, this should work (partition the drive beforehand if needed):

    mdadm /dev/mdN --add /dev/sdYY

    There may be extra parameters for replacing a failed RAID10 drive, but I suspect that md already knows what it needs, so just adding the drive should kick off a rebuild of the failed member.
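    Once the rebuild kicks off, you can watch its progress; a simple way (nothing RAID10-specific here):

    watch cat /proc/mdstat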

  • This can also be inverted, especially if you cannot send data to the drive anymore because it has died completely: create lots of disk I/O with a command like “grep -nri test /usr”, and all drives except the broken one should show activity.

    Another way is to write down the serial numbers of the disks and the slots you put them in, then use hdparm -I /dev/sdX to find which device reports which serial number. That way, once sdX dies, you can check the list to find which slot holds the disk behind the failed device.
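    A quick way to capture that mapping, sketched here with hdparm (smartctl -i would work similarly; run as root):

    for d in /dev/sd?; do
      echo -n "$d: "
      hdparm -I "$d" | grep 'Serial Number'
    done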

    Regards,
    Dennis

  • That’s certainly a good idea. If you have multiple arrays you’d need to send that IO to each array at roughly the same time, but with only one array it’s less difficult. I think the most challenging scenario would be if the array has multiple spares: if the array rebuilds before you can look at it, then you have to generate IO on the array and on the drive(s) that are still spares.

    If you have no active spares (either you started with none, or you had one and it’s been used to replace the dead drive), one way to generate IO is to start a check of the md array (e.g., echo check > /sys/block/mdN/md/sync_action). The drive that doesn’t blink is the dead one.
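    If you want to stop the check early once you’ve identified the drive, writing idle to the same file aborts it (standard md sysfs behavior):

    echo idle > /sys/block/mdN/md/sync_action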

    Physical labelling in this way (or some other way) is still the best solution, as long as you keep the list up to date (and don’t screw up the list, of course). But it’s definitely good to have multiple methods in your toolbox: for example, you might try the IO trick, then cross-check it against your physical labels. Better to take some extra time verifying which drive is dead than to pull the wrong one!

    --keith