Software RAID1 Failure Help


I am running software RAID1 on a somewhat critical server. Today I noticed one drive is giving errors. Good thing I had RAID. I planned on upgrading this server in the next month or so. Just wondering if there is an easy way to fix this so I can avoid rushing the upgrade? Having a single drive is slowing down reads as well, I think.

Thanks.

Feb 7 15:28:28 server smartd[2980]: Device: /dev/sdb [SAT], 1 Currently unreadable (pending) sectors
Feb 7 15:28:28 server smartd[2980]: Device: /dev/sdb [SAT], 1 Offline uncorrectable sectors
Feb 7 15:58:29 server smartd[2980]: Device: /dev/sdb [SAT], 1 Currently unreadable (pending) sectors
Feb 7 15:58:29 server smartd[2980]: Device: /dev/sdb [SAT], 1 Offline uncorrectable sectors

[root@server ~]# smartctl -H /dev/sda

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@server ~]# smartctl -H /dev/sdb

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[root@server ~]# df -h
Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/md2        1.4T  142G  1.2T   12%   /
/dev/md0         99M   37M   58M   39%   /boot
tmpfs           7.9G   20K  7.9G    1%   /dev/shm

[root@server ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md1 : active raid1 sdb2[1] sda2[0]
8385856 blocks [2/2] [UU]

md2 : active raid1 sdb3[2](F) sda3[0]
1456645568 blocks [2/1] [U_]

unused devices: <none>

[root@server ~]# mdadm --detail /dev/md2
/dev/md2:
Version : 0.90
Creation Time : Tue Jan 4 05:39:36 2011
Raid Level : raid1
Array Size : 1456645568 (1389.17 GiB 1491.61 GB)
Used Dev Size : 1456645568 (1389.17 GiB 1491.61 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 2
Persistence : Superblock is persistent

Update Time : Fri Feb 7 15:21:45 2014
State : active, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0

Events : 0.758203

Number Major Minor RaidDevice State
0 8 3 0 active sync /dev/sda3
1 0 0 1 removed

2 8 19 - faulty spare /dev/sdb3

[root@server ~]# mdadm --detail /dev/md1
/dev/md1:
Version : 0.90
Creation Time : Tue Jan 4 05:39:36 2011
Raid Level : raid1
Array Size : 8385856 (8.00 GiB 8.59 GB)
Used Dev Size : 8385856 (8.00 GiB 8.59 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 1
Persistence : Superblock is persistent

Update Time : Fri Feb 7 14:29:36 2014
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0

Events : 0.460

Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
1 8 18 1 active sync /dev/sdb2

[root@server ~]# mdadm --detail /dev/md0
/dev/md0:
Version : 0.90
Creation Time : Tue Jan 4 05:48:17 2011
Raid Level : raid1
Array Size : 104320 (101.89 MiB 106.82 MB)
Used Dev Size : 104320 (101.89 MiB 106.82 MB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Wed Feb 5 11:02:25 2014
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0

Events : 0.460

Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1

8 thoughts on - Software RAID1 Failure Help

  • Sure, replace the bad drive and rebuild the mirror. Or add 2 drives, using one to replace the bad one and the other as a hot spare. Then if one of the 2 drives in the mirror fails again, the hot spare will take over for it.
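
    A minimal sketch of the hot-spare part, assuming the extra drive shows up as /dev/sdc and gets the same partition layout as /dev/sda (the device name is an assumption, adjust for your system):

    # clone the partition table from the healthy drive onto the new disk
    sfdisk -d /dev/sda | sfdisk /dev/sdc
    # add the matching partitions: on a healthy mirror they sit as hot
    # spares, on a degraded one md starts rebuilding onto them right away
    mdadm /dev/md0 -a /dev/sdc1
    mdadm /dev/md1 -a /dev/sdc2
    mdadm /dev/md2 -a /dev/sdc3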

    Chris

  • One thing that's always bugged me about md raid when it's mirrored per partition… say you have /dev/sda{1,2,3} mirrored with
    /dev/sdb{1,2,3}… if /dev/sdb2 goes south with too many errors and mdraid fails it, will it know that b{1,3} are also on the same physical drive and should be failed, or will it wait until it gets errors on them, too?

  • Maybe it is slowing things down, but I would recommend you fix your RAID1
    mirror to avoid losing all your data.

    Hopefully the information below helps you…

    If you have hotswap drives/caddies, then you should be able to replace the drive while the server continues running. First, hot fail and hot remove [0] all RAID members on that drive (/dev/sdb) from any software RAID arrays you have. Next, remove the drive from the SCSI subsystem [1]. Then physically swap the drive for a healthy one and make the OS detect the new drive [2]. From there, you can use sfdisk to clone the partition structure from the working drive to the new one. Then add the new partitions to your software RAID arrays (and watch
    /proc/mdstat as it rebuilds). The commands are below, including a rough sysfs sketch for the SCSI steps.

    -f or --fail
    -r or --remove
    -a or --add

    # fail the bad member, then remove it from the array
    mdadm /dev/mdX -f /dev/sdbY
    mdadm /dev/mdX -r /dev/sdbY

    # once the replacement drive is in, clone the partition table
    # from the healthy drive onto it
    sfdisk -d /dev/sda | sfdisk /dev/sdb

    # add the new partitions back into each array
    mdadm /dev/mdX -a /dev/sdbY

    # and watch the rebuild
    watch /proc/mdstat
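
    For steps [1] and [2] (between the -r and the sfdisk above), a rough sketch of the sysfs interface those guides describe; the SCSI host number here is an assumption, check which host the drive actually hangs off first:

    # tell the kernel to drop the old /dev/sdb before pulling it
    echo 1 > /sys/block/sdb/device/delete
    # after inserting the replacement, rescan the controller so the OS
    # detects the new drive
    echo "- - -" > /sys/class/scsi_host/host0/scan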

    [0] http://www.ducea.com/2009/03/08/mdadm-cheat-sheet/
    [1] https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Online_Storage_Reconfiguration_Guide/removing_devices.html
    [2] https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/adding_storage-device-or-path.html

  • I use RAID per partition, and have had several disks fail. It treats them separately. Only if the whole disk disappears will mdadm report all of them as failed. If only one partition has failed, it only fails that one array and will rebuild it as soon as it can (see the sketch after the sample mail below for failing the rest of the drive by hand).

    Here is mail sent on error:

    This is an automatically generated mail message from mdadm running on vmaster.xxx

    A Fail event had been detected on md device /dev/md0.

    It could be related to component device /dev/sdb2.

    Faithfully yours, etc.

    P.S. The /proc/mdstat file currently contains the following:

    Personalities : [raid1]
    md0 : active raid1 sdb2[2](F) sda2[1]
    488279488 blocks [2/1] [_U]

    unused devices: <none>
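
    And if you decide to pull the whole drive anyway, a rough sketch of failing and removing the sibling members by hand first, using the original poster's layout (sdb3 is already marked faulty in md2; adjust names for your own setup):

    # mark the still-healthy members on the dying drive as failed, then remove them
    mdadm /dev/md0 -f /dev/sdb1 -r /dev/sdb1
    mdadm /dev/md1 -f /dev/sdb2 -r /dev/sdb2
    # the already-faulty member only needs removing
    mdadm /dev/md2 -r /dev/sdb3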

  • This depends upon how the RAID is set up.

    A standard Linux RAID1 setup does not give better read performance than a single disk when reading large files.

    I don’t know if the RAID system is clever enough to save some seek time.

    In order to get better read performance you’ll have to set it up as RAID10 with far copies.
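
    Roughly something like this at creation time (the md device and partitions are placeholders, not part of the setup above):

    # two-disk RAID10 with the "far 2" layout, which spreads the copies so
    # large sequential reads can be served from both disks
    mdadm --create /dev/md3 --level=10 --layout=f2 --raid-devices=2 /dev/sdX1 /dev/sdY1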

    Mogens

  • Maybe, if your system isn’t doing anything but reading that one file…

    I don’t think it is particularly smart, but it can alternate reads between drives so the heads can be seeking to different places simultaneously. Of course intermingled writes will force the heads to the same place on both, though.

  • No, mdraid 1 is mdraid 1.

    Process X utilizes only a single disk, so there is no performance gain. But if you have 2 processes running in parallel, then you potentially have a gain, because process Y can read from the other disk.

    process X -> disk 0
    process Y -> disk 1

    Yes, mdraid 10 could be a solution for the “1 process should utilize more than one disk” goal. I haven’t tried it though, what a shame.

    Why is that far copies thing important?