Problem With Softwareraid


Hello all,

I have already discussed this on the software RAID mailing list and would like to continue here :)

I am having a really strange problem with my md0 device on CentOS 7. After restarting my server, md0 was gone. While trying to track down the problem I found the following:

Booting any installed kernel gives me NO md0 device ('ls /dev/md*' doesn't return anything). 'cat /proc/partitions' now shows the /dev/sd[a-d]1 partitions. partprobe and an mdadm assemble both give me "disk busy":

[root@quad live]# cat mdstat
Personalities : [raid6] [raid5] [raid4] [raid10]
unused devices:

[root@quad ~]# partprobe
device-mapper: remove ioctl on WDC_WD20EFRX-68AX9N0_WD-WMC301255087p1 failed: Device or resource busy
Warning: parted was unable to re-read the partition table on /dev/mapper/WDC_WD20EFRX-68AX9N0_WD-WMC301255087 (Device or resource busy). This means Linux won’t know anything about the modifications you made.

[root@quad ~]# mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: /dev/sda1 is busy - skipping
mdadm: /dev/sdb1 is busy - skipping
mdadm: /dev/sdc1 is busy - skipping
mdadm: /dev/sdd1 is busy - skipping

When I boot from a USB stick to rescue my CentOS, everything works: the md0 device exists and is mounted (rw).

[root@quad usb-rescue]# cat mount | grep '/data'
/dev/mapper/data-store on /mnt/sysimage/store type xfs
/dev/mapper/data-tm on /mnt/sysimage/var/lib/vdr/video type xfs

3rd option: booting the installed rescue kernel from disk.
I get an md0 device, but it is not started. When I stop md0, I can't assemble it anymore (disk busy).

Version : 1.2
Creation Time : Wed Aug 20 19:28:52 2014
Raid Level : raid5
Used Dev Size : 1953382272 (1862.89 GiB 2000.26 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent

Update Time : Thu Aug 17 22:38:14 2017
State : active, Not Started
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 128K

Name : (local to host
UUID : 9d020f27:c0542472:b95a18d2:5741114d
Events : 25458

Number  Major  Minor  RaidDevice  State
0       8      1      0           active sync  /dev/sda1
1       8      17     1           active sync  /dev/sdb1
2       8      33     2           active sync  /dev/sdc1
4       8      49     3           active sync  /dev/sdd1

Has anyone got an idea in which direction the problem could lie? Are more logs needed? Please help, I am out of ideas.

regards Andy

6 thoughts on - Problem With Softwareraid

  • 18. Aug 2017 13:35 by

    Are you definitely using cables rated for SATA III? Have you checked the power connections? Have you checked the power supply voltages during spin-up and later?

    Is there tension or major twisting forces on the sata cables?   I’ve seen this cause intermittent problems and was solved by using a longer cable that reduced the stress at the connector.

    Are the drives getting hot (your model shouldn’t have a heat issue under normal conditions)? Are the drives bolted into a system? Drives can be sensitive to vibration, and identical, unmounted drives will tend to shake each other and can produce rotational torque as well (especially when they are the same model, as they’ll all have the same resonances in that case). Either can cause problems with keeping the heads over the track reliably.

    I’d definitely run all the SMART tests. Start with the conveyance test, then the short self-test, and possibly the long test. Do check the drive temperatures immediately after each test to make sure they aren’t getting too hot.

    I assume you’ve done an fsck on the file systems?  If not it might be good to check.

    Are you using the motherboard’s SATA interfaces or an add-on card? If using a card, I’d check the firmware version on the card and what the manufacturer is offering for updates.

    Are the drives still under warranty? If so, try WD tech support. Also check that all the RAID tools are properly installed with their dependencies met. It could be other hardware/drivers interfering. You might reset the BIOS to “optimized settings”. Which software RAID package are you using?

    Other than that I’d possibly suspect a software problem. I'm not familiar with software RAIDs myself (I haven’t used one, but I know what they are). Or possibly a problem with the drive that is intermittent or complex in how it fails.
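
The SMART checks suggested in this comment can be run with smartctl from smartmontools. As a rough sketch (not something anyone in the thread posted), the helper below pulls the drive temperature out of 'smartctl -A' output. The function name drive_temp is made up for illustration, and it assumes the usual ten-column ATA attribute table with the temperature reported as attribute 194.

```shell
#!/bin/sh
# Print the raw value of SMART attribute 194 (Temperature_Celsius) from
# `smartctl -A` output read on stdin.
# Assumption: the standard ten-column ATA attribute table; drives that
# report temperature under a different attribute need a different ID.
drive_temp() {
    awk '$1 == 194 { print $10; exit }'
}

# Typical use on a live system (needs root and smartmontools):
#   smartctl -t conveyance /dev/sda   # later: -t short, then -t long
#   smartctl -A /dev/sda | drive_temp
```

Reading from stdin keeps the parsing step separate from the smartctl call, so it can be tried on saved output before touching the drives.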

  • Hello Gordon,

    Yeah, it is really strange. From one boot to the next, everything is f** up (2 months between).

    any idea?

    [root@quad live]# lsblk
    NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    sda 8:0 0 1.8T 0 disk
    ├─sda1 8:1 0 1.8T 0 part
    └─WDC_WD20EFRX-68AX9N0_WD-WMC1T2547260 253:3 0 1.8T 0 mpath
      └─WDC_WD20EFRX-68AX9N0_WD-WMC1T2547260p1 253:8 0 1.8T 0 part
    sdb 8:16 0 1.8T 0 disk
    ├─sdb1 8:17 0 1.8T 0 part
    └─WDC_WD20EFRX-68AX9N0_WD-WMC301255087 253:4 0 1.8T 0 mpath
      └─WDC_WD20EFRX-68AX9N0_WD-WMC301255087p1 253:9 0 1.8T 0 part
    sdc 8:32 0 1.8T 0 disk
    ├─sdc1 8:33 0 1.8T 0 part
    └─WDC_WD20EFRX-68EUZN0_WD-WCC4M2668622 253:5 0 1.8T 0 mpath
      └─WDC_WD20EFRX-68EUZN0_WD-WCC4M2668622p1 253:7 0 1.8T 0 part
    sdd 8:48 0 1.8T 0 disk
    ├─sdd1 8:49 0 1.8T 0 part
    └─WDC_WD20EFRX-68EUZN0_WD-WMC4M2878723 253:2 0 1.8T 0 mpath
      └─WDC_WD20EFRX-68EUZN0_WD-WMC4M2878723p1 253:6 0 1.8T 0 part
    sde 8:64 0 119.2G 0 disk
    ├─sde1 8:65 0 500M 0 part /boot
    └─sde2 8:66 0 118.8G 0 part
      ├─CentOS-swap 253:0 0 2G 0 lvm [SWAP]
      ├─CentOS-root 253:1 0 50G 0 lvm /
      └─CentOS-home 253:10 0 66.8G 0 lvm /home

    Yeah. The setup has been running for years now. As I said: booting from the USB stick -> everything works.

    Nope. Checked, and it works.

    Nope. The issue first arose after the box had been down for several hours. The box is in my cellar, so in a good environment.

    Output of the test after my reply.

    No, I did not. I am running XFS, and the filesystem is not corrupt, so no repair is needed. I can access the data when booting from USB.

    Motherboard SATA. HP MicroServer Gen8.

    mdadm has no dependencies, but I reinstalled the package. version

    A software problem sounds great. I would like to find out why it's not working. I could reinstall the complete box, but that is not my intention: it takes lots of time and I would not learn anything new :)

    regards Andy

  • You haven’t said anything about multipath hardware yet, and you’ve been referring to “sda1”, etc, which makes me think that you probably don’t have multipath hardware.

    If that’s true, then the problem is probably that someone installed the multipath software on this system after the last time it booted successfully. One fix could be to boot from install media and use rescue mode to get a shell. Inside the rescue environment, remove the multipath software.

    If you did have multipath hardware, you’d be assembling the multipath targets, like WDC_WD20EFRX-68AX9N0_WD-WMC1T2547260p1, rather than sda1.
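
One way to check this diagnosis on a live system: when multipath claims a disk, a device-mapper map appears under /sys/block/<disk>/holders, and that map's dm/uuid starts with "mpath-". The sketch below walks that part of sysfs and prints each claimed disk with its dm device. The function name mpath_holders and the optional sysfs-root argument are made up for illustration and do not come from the thread.

```shell
#!/bin/sh
# List whole sd* disks that are held by a device-mapper multipath map.
# Prints one "<disk> <dm-device>" pair per line.
# The sysfs root is a parameter (default /sys) so the function can be
# tried against a copied or fake tree.
mpath_holders() {
    sysfs="${1:-/sys}"
    for holder in "$sysfs"/block/sd*/holders/dm-*; do
        [ -e "$holder" ] || continue        # glob matched nothing
        disk=${holder%/holders/*}           # .../block/sda
        disk=${disk##*/}                    # sda
        dm=${holder##*/}                    # dm-3
        uuid_file="$sysfs/block/$dm/dm/uuid"
        [ -r "$uuid_file" ] || continue
        uuid=$(cat "$uuid_file")
        case "$uuid" in
            mpath-*) echo "$disk $dm" ;;    # multipath maps use this prefix
        esac
    done
}

# Typical use:
#   mpath_holders
```

On a box in the state shown by the lsblk output earlier, this would be expected to list sda through sdd, which would explain why mdadm reports their partitions as busy.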

    Hello Gordon,

    thank you for the tip. I had an eye on multipathd during my debugging, but I ignored it because I have had it installed for years now (and stopping the service still gave me the device busy stuff). I assume that another RPM enabled the multipath service and that this was fiddling around.

    Disabling the multipath stuff helped.

    thank you for your hint again!

    You've got my kudos :)

    regards Andreas
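
For later readers: the thread does not say exactly how multipath was disabled. The following is only a sketch of one plausible sequence on CentOS 7, built from standard systemctl, multipath and mdadm commands. The wrapper names run and disable_multipath_and_assemble are hypothetical, and by default the script only prints the commands (DRY_RUN=1), so nothing changes until they have been reviewed.

```shell
#!/bin/sh
# Sketch only: the poster did not say which commands were actually used.
# With DRY_RUN=1 (the default) each command is printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

disable_multipath_and_assemble() {
    run systemctl stop multipathd &&      # stop the daemon
    run systemctl disable multipathd &&   # keep it from starting at boot
    run multipath -F &&                   # flush unused multipath maps
    run mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
}

# Review first:
#   disable_multipath_and_assemble
# Then, as root, actually run it:
#   DRY_RUN=0 disable_multipath_and_assemble
```

Flushing the maps is included because, as reported above, stopping the service alone still left the devices busy; existing maps stay in place until they are removed.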