Looking For A Life-saving LVM Guru


Dear All,

I am in desperate need of LVM data rescue for my server. I have a VG called vg_hosting consisting of 4 PVs, each on a separate hard drive (/dev/sda1, /dev/sdb1, /dev/sdc1, and /dev/sdd1). An LV, lv_home, was created to use all the space of the 4 PVs.

Right now, the third hard drive is damaged, and therefore the third PV (/dev/sdc1) cannot be accessed anymore. I would like to recover whatever is left on the other 3 PVs (/dev/sda1, /dev/sdb1, and /dev/sdd1).

I have tried with the following:

1. Removing the broken PV:

# vgreduce --force vg_hosting /dev/sdc1
Physical volume "/dev/sdc1" still in use

# pvmove /dev/sdc1
No extents available for allocation

2. Replacing the broken PV:

I was able to create a new PV and restore the VG Config/meta data:

# pvcreate --restorefile … --uuid … /dev/sdc1
# vgcfgrestore --file … vg_hosting

However, vgchange would give this error:

# vgchange -a y
device-mapper: resume ioctl on failed: Invalid argument
Unable to resume vg_hosting-lv_home (253:4)
0 logical volume(s) in volume group "vg_hosting" now active

Could someone help me, please?
I'm in dire need of help to save the data, or at least some of it if possible.

Regards, Khem

31 thoughts on - Looking For A Life-saving LVM Guru

  • Your data is spread across all 4 drives, and you lost 25% of it. So only
    3 out of 4 blocks of data still exist. Good luck with recovery.

  • Thank you, John for your quick reply. That is what I hope. But how to do it? I cannot even activate the LV with the remaining PVs.

    Thanks, Khem

  • Dear James,

    Thank you for being quick to help. Yes, I could see all of them:

    # vgs
    # lvs
    # pvs

    Regards, Khem

  • Dear John,

    I understand; I tried it in the hope that I could activate the LV again with a new PV replacing the damaged one. But still I could not activate it.

    What is the right way to recover what is left on the remaining PVs?

    Regards, Khem

  • Hello James and All,

    For your information, here's what the listing looks like:

    [root@localhost ~]# pvs
      PV         VG         Fmt  Attr PSize PFree
      /dev/sda1  vg_hosting lvm2 a--  1.82t    0
      /dev/sdb2  vg_hosting lvm2 a--  1.82t    0
      /dev/sdc1  vg_hosting lvm2 a--  1.82t    0
      /dev/sdd1  vg_hosting lvm2 a--  1.82t    0
    [root@localhost ~]# lvs
      LV      VG         Attr       LSize  Pool Origin Data% Meta% Move Log Cpy%Sync Convert
      lv_home vg_hosting -wi-s----- 7.22t
      lv_root vg_hosting -wi-a----- 50.00g
      lv_swap vg_hosting -wi-a----- 11.80g
    [root@localhost ~]# vgs
      VG         #PV #LV #SN Attr   VSize VFree
      vg_hosting   4   3   0 wz--n- 7.28t    0
    [root@localhost ~]#

    The problem is, when I do:

    [root@localhost ~]# vgchange -a y
    device-mapper: resume ioctl on failed: Invalid argument
    Unable to resume vg_hosting-lv_home (253:4)
    3 logical volume(s) in volume group "vg_hosting" now active

  • Next time, try "vgreduce --removemissing" first.

    In my experience, any lvm command using --force often has undesirable side effects.

    Regarding getting the lvm functioning again, there is also a --partial option that is sometimes useful with the various vg* commands when a PV is missing (see man lvm).

    And "vgdisplay -v" often regenerates missing metadata (as in getting a functioning lvm back).

    Steve
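
    A minimal sketch of the sequence Steve describes, assuming the OP's VG name and that recent metadata backups exist; exact option spellings vary by LVM version, so treat this as illustrative rather than a recipe:

    # vgcfgbackup vg_hosting                 ## save the current metadata before touching anything
    # vgreduce --removemissing vg_hosting    ## without --force this may refuse while LVs still span the missing PV
    # vgchange -a y --partial vg_hosting     ## older spelling; newer LVM uses --activationmode partial
    # vgdisplay -v vg_hosting                ## check what the VG and LVs look like afterwards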

  • OK, it's extremely rude to cross-post the same question across multiple lists like this at exactly the same time, and without at least noting the cross-posting. I just replied to the one on Fedora users before I saw this post. This sort of thing wastes people's time. Pick one list based on the best chance of a response and give it 24 hours.

    Chris Murphy

  • https://lists.fedoraproject.org/pipermail/users/2015-February/458923.html

    I don’t see how the VG metadata is restored with any of the commands suggested thus far. I think that’s vgcfgrestore. Otherwise I’d think that LVM has no idea how to do the LE to PE mapping.

    In any case, this sounds like a data scraping operation to me. XFS might be a bit more tolerant because AGs are distributed across all 4 PVs in this case, and each AG keeps its own metadata. But I still don't think the filesystem will be mountable, even read-only. Maybe testdisk can deal with it, and if not then debugfs -c rdump might be able to get some of the directories. But for sure the LV has to be active. And I expect modifications (resizing anything, fscking) astronomically increase the chance of total data loss. If it's XFS, xfs_db itself is going to take longer to read and understand than just restoring from backup (XFS has dense capabilities).

    On the other hand, Btrfs can handle this situation somewhat well so long as the fs metadata is raid1, which is the mkfs default for multiple devices. It will permit degraded mounting in such a case so recovery is straightforward. Missing files are recorded in dmesg.

    Chris Murphy
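
    For the vgcfgrestore route mentioned above, LVM keeps automatic metadata backups on the root filesystem; a rough sketch of checking them (the archive filename shown is hypothetical):

    # ls /etc/lvm/backup /etc/lvm/archive        ## automatic VG metadata backups and older archives
    # vgcfgrestore --list vg_hosting             ## list the archived metadata versions available
    # vgcfgrestore -f /etc/lvm/archive/vg_hosting_00010.vg vg_hosting   ## filename is hypothetical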

  • take a filing cabinet packed full of tens of thousands of files of hundreds of pages each, with the index cards interleaved in the files, remove
    1/4 of the pages in the folders, including some of the indexes… and toss everything else on the floor… this is what you have. 3 out of 4 pages, semi-randomly, with no idea what's what.

    an LV built from PVs that are just simple drives is something like RAID0, which isn't RAID at all, as there's no redundancy; it's AID-0.

  • And this is why I don't like LVM to begin with. If one of the drives dies, you're screwed not only for the data on that drive, but even for the data on the remaining healthy drives.

    I never really saw the point of LVM. Storing data on plain physical partitions, having an intelligent directory structure and a few wise well-placed symlinks across the drives can go a long way in having flexible storage, which is way more robust than LVM. With today’s huge drive capacities, I really see no reason to adjust the sizes of partitions on-the-fly, and putting several TB of data in a single directory is just Bad Design to begin with.

    That said, if you have a multi-TB amount of critical data while not having at least a simple RAID-1 backup, you are already standing in a big pile of sh*t just waiting to become obvious, regardless of LVM and stuff. Hardware fails, and storing data without a backup is just simply a disaster waiting to happen.

    Best, :-)
    Marko

  • If the LE to PE relationship is exactly linear, as in, the PV, VG, LV
    were all made at the same time, it’s not entirely hopeless. There will be some superblocks intact so scraping is possible.

    I just tried this with a 4 disk LV and XFS. I removed the 3rd drive. I
    was able to activate the LV using:

    vgchange -a y --activationmode partial

    I was able to mount -o ro but I do get errors in dmesg:
    [ 1594.835766] XFS (dm-1): Mounting V4 Filesystem
    [ 1594.884172] XFS (dm-1): Ending clean mount
    [ 1602.753606] XFS (dm-1): metadata I/O error: block 0x5d780040
    (“xfs_trans_read_buf_map”) error 5 numblks 16
    [ 1602.753623] XFS (dm-1): xfs_imap_to_bp: xfs_trans_read_buf()
    returned error -5.

    # ls -l
    ls: cannot access 4: Input/output error
    total 0
    drwxr-xr-x. 3 root root 16 Feb 27 20:40 1
    drwxr-xr-x. 3 root root 16 Feb 27 20:43 2
    drwxr-xr-x. 3 root root 16 Feb 27 20:47 3
    ??????????? ? ? ? ? ? 4

    # cp -a 1/ /mnt/btrfs
    cp: cannot stat '1/usr/include': Input/output error
    cp: cannot stat '1/usr/lib/alsa/init': Input/output error
    cp: cannot stat '1/usr/lib/cups': Input/output error
    cp: cannot stat '1/usr/lib/debug': Input/output error
    […]

    And now in dmesg, thousands of
    [ 1663.722490] XFS (dm-1): metadata I/O error: block 0x425f96d0
    (“xfs_trans_read_buf_map”) error 5 numblks 8

    Out of what should have been 3.5GB of data in 1/, I was able to get 452MB.

    That’s not so bad for just a normal mount and copy. I am in fact shocked the file system mounts, and stays mounted. Yay XFS.


    Chris Murphy
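
    If you want to gauge the damage before copying anything, a read-only look is possible; a sketch, with the device path assumed to match a test LV like the one above:

    # xfs_repair -n /dev/test/lv      ## -n: no-modify mode, only reports problems (run on the unmounted LV)
    # xfs_db -r /dev/test/lv          ## -r: read-only browsing of on-disk structures
    xfs_db> sb 0
    xfs_db> print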

  • OK so ext4 this time, with new disk images. I notice at mkfs.ext4 time that each virtual disk goes from 2MB used to 130MB-150MB. That's a lot of fs metadata, and it's fairly evenly distributed across the drives (see the dumpe2fs sketch at the end of this reply).

    Copied 3.5GB to the volume. Unmount. Poweroff. Killed 3rd of 4. Boot. Mounts fine. No errors. HUH surprising. As soon as I use ls though:

    [ 182.461819] EXT4-fs error (device dm-1): __ext4_get_inode_loc:3806:
    inode #43384833: block 173539360: comm ls: unable to read itable block

    # cp -a usr /mnt/btrfs
    cp: cannot stat 'usr': Input/output error

    [ 214.411859] EXT4-fs error (device dm-1): __ext4_get_inode_loc:3806:
    inode #43384833: block 173539360: comm ls: unable to read itable block
    [ 221.067689] EXT4-fs error (device dm-1): __ext4_get_inode_loc:3806:
    inode #43384833: block 173539360: comm cp: unable to read itable block

    I can't get anything off the drive. And what I have here are ideal conditions, because it's a brand new, clean file system: no fragmentation, nothing about the LVM volume has been modified, no fsck done. So nothing is corrupt; it's just missing a 1/4 hunk of its PEs. I'd say an older, production-use fs has zero chance of recovery via mounting.

    So this is now a scraping operation with ext4.

    Chris Murphy
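
    The dumpe2fs sketch mentioned above, showing how much metadata ext4 lays down and how many block groups it is spread across (device path assumed from this test):

    # dumpe2fs -h /dev/test/lv                    ## superblock summary: block counts, number of groups, features
    # dumpe2fs /dev/test/lv | grep -c '^Group'    ## count of block groups, each carrying its own metadata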

  • with classic LVM, you were supposed to use RAID for your PVs. The new LVM in 6.3+ has integrated RAID at the LV level; you just have to create your LVs with the appropriate raid levels.

  • I think the mirror segment type has been available since the inception of LVM2; it is now legacy (but still available). The current type since CentOS 6.3
    is raid1. But yes, for anything raid4 and above you previously had to create it with mdadm or use hardware RAID (which of course you can still do; most people still prefer managing software RAID with mdadm rather than LVM's tools).
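
    A minimal sketch of what declaring such an LV looks like (the VG name, LV name, and size here are made up for illustration):

    # lvcreate --type raid1 -m 1 -L 100G -n lv_home vg_hosting   ## mirrored LV; needs at least two PVs with free space
    # lvs -a -o name,segtype,devices vg_hosting                  ## shows the rimage/rmeta subvolumes backing the raid1 LV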

  • And then Btrfs (no LVM).

    mkfs.btrfs -d single /dev/sd[bcde]
    mount /dev/sdb /mnt/bigbtr
    cp -a /usr /mnt/bigbtr

    Unmount. Poweroff. Kill 3rd of 4 drives. Poweron.

    mount -o degraded,ro /dev/sdb /mnt/bigbtr   ## degraded,ro is required or mount fails
    cp -a /mnt/bigbtr/usr/ /mnt/btrfs           ## copy to a different volume

    No dmesg errors. Just a bunch of I/O errors from cp when it tried to copy data that was on the 3rd drive. But it continues.

    # du -sh /mnt/btrfs/usr
    2.5G usr

    Exactly 1GB was on the missing drive. So I recovered everything that wasn’t on that drive.

    One gotcha that applies to all three filesystems that I'm not testing: in-use drive failure. I'm simulating drive failure by first cleanly unmounting and powering off. Super ideal. How the file system and anything underneath it (LVM and maybe RAID) handles drive failures while in use is a huge factor.

    Chris Murphy

  • It has its uses, just like RAID0 has uses. But yes, as the number of drives in the pool increases, the risk of catastrophic failure increases. So you have to bet on consistent backups and be OK with any intervening data loss. If not, well, use RAID1+ or a distributed-replication cluster like GlusterFS or Ceph.

    I agree. I kind of get a wee bit aggressive and say: if you don't have backups, the data is by (your own) definition not important.

    Anyway, changing the underlying storage as little as possible gives the best chance of success. The linux-raid@ list is full of raid5/6 implosions due to people panicking, reading a bunch of stuff, not identifying their actual problem, and just typing a bunch of commands, ending up with user-induced data loss.

    In the case of this thread, I’d say the best chance for success is to not remove or replace the dead PV, but to do a partial activation.
    # vgchange -a y --activationmode partial

    And then for ext4 it's a scrape operation with debugfs -c. For XFS it looks like some amount of data is possibly recoverable with just an ro mount. I didn't try any scrape operation; too tedious to test.
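
    A sketch of what that scrape looks like in practice, using the OP's VG/LV names; which directories survive depends entirely on what landed on the missing PV, and the directory name below is hypothetical:

    # vgchange -a y --activationmode partial vg_hosting
    # debugfs -c /dev/vg_hosting/lv_home               ## -c: catastrophic mode, read-only, skips unreadable bitmaps
    debugfs:  ls -l /
    debugfs:  rdump /some_surviving_dir /mnt/rescue    ## recursively copy a directory out to another volume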

  • Indeed. That is why there are no LVMs in my server room. No software RAID either. Software RAID relies on the system itself to fulfill its RAID function; what if the kernel panics before software RAID does its job? Hardware RAID (for huge filesystems I cannot afford to back up) is the only thing that makes sense for me. A RAID controller has dedicated processors and a dedicated, simple system which does one simple task: RAID.

    Just my $0.02

    Valeri

    ++++++++++++++++++++++++++++++++++++++++
    Valeri Galtsev
    Sr System Administrator
    Department of Astronomy and Astrophysics
    Kavli Institute for Cosmological Physics
    University of Chicago
    Phone: 773-702-4247
    ++++++++++++++++++++++++++++++++++++++++

  • The biggest problem is that myriad defaults aren't very well suited for multiple-device configurations. There are a lot of knobs in Linux, on the drives, and in hardware RAID cards. None of this is that simple.

    Drives and hardware RAID cards are subject to firmware bugs, just as we have software bugs in the kernel. We know firmware bugs cause corruption. Not all hardware RAID cards are the same; some are total junk. Many others get you vendor lock-in due to proprietary metadata written to the drives. You can't get your data off if the card dies; you have to buy a similar model card, sometimes with the same firmware version, in order to regain access. Some cards support SNIA's DDF format, in which case there's a chance mdadm can assemble the array should the hardware card die (see the mdadm sketch at the end of this reply).

    Anyway, the main thing is knowing where the land mines are regardless of what technology you pick. If you don’t know where they are, you’re inevitably going to run into trouble with anything you choose.
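
    The mdadm sketch mentioned above, for the case where a dead card really did write SNIA DDF metadata (device name assumed):

    # mdadm --examine /dev/sdb        ## reports the on-disk metadata format (e.g. ddf), if mdadm recognizes one
    # mdadm --assemble --scan         ## attempt to assemble arrays from whatever metadata is found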

  • Speaking of which: I would only use good hardware cards, and only good external RAID boxes. Over the last decade and a half I have never had trouble due to firmware bugs in RAIDs. What I use is:

    1. 3ware (mostly)
    2. LSI megaraid (a few, I don’t like their user interface and poor notification abilities)
    3. Areca (also a few, better UI than that of LSI)

    External RAID boxes: Infortrend

    I will never go for cheapy fake RAID (Adaptec is one off the top of my head). Also, it was not my choice, but I have had to deal with, hm, not-so-good external RAID boxes: by Promise and by Raid.com, to mention two.

    You are implying that the firmware of hardware RAID cards is somehow buggier than the software of software RAID plus the Linux kernel (sorry if I misinterpreted your point). I disagree: the embedded system of a RAID card and the RAID function it has to fulfill are much simpler than everything involved in software RAID. Therefore, with the same effort invested, the firmware of (good) hardware is less buggy. And again, the Linux kernel is more likely to panic than the trivial embedded system of a hardware RAID card/box. At least my experience over a decade and a half confirms that.

    I have heard horror stories from people who used the same good hardware I mentioned (3ware). However, when I dug deep into the details in each case, I discovered that they just didn't have everything set up correctly, which is trivial as a matter of fact. Namely, the common mistake in all cases was not setting up a RAID verify task (it is set at the RAID configuration level). I have my RAIDs verified once a week. If you don't verify them for a year, what happens is that you don't discover individual drive degradation until it is too late, and more drives than the level of redundancy are kicked out because of fatal failures. Even then, 3ware, once the array is no longer redundant, doesn't kick out newly failing drives; it just makes the RAID read-only, so you can still salvage something. Anyway, these horror stories were purely a poor sysadmin's job, IMHO.

    I would not consider that a disadvantage. I have yet to see a dead 3ware card (yes, you can burn one if you plug it into a slot with gross misalignment, like a tilt). And with 3ware, a later model will accept drives originally making up a RAID on an older model, only it will make the RAID read-only; thus you can salvage your data first, then re-create the RAID with the new card's metadata standard. I guess I may have a different philosophy than you do. If I use a RAID card, I choose a good one indeed. Once I use the good one, I feel no need to move drives to a card made by a different manufacturer. And one last, yet important, thing: if you have to use these drives with a different card (even just a different model by the same manufacturer), then you had better re-create the RAID from scratch on the new card. If you value your data…

    Just my $0.02

    Valeri

    ++++++++++++++++++++++++++++++++++++++++
    Valeri Galtsev
    Sr System Administrator
    Department of Astronomy and Astrophysics
    Kavli Institute for Cosmological Physics
    University of Chicago
    Phone: 773-702-4247
    ++++++++++++++++++++++++++++++++++++++++

  • It's a good point. Suggesting the OP's problem is an example of why LVM should not be used is like saying dropped laptops are a good example of why laptops shouldn't be used.

    A fair criticism is whether LVM should be used by default for single-disk system installations. I've always been suspicious of this choice. (But now even Apple does this on OS X by default, possibly as a prelude to making full-volume encryption a default; their "LVM" equivalent implements encryption as an LV-level attribute called the logical volume family.)

  • "Drives, and hardware RAID cards are subject to firmware bugs, just as we have software bugs in the kernel." makes no assessment of how common such bugs are relative to each other.

    There's no evidence provided for this. All I've stated is that bugs happen in both software and the firmware on hardware RAID cards. http://www.cs.toronto.edu/~bianca/papers/fast08.pdf

    And further there’s a widespread misperception that RAID56 (whether software or hardware) is capable of detecting and correcting corruption.

    I’d say this is not a scientific sample and therefore unproven. I can provide my own non-scientific sample: an XServe running OS X with software raid1 which has never, in 8 years, kernel panicked. Its longest uptime was over 500 days, and was only rebooted due to a system upgrade that required it. There’s nothing special about the XServe that makes this magic, it’s just good hardware with ECC memory, enterprise SAS drives, and a capable though limited kernel. So there’s no good reason to expect kernel panics. Having them means something is wrong.

    This is a common problem on software and hardware RAID alike: the lack of scrubbing (see the md scrub sketch at the end of this reply). Also recognize that software RAID tends to bring along cheaper drives that aren't well suited for RAID use, whereas people spending money on hardware RAID tend to invest in appropriate drives, which avoids problems because those drives have proper SCT ERC settings.

    I agree. This is common in any case.
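
    The md scrub sketch mentioned above; a minimal example for Linux software RAID, assuming an array named md0 (CentOS also ships a periodic raid-check job that does effectively this, if memory serves):

    # echo check > /sys/block/md0/md/sync_action   ## start a read-and-verify scrub of md0
    # cat /proc/mdstat                             ## watch the check progress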

  • I'll qualify this better. For CentOS it's a fine default, as it is for Fedora Server. For Workstation and Cloud I think LVM overly complicates things. More non-enterprise users get confused by LVM than ever have a need to resize volumes.

    XFS doesn't support shrink, only grow. XFS is the CentOS 7 default. The main advantage of LVM for CentOS system disks is the ability to use pvmove to replace a drive online, rather than resizing. If Btrfs stabilizes sufficiently for RHEL/CentOS 8, overall it's a win because it meets the simple needs of mortal users and supports advanced features for advanced users. (Ergo, I think LVM is badass, but it's also the storage equivalent of emacs: managing it is completely crazy.)

    Yeah, my bad for partly derailing this thread. Hopefully the original poster hasn't been scared off, not least by my bark about cross-posting, which is worse than my bite.

  • Dear Chris, James, Valeri and all,

    Sorry for not having responded; I'm still struggling with the recovery, with no success.

    I've been trying to set up a new system with the exact same scenario (4 2TB hard drives, removing the 3rd one afterwards). I still cannot recover the data.

    We did have a backup system, but it went bad a while ago and we did not get a replacement in place before this happened.

    From all of your responses, it seems recovery is almost impossible. I'm now looking at the hardware side and trying to get the damaged hard drive fixed.

    I appreciate all your help and am still waiting and listening for more suggestions.

    Regards, Khem

  • There is a possibly expensive route. Depending on how valuable the data is, you may think of contacting professional recovery services. They usually take about a month, and they are expensive. Decent ones will be on the order of $1000 if it is a single drive; likely more if it is a fatally failed RAID. You can do your research and find good ones close to you. The rule of thumb is: if they only charge in case of a more or less successful recovery (sometimes they can recover almost 100%, sometimes 70-80%, sometimes nothing, in which case they will not charge you), then it probably is a decent company. They live off the results of their work. If they charge for an "estimate" even when they tell you later they cannot recover anything, that is a bad sign. They work with fine equipment to read stuff off the platters of dead drives. They work at the level of debugging filesystems (and RAIDs), so what they charge is usually not that much for the kind of work they do. If you don't feel you are the level of expert they are, and the data is worth it, I would contact recovery services. I myself usually have good backups (knocking on wood), but I know several people who actually used some of these companies, and their data got recovered. If you come to the point of needing some references, contact me off the list; I'll dig up my old emails and send you what people (whom I know in person) say about the companies they used successfully.

    Valeri

    ++++++++++++++++++++++++++++++++++++++++
    Valeri Galtsev
    Sr System Administrator
    Department of Astronomy and Astrophysics
    Kavli Institute for Cosmological Physics
    University of Chicago
    Phone: 773-702-4247
    ++++++++++++++++++++++++++++++++++++++++

  • Well, it's effectively a raid0. While it's not block-level striping but a linear allocation, the way ext4 and XFS write means you're going to get file extents and fs metadata strewn across all four drives. As soon as any one drive is removed, the whole thing is sufficiently damaged that it can't recover without a lot of work. Imagine a (really bad example, physics-wise) single-drive scenario where you magically punch a hole through the drive such that it'll still spin. The fs on that drive is going to have all sorts of problems because of the hole, even if it can read 3/4 of the drive.

    About the best-case scenario in such a situation is to literally do nothing with the LVM setup and send that PV off for block-level data recovery (you didn't say how it failed, but I'm assuming it's beyond the ability to fix it locally). Then, once the recovered replacement PV is back in the setup, things will just work again. *shrug* Linear LVM isn't designed to be fail-safe in the face of a single device failure.

  • Actually, I’m probably wrong in the previous post about sending off the single bad PV for recovery. Your point above made me think, umm yeah no, pretty much any company specializing in data recovery will want the entire array/LV backing drives, even the good ones. Same for RAID, they probably don’t want just the dead drive, they want the whole thing. And they charge by the total size. So, yeah probably a lot more than $1K.

    Agreed.