Lvm Cache + Qemu-kvm Stops Working After About 20GB Of Writes

Hello,

I would really appreciate some help or guidance with this problem. First of all, sorry for the long message. I would file a bug, but I do not know whether the fault lies with me, dm-cache, qemu, or (probably) a combination of these. I can imagine some of you have this setup running without problems (or perhaps you think it works, just as I did, when it does not):

PROBLEM
LVM cache in writeback mode stops working as expected after a while with a qemu-kvm VM. A 100% working setup would be the holy grail in my opinion, and I must say the performance of KVM/qemu is great in the beginning.

DESCRIPTION

When I use software RAID 1 (2x HDD) plus software RAID 1 (2x SSD) and create a cached LV out of them, the VM initially performs great (at least 40,000 IOPS on 4k random read/write)! But after a while (and a lot of random IO, ca. 10 to 20 GB) it effectively turns into a writethrough cache, although there is plenty of space left on the cached LV.

When working as expected, on the KVM host all writes go to the SSDs (sdc and sdd):

iostat -x -m 2

Device:  rrqm/s   wrqm/s     r/s      w/s  rMB/s   wMB/s  avgrq-sz  avgqu-sz   await  r_await  w_await  svctm   %util
sda        0.00   324.50    0.00    22.00   0.00   14.94   1390.57      1.90   86.39     0.00    86.39   5.32   11.70
sdb        0.00   324.50    0.00    22.00   0.00   14.94   1390.57      2.03   92.45     0.00    92.45   5.48   12.05
sdc        0.00  3932.00    0.00  2191.50   0.00  270.07    252.39     37.83   17.55     0.00    17.55   0.36   78.05
sdd        0.00  3932.00    0.00  2197.50   0.00  271.01    252.57     38.96   18.14     0.00    18.14   0.36   78.95

When not working as expected, on the KVM host all writes go through the SSDs on to the HDDs (effectively disabling writeback, so the cache behaves as writethrough):

Device:  rrqm/s   wrqm/s     r/s      w/s  rMB/s   wMB/s  avgrq-sz  avgqu-sz   await  r_await  w_await  svctm   %util
sda        0.00     7.00  234.50   173.50   0.92    1.95     14.38     29.27   71.27   111.89    16.37   2.45  100.00
sdb        0.00     3.50  212.00   177.50   0.83    1.95     14.60     35.58   91.24   143.00    29.42   2.57  100.10
sdc        2.50     0.00  566.00   199.00   2.69    0.78      9.28      0.08    0.11     0.13     0.04   0.10    7.70
sdd        1.50     0.00   76.00   199.00   0.65    0.78     10.66      0.02    0.07     0.16     0.04   0.07    1.85
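
To compare the two states without reading the full table each time, the per-device write columns can be pulled out of the iostat output. This is only a small sketch: the function name `summarize_writes` is made up for illustration, and it assumes the sysstat column layout shown above (w/s in field 5, wMB/s in field 7, %util last) and device names of the form sdX.

```shell
#!/usr/bin/env bash
# Summarize where writes land, reading `iostat -x -m` output on stdin.
# Assumption: column layout as in the tables above
# (field 5 = w/s, field 7 = wMB/s, last field = %util).
summarize_writes() {
    awk '
        # Only look at whole-disk rows such as sda, sdb, ...; the header
        # row ("Device:") and blank lines are skipped automatically.
        $1 ~ /^sd[a-z]+$/ {
            printf "%s w/s=%s wMB/s=%s util=%s\n", $1, $5, $7, $NF
        }
    '
}

# Usage (hypothetical): iostat -x -m 2 | summarize_writes
```

In the healthy state the SSD rows should carry nearly all of the wMB/s; in the slow state the HDD rows show %util close to 100 while the SSDs are mostly idle.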

Things I have checked or tried:

– The data in the cached LV had not even exceeded half of the cache space, so this should not happen. It even happens when only 20% of the cache data is in use.
– It seems to be triggered most of the time when the Cpy%Sync column of `lvs -a` is about 30%. But this is not always the case!
– Changing the cache policy from smq to cleaner, waiting (check with lvs -a whether the flush is done) and then switching back to smq seems to help /sometimes/! But not always…

lvchange --cachepolicy cleaner /dev/mapper/XXX-cachedlv

lvs -a

lvchange --cachepolicy smq /dev/mapper/XXX-cachedlv

– *When mounting the LV inside the host this does not seem to happen!!*
So it looks like a qemu-kvm / dm-cache combination issue. The only difference is that on the host I run mkfs directly on the LV instead of using LVM inside the VM (so it could also be a problem with LVM inside the VM on top of LVM on the KVM host? Probably a small chance, because for the first 10 to 20 GB it works great!)

– Tried disabling SELinux, upgrading to the newest kernels (elrepo ml and lt), and playing around with the dirty-page settings such as /proc/sys/vm/dirty_writeback_centisecs, /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_ratio, the migration threshold via dmsetup, and other probably unimportant settings such as vm.dirty_bytes.

– When in the “slow state”, the system's kworkers use IO excessively (10 to 20 MB per kworker process). This seems to be the writeback process (Cpy%Sync), because the cache wants to flush to the HDDs. But the strange thing is that after a complete sync (0% left), the disk may become slow again after a few MB of data. A reboot sometimes helps.

– Have tried iothreads, virtio-scsi, the vcpu driver setting on the virtio-scsi controller, cache settings, disk schedulers etc. Nothing helped.

– The new Samsung 950 PRO SSDs have HPA enabled (30%!!). The host has an AMD FX(tm)-8350 and 16G RAM.
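
The cleaner/smq switch from the list above can be scripted so that the wait is not done by eye. The following is a rough sketch, not a tested fix: the function names are made up, the LV is passed as VG/LV, and it assumes that this lvm2 build offers the `cache_dirty_blocks` report field described in lvmcache(7).

```shell
#!/usr/bin/env bash
# Sketch: switch a cached LV to the "cleaner" policy, wait until no dirty
# blocks remain, then switch back to "smq".
# Assumption: lvs supports the cache_dirty_blocks report field.

dirty_blocks() {
    # Print the number of dirty cache blocks of VG/LV without padding.
    lvs --noheadings -o cache_dirty_blocks "$1" | tr -d '[:space:]'
}

flush_cache() {
    local lv=$1 interval=${2:-5} dirty
    lvchange --cachepolicy cleaner "$lv" || return 1
    while :; do
        dirty=$(dirty_blocks "$lv")
        case $dirty in
            0) break ;;                      # cache is clean
            ''|*[!0-9]*)                     # empty or non-numeric: give up
                echo "unexpected dirty block count: '$dirty'" >&2
                return 1 ;;
        esac
        sleep "$interval"
    done
    lvchange --cachepolicy smq "$lv"
}

# Usage (hypothetical names): flush_cache cl/cachedlv 10
```

Since the manual switch only helps sometimes, this merely automates the workaround; it does not explain why the cache falls back into the slow state.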

It feels like the LVM cache has a threshold (about 20G of dirty data) after which it stops allowing the qemu-kvm process to use writeback caching (use from within the host itself does not seem to have this limitation). It starts flushing, but only up to a certain point. After a few MB of data it is right back in the slow state again. The only solution is waiting for a long time (independent of Cpy%Sync) or sometimes changing the cache policy and forcing a flush. For me this prevents production use of this system. But it is so promising, so I hope somebody can help.
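
One way to test the threshold idea is to log the dirty block count while the VM is writing. The sketch below parses a dm-cache line from `dmsetup status`; the field positions follow my reading of the kernel's dm-cache documentation (cache block size in sectors in field 6, used/total cache blocks in field 7, dirty blocks in field 14) and should be treated as an assumption to verify on the running kernel. The function name is made up.

```shell
#!/usr/bin/env bash
# Sketch: summarize cache usage from a `dmsetup status <cached-lv>` line
# read on stdin.
# Assumed layout: <start> <len> cache <meta bs> <used/total meta>
#   <cache bs (sectors)> <used/total cache blocks> <read hits> <read misses>
#   <write hits> <write misses> <demotions> <promotions> <dirty> ...
cache_dirty_summary() {
    awk '$3 == "cache" {
        split($7, blocks, "/")
        total = blocks[2] + 0
        # Dirty data in MiB: dirty blocks * sectors per block / 2048.
        dirty_mib = $14 * $6 / 2048
        pct = (total > 0) ? 100 * $14 / total : 0
        printf "used=%d total=%d dirty=%d dirty_mib=%.0f dirty_pct=%.1f\n",
               blocks[1], total, $14, dirty_mib, pct
    }'
}

# Usage (hypothetical device name):
#   dmsetup status cl-cachedlv | cache_dirty_summary
```

Running this in a loop during the FIO test would show whether the slowdown really starts at a fixed amount of dirty data.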

Desired state: running the FIO test (described in the REPRODUCE section) repeatedly should stay fast until the cached LV is more or less full. If syncing back to disk causes this degradation, it should actually flush fully within a reasonable time and give the opportunity to write fast again up to a given threshold. It now seems like a one-time-use cache that only uses a fraction of the SSD and is useless or very unstable afterwards.
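
The exact FIO parameters are not shown here, so the following is only a guess at a comparable 4k random read/write job, written as a small generator so the target and size can be varied. The function name, the job name, the queue depth and the default size are all assumptions; the option names themselves are standard fio job options.

```shell
#!/usr/bin/env bash
# Sketch: print a fio job file for a 4k random read/write test.
# All values are illustrative assumptions, not the original test.
make_fio_job() {
    local target=$1 size=${2:-1G}
    cat <<EOF
[randrw-4k]
filename=${target}
rw=randrw
bs=4k
direct=1
ioengine=libaio
iodepth=32
size=${size}
EOF
}

# Usage (hypothetical; run inside the guest against a scratch file):
#   make_fio_job /root/fio-test.bin 4G > job.fio && fio job.fio
```

Repeating the job until the total written exceeds 20 GB should be enough to hit the slowdown described above, if it is reproducible.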

REPRODUCE
1. Install newest CentOS 7 on software RAID 1 HDDs with LVM. Keep a lot of space for the LVM cache (no /home)! So make the VG as large as possible during anaconda partitioning.

2. Once installed and booted into the system, install qemu-kvm

yum install -y centos-release-qemu-ev
yum install -y qemu-kvm-ev libvirt bridge-utils net-tools
# disable ksm (probably not important / needed)
systemctl disable ksm
systemctl disable ksmtuned

3. create LVM cache

# set some variables and create a RAID 1 array with the two SSDs

VGBASE= && ssddevice1=/dev/sdX1 && ssddevice2=/dev/sdX1 && \
hddraiddevice=/dev/mdXXX && ssdraiddevice=/dev/mdXXX && \
mdadm --create --verbose ${ssdraiddevice} --level=mirror --bitmap=none --raid-devices=2 \
${ssddevice1} ${ssddevice2}

# create PV and extend VG

pvcreate ${ssdraiddevice} && vgextend ${VGBASE} ${ssdraiddevice}

# create slow LV on HDDs (use max space left if you want)

pvdisplay ${hddraiddevice}
lvcreate -lXXXX -n cachedlv ${VGBASE} ${hddraiddevice}

# create the meta and data LVs: for testing purposes I keep about 20G of the SSD for an uncached LV, to rule out that the SSD itself is the problem.

lvcreate -l XX -n testssd ${VGBASE} ${ssdraiddevice}

#The rest can be used as cachedata/metadata.

pvdisplay ${ssdraiddevice}
# about 1/1000 of the space you have left on the SSD for the meta (minimum of 4)
lvcreate -l X -n cachemeta ${VGBASE} ${ssdraiddevice}
# the rest can be used as cachedata
lvcreate -l XXX -n cachedata ${VGBASE} ${ssdraiddevice}
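
The 1/1000 sizing rule for cachemeta can be written down as a small calculation. This is a sketch under two assumptions: that the "minimum of 4" refers to extents (the values are passed to `lvcreate -l`), and that everything not used for metadata goes to cachedata. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: split the free extents on the SSD PV into cachemeta and cachedata.
# Assumption: "minimum of 4" means 4 extents.

cachemeta_extents() {
    # 1/1000 of the free extents, but never fewer than 4.
    local free=$1 meta
    meta=$(( free / 1000 ))
    if [ "$meta" -lt 4 ]; then
        meta=4
    fi
    echo "$meta"
}

cachedata_extents() {
    # Whatever is left after the metadata LV has been carved out.
    local free=$1
    echo $(( free - $(cachemeta_extents "$free") ))
}

# Usage (hypothetical; take the free extent count from pvdisplay):
#   lvcreate -l "$(cachemeta_extents 25025)" -n cachemeta ${VGBASE} ${ssdraiddevice}
#   lvcreate -l "$(cachedata_extents 25025)" -n cachedata ${VGBASE} ${ssdraiddevice}
```

lvconvert may also allocate a spare metadata LV of the same size as cachemeta, so it can be worth leaving a few extents free in the VG rather than using every last one for cachedata.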

# convert/combine pools so cachedlv is actually cached

lvconvert --type cache-pool --cachemode writeback --poolmetadata ${VGBASE}/cachemeta ${VGBASE}/cachedata

lvconvert --type cache --cachepool ${VGBASE}/cachedata ${VGBASE}/cachedlv

# my system now looks like this (the VG is called cl, the installer default)
[root@localhost ~]# lvs -a
  LV                VG Attr       LSize   Pool        Origin
  [cachedata]       cl Cwi---C---  97.66g
  [cachedata_cdata] cl Cwi-ao----  97.66g
  [cachedata_cmeta] cl ewi-ao---- 100.00m
  cachedlv          cl Cwi-aoC---   1.75t [cachedata] [cachedlv_corig]
  [cachedlv_corig]  cl owi-aoC---   1.75t
  [lvol0_pmspare]   cl ewi------- 100.00m
  root              cl -wi-ao----  46.56g
  swap              cl -wi-ao----  14.96g
  testssd           cl -wi-a-----  45.47g

[root@localhost ~]# lsblk
NAME                     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sdd                         8:48  0   163G  0 disk
└─sdd1                      8:49  0   163G  0 part
  └─md128                  9:128  0 162.9G  0 raid1
    ├─cl-cachedata_cmeta   253:4  0   100M  0 lvm
    │ └─cl-cachedlv        253:6  0   1.8T  0 lvm
    ├─cl-testssd           253:2  0  45.5G  0 lvm
    └─cl-cachedata_cdata   253:3  0  97.7G  0 lvm
      └─cl-cachedlv        253:6  0   1.8T  0 lvm
sdb                         8:16  0   1.8T  0 disk
├─sdb2                      8:18  0   1.8T  0 part
│ └─md127                  9:127  0   1.8T  0 raid1
│   ├─cl-swap              253:1  0    15G  0 lvm   [SWAP]
│   ├─cl-root              253:0  0  46.6G  0 lvm   /
│   └─cl-cachedlv_corig    253:5  0   1.8T  0 lvm
│     └─cl-cachedlv        253:6  0   1.8T  0 lvm
└─sdb1                      8:17  0   954M  0 part
  └─md126                  9:126  0   954M  0 raid1 /boot
sdc                         8:32  0   163G  0 disk
└─sdc1                      8:33  0   163G  0 part
  └─md128                  9:128  0 162.9G  0 raid1
    ├─cl-cachedata_cmeta   253:4  0   100M  0 lvm
    │ └─cl-cachedlv        253:6  0   1.8T  0 lvm
    ├─cl-testssd           253:2  0  45.5G  0 lvm
    └─cl-cachedata_cdata   253:3  0  97.7G  0 lvm
      └─cl-cachedlv        253:6  0   1.8T  0 lvm
sda                          8:0  0   1.8T  0 disk
├─sda2                       8:2  0   1.8T  0 part
│ └─md127                  9:127  0   1.8T  0 raid1
│   ├─cl-swap              253:1  0    15G  0 lvm   [SWAP]
│   ├─cl-root              253:0  0  46.6G  0 lvm   /
│   └─cl-cachedlv_corig    253:5  0   1.8T  0 lvm
│     └─cl-cachedlv        253:6  0   1.8T  0 lvm
└─sda1                       8:1  0   954M  0 part
  └─md126                  9:126  0   954M  0 raid1 /boot

# now create the VM
wget http://ftp.tudelft.nl/CentOS.org/6/isos/x86_64/CentOS-6.9-x86_64-minimal.iso -P /home/
DISK=/dev/mapper/XXXX-cachedlv

# watch out, my network setup uses a custom bridge/network in the following command. Please replace it with what you normally use.
virt-install -n CentOS1 -r 12000 --os-variant

One thought on - Lvm Cache + Qemu-kvm Stops Working After About 20GB Of Writes

  • Adding Paolo and Miroslav.

    On Sat, Apr 8, 2017 at 4:49 PM, Richard Landsman – Rimote