Huge Write Amplification With Thin Provisioned Logical Volumes


Hi,

I’ve noticed a huge write amplification problem with thinly provisioned logical volumes, and I wonder if anyone can explain why it happens and whether (and how) it can be fixed. The behavior is the same on CentOS 6.8 and CentOS 7.2.

I have a NVME card (Intel DC P3600 -2 TB) on which I create a thinly provisioned logical volume:

pvcreate /dev/nvme0n1
vgcreate vgg /dev/nvme0n1
lvcreate -l100%FREE -T vgg/thinpool
lvcreate -V40000M -T vgg/thinpool -n brick1
mkfs.xfs /dev/vgg/brick1
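One thing I wondered about is the thin-pool chunk size (visible with `lvs -o name,chunk_size vgg`). As a back-of-the-envelope check — assuming the 64 KiB default chunk size documented in lvmthin(7), which is an assumption on my part — a 4 KiB synchronous write that dirtied a whole chunk would give:

```shell
# Hypothetical worst case: every 4 KiB dsync write touches a full
# thin-pool chunk (64 KiB is the documented lvmthin default).
chunk_kib=64
io_kib=4
echo "worst-case per-chunk amplification: $((chunk_kib / io_kib))x"
```

That alone would not explain 33x, but it shows how small synchronous writes interact badly with larger allocation units.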

If I run a write test ( dd if=/dev/zero of=./zero.img bs=4k count0000 oflag=dsync ), I see in iotop that the actual disk write is about 30 times the amount of data I am actually writing:

Total DISK READ: 0.00 B/s | Total DISK WRITE: 1001.23 M/s
    TIME  TID   PRIO  USER  DISK READ  DISK WRITE  SWAPIN   IO      COMMAND
10:59:53 34453  be/4  root   0.00 B/s  30.34 M/s   0.00 %  12.10 %  dd if=/dev/zero of=./zero.img bs=4k count0000 oflag=dsync
Total DISK READ: 0.00 B/s | Total DISK WRITE: 991.92 M/s
    TIME  TID   PRIO  USER  DISK READ  DISK WRITE  SWAPIN   IO      COMMAND
10:59:54 34453  be/4  root   0.00 B/s  30.05 M/s   0.00 %  12.63 %  dd if=/dev/zero of=./zero.img bs=4k count0000 oflag=dsync
Total DISK READ: 0.00 B/s | Total DISK WRITE: 1024.52 M/s
    TIME  TID   PRIO  USER  DISK READ  DISK WRITE  SWAPIN   IO      COMMAND
10:59:55 34453  be/4  root   0.00 B/s  31.05 M/s   0.00 %  12.49 %  dd if=/dev/zero of=./zero.img bs=4k count0000 oflag=dsync
10:59:55  1057  be/3  root   0.00 B/s  15.39 K/s   0.00 %   0.01 %  [jbd2/sda1-8]
Total DISK READ: 0.00 B/s | Total DISK WRITE: 967.60 M/s
    TIME  TID   PRIO  USER  DISK READ  DISK WRITE  SWAPIN   IO      COMMAND
10:59:56 34453  be/4  root   0.00 B/s  29.32 M/s   0.00 %  12.75 %  dd if=/dev/zero of=./zero.img bs=4k count0000 oflag=dsync
Total DISK READ: 0.00 B/s | Total DISK WRITE: 943.66 M/s
    TIME  TID   PRIO  USER  DISK READ  DISK WRITE  SWAPIN   IO      COMMAND
10:59:58 34453  be/4  root   0.00 B/s  28.60 M/s   0.00 %  11.79 %  dd if=/dev/zero of=./zero.img bs=4k count0000 oflag=dsync
10:59:58 34448  be/4  root   0.00 B/s   3.84 K/s   0.00 %   0.00 %  python /usr/sbin/iotop -o -b -t
Total DISK READ: 0.00 B/s | Total DISK WRITE: 959.40 M/s
    TIME  TID   PRIO  USER  DISK READ  DISK WRITE  SWAPIN   IO      COMMAND
10:59:59 34453  be/4  root   0.00 B/s  29.07 M/s   0.00 %  11.81 %  dd if=/dev/zero of=./zero.img bs=4k count0000 oflag=dsync
Total DISK READ: 0.00 B/s | Total DISK WRITE: 948.38 M/s
    TIME  TID   PRIO  USER  DISK READ  DISK WRITE  SWAPIN   IO      COMMAND
11:00:00 34453  be/4  root   0.00 B/s  28.73 M/s   0.00 %  11.57 %  dd if=/dev/zero of=./zero.img bs=4k count0000 oflag=dsync

For a 30 MB/s write at the application level I get around 1000 MB/s of writes at the device level, i.e. roughly 33x amplification.
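The 33x figure comes straight from the iotop numbers above (rounded values taken from the sample output, not a fresh measurement):

```shell
# Approximate figures from the iotop sample above.
app_mb_s=30     # dd's write rate as seen by the application
dev_mb_s=1000   # total disk write reported by iotop
echo "amplification: ~$((dev_mb_s / app_mb_s))x"
```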

On CentOS 6, if I align the volumes using the values from https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Setting%20Up%20Volumes/ I get only a 7x amplification. On CentOS 7 I see the same 7x amplification with the default lvcreate options. This is the CentOS 7 iotop output:

12:48:29 Total DISK READ :  0.00 B/s | Total DISK WRITE :  32.24 M/s
12:48:29 Actual DISK READ:  0.00 B/s | Actual DISK WRITE: 226.63 M/s
    TIME  TID   PRIO  USER  DISK READ  DISK WRITE  SWAPIN   IO      COMMAND
12:48:29 15234  be/3  root   0.00 B/s   3.80 K/s   0.00 %  35.20 %  [jbd2/dm-8-8]
12:48:29 15258  be/4  root   0.00 B/s  32.24 M/s   0.00 %  10.64 %  dd if=/dev/zero of=./zero.img bs=4k count0000 oflag=dsync
12:48:29 14870  be/4  root   0.00 B/s   0.00 B/s   0.00 %   0.05 %  [kworker/u80:1]
12:48:29 15240  be/4  root   0.00 B/s   0.00 B/s   0.00 %   0.03 %  [kworker/u80:2]
12:48:29 15255  be/4  root   0.00 B/s   3.80 K/s   0.00 %   0.00 %  python /usr/sbin/iotop -o -b -t
12:48:30 Total DISK READ :  0.00 B/s | Total DISK WRITE :  31.97 M/s
12:48:30 Actual DISK READ:  0.00 B/s | Actual DISK WRITE: 224.85 M/s
    TIME  TID   PRIO  USER  DISK READ  DISK WRITE  SWAPIN   IO      COMMAND
12:48:30 15234  be/3  root   0.00 B/s   0.00 B/s   0.00 %  35.14 %  [jbd2/dm-8-8]
12:48:30 15258  be/4  root   0.00 B/s  31.97 M/s   0.00 %  10.61 %  dd if=/dev/zero of=./zero.img bs=4k count0000 oflag=dsync
12:48:30 14870  be/4  root   0.00 B/s   0.00 B/s   0.00 %   0.05 %  [kworker/u80:1]
12:48:30 15240  be/4  root   0.00 B/s   0.00 B/s   0.00 %   0.03 %  [kworker/u80:2]
12:48:31 Total DISK READ :  0.00 B/s | Total DISK WRITE :  32.50 M/s
12:48:31 Actual DISK READ:  0.00 B/s | Actual DISK WRITE: 228.94 M/s
    TIME  TID   PRIO  USER  DISK READ  DISK WRITE  SWAPIN   IO      COMMAND
12:48:31 15234  be/3  root   0.00 B/s   0.00 B/s   0.00 %  35.28 %  [jbd2/dm-8-8]
12:48:31 15258  be/4  root   0.00 B/s  32.48 M/s   0.00 %  10.72 %  dd if=/dev/zero of=./zero.img bs=4k count0000 oflag=dsync

Still, 7x write amplification seems like too much. Has anyone seen this, or does anyone have an explanation for it? I am rewriting the same file with dd multiple times, so the filesystem and thin LVM should be using already-provisioned space.
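For reference, the CentOS 7 iotop output separates "Total" (what processes submit) from "Actual" (what reaches the device), so the ratio can be computed directly from one sample (numbers copied from the 12:48:29 output above):

```shell
# Total DISK WRITE vs Actual DISK WRITE from the 12:48:29 sample.
awk 'BEGIN { printf "amplification: ~%.1fx\n", 226.63 / 32.24 }'
```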

Best regards, Radu