Soft Lockups With Xen4CentOS 3.18.25-18.el6.x86_64


I’ve been running 3.18.25-18.el6.x86_64 + our build of xen 4.4.3-9 on one host for the last couple of weeks and have gotten several soft lockups within the last 24 hours. I am posting here first in case anyone else has experienced the same issue.

Here is the first instance:

sched: RT throttling activated
NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
Modules linked in: ebt_arp xen_pciback xen_gntalloc ebt_ip ebtable_filter ebtables ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables xt_physdev br_netfilter bridge stp llc ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xen_acpi_processor blktap xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd joydev sg 8250_fintek serio_raw gpio_ich iTCO_wdt iTCO_vendor_support coretemp intel_powerclamp crct10dif_pclmul crc32_pclmul crc32c_intel pcspkr i2c_i801 lpc_ich igb ptp pps_core hwmon ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache raid10 raid1 sd_mod mptsas mptscsih mptbase scsi_transport_sas aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 ahci libahci mgag200 ttm drm_kms_helper dm_mirror dm_region_hash dm_log dm_mod
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.25-18.el6.x86_64 #1
Hardware name: Supermicro X8DTN+-F/X8DTN+-F, BIOS 080016 08/03/2011
task: ffffffff81c1b4c0 ti: ffffffff81c00000 task.ti: ffffffff81c00000
RIP: e030:[] [] xenvif_tx_build_gops+0xa5/0x890 [xen_netback]
RSP: e02b:ffff88013f403ca8 EFLAGS: 00000206
RAX: 000000000000003c RBX: ffffc90012c28000 RCX: ffffc90012c280d0
RDX: 0000000000071ea1 RSI: 0000000000000040 RDI: ffffc90012c28000
RBP: ffff88013f403e38 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000071e65
R13: ffff88013f403e50 R14: 000000000000003c R15: 0000000000000032
FS: 00007fe942ac7980(0000) GS:ffff88013f400000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffff8000007f6800 CR3: 00000000bcd4c000 CR4: 0000000000002660
Stack:
ffff88013f403d30 ffff880006de6800 ffff8800adfee1c0 ffff88013f403e54
000000015786e75a 00000040a0351878 ffffc90012c2def0 000000015786e73c ffffc90012c280d0 ffffc90012c2def0 ffffffffa03516c0 ffff8800bf29e000
Call Trace:

[] ? br_handle_frame_finish+0x3f0/0x3f0 [bridge]
[] ? __netif_receive_skb_core+0x1ee/0x640
[] ? __netif_receive_skb+0x27/0x70
[] ? netif_receive_skb_internal+0x2d/0x90
[] ? igb_alloc_rx_buffers+0x63/0xe0 [igb]
[] xenvif_tx_action+0x4d/0xa0 [xen_netback]
[] xenvif_poll+0x35/0x68 [xen_netback]
[] net_rx_action+0x112/0x2a0
[] __do_softirq+0xfc/0x2b0
[] irq_exit+0xbd/0xd0
[] xen_evtchn_do_upcall+0x3c/0x50
[] xen_do_hypervisor_callback+0x1e/0x40

[] ? xen_hypercall_sched_op+0xa/0x20
[] ? xen_hypercall_sched_op+0xa/0x20
[] ? xen_safe_halt+0x10/0x20
[] ? default_idle+0x24/0xc0
[] ? arch_cpu_idle+0xf/0x20
[] ? cpuidle_idle_call+0xd6/0x1d0
[] ? __atomic_notifier_call_chain+0x12/0x20
[] ? cpu_idle_loop+0x135/0x1e0
[] ? cpu_startup_entry+0x1b/0x70
[] ? cpu_startup_entry+0x60/0x70
[] ? rest_init+0x77/0x80
[] ? start_kernel+0x441/0x448
[] ? set_init_arg+0x5d/0x5d
[] ? x86_64_start_reservations+0x2a/0x2c
[] ? xen_start_kernel+0x5ef/0x5f1
Code: 00 0f 87 06 07 00 00 44 8b b3 b8 00 00 00 44 03 b3 c0 00 00 00 45 29 e6 41 39 c6 44 0f 47 f0 45 85 f6 0f 84 8f 00 00 00 0f ae e8 <8b> 83 c0 00 00 00 83 e8 01 44 21 e0 48 8d 04 40 48 c1 e0 02 48

The remaining lockups all share the backtrace below, except that in two instances the RIP was in net_rx_action:

[] net_rx_action+0x112/0x2a0
[] __do_softirq+0xfc/0x2b0
[] irq_exit+0xbd/0xd0
[] xen_evtchn_do_upcall+0x3c/0x50
[] xen_do_hypervisor_callback+0x1e/0x40

[] ? xen_hypercall_sched_op+0xa/0x20
[] ? xen_hypercall_sched_op+0xa/0x20
[] ? xen_safe_halt+0x10/0x20
[] ? default_idle+0x24/0xc0
[] ? arch_cpu_idle+0xf/0x20
[] ? cpuidle_idle_call+0xd6/0x1d0
[] ? __atomic_notifier_call_chain+0x12/0x20
[] ? cpu_idle_loop+0x135/0x1e0
[] ? cpu_startup_entry+0x1b/0x70
[] ? cpu_startup_entry+0x60/0x70
[] ? rest_init+0x77/0x80
[] ? start_kernel+0x441/0x448
[] ? set_init_arg+0x5d/0x5d
[] ? x86_64_start_reservations+0x2a/0x2c
[] ? xen_start_kernel+0x5ef/0x5f1

I can post more complete backtraces if that information would be useful to someone.

3 thoughts on - Soft Lockups With Xen4CentOS 3.18.25-18.el6.x86_64

  • Here is mpstat output from around the time of the issue:

    10:08:56 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
    10:09:10 PM all 0.00 0.00 66.67 0.00 0.00 33.33 0.00 0.00 0.00
    10:09:11 PM all 2.17 0.00 5.43 32.61 0.00 58.70 1.09 0.00 0.00
    10:09:12 PM all 0.00 0.00 1.15 0.00 0.00 85.06 0.00 0.00 13.79
    10:09:13 PM all 0.00 0.00 1.08 0.00 0.00 83.87 0.00 0.00 15.05
    10:09:14 PM all 0.00 0.00 1.10 0.00 0.00 83.52 0.00 0.00 15.38
    10:09:15 PM all 1.09 0.00 1.09 0.00 0.00 85.87 0.00 0.00 11.96
    10:09:51 PM all 0.00 0.00 1.09 0.00 0.00 84.78 1.09 0.00 13.04
    Message from syslogd at Mar 9 22:09:51 … kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
    10:10:02 PM all 0.00 0.00 33.33 50.00 0.00 16.67 0.00 0.00 0.00
    10:10:03 PM all 3.16 0.00 10.53 8.42 0.00 2.11 1.05 0.00 74.74
    10:10:04 PM all 0.00 0.00 3.23 38.71 0.00 1.08 1.08 0.00 55.91
    10:10:05 PM all 0.00 0.00 4.30 11.83 0.00 3.23 1.08 0.00 79.57

    Typical load:

    10:22:15 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
    10:22:16 PM all 0.00 0.00 1.02 0.00 0.00 1.02 0.00 0.00 97.96
    10:22:17 PM all 0.00 0.00 0.00 0.00 0.00 0.00 1.04 0.00 98.96
    10:22:18 PM all 0.00 0.00 0.00 0.00 0.00 1.01 1.01 0.00 97.98
    10:22:19 PM all 0.00 0.00 1.01 0.00 0.00 1.01 0.00 0.00 97.98
    10:22:20 PM all 0.00 0.00 0.00 0.00 0.00 0.00 1.02 0.00 98.98
    10:22:21 PM all 0.00 0.00 1.02 0.00 0.00 1.02 0.00 0.00 97.96
    10:22:22 PM all 0.00 0.00 0.00 0.00 0.00 1.01 1.01 0.00 97.98

    I reverted to an older kernel, which had previously run for a couple of months without issues.

  • Reverting did not fix it. I have isolated the issue to a 100Mb/s vif rate limit applied to one of the guests (see the config sketch after these comments) and can now reproduce the lockups on a different machine.

    I will check whether this has already been fixed upstream; if so, I will submit a pull request for the Xen4CentOS kernel, and if not, I will take it up with the xen-devel list.

  • Yes, I was going to suggest posting this to xen-users; it's quite possible someone has already run across this.

    -George
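
For reference, a per-vif rate cap of the kind the second comment isolates is normally set in the guest's domain configuration. Below is a minimal sketch, assuming an xl/xm-style config file; the guest name, memory size, and bridge are hypothetical placeholders, since the actual guest settings were not posted:

# Hypothetical domU config fragment (xl/xm syntax); all names are placeholders.
name   = "guest1"
memory = 2048
vcpus  = 2
# The rate= option has the toolstack ask xen-netback to apply credit-based
# rate limiting to this vif; a 100Mb/s cap like this is what the second
# comment isolated as the trigger for the soft lockups.
vif = [ 'bridge=xenbr0, rate=100Mb/s' ]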