Xen C6 Kernel 4.9.13 And Testing 4.9.15 Only Reboots.

Updating my CentOS 6.8 Xen server to the new 4.9.13 kernel yields a few kernel boot messages like “APIC ID MISMATCH”, and then the system reboots immediately without any other information. This is on a Dell R710 with 64GB RAM and 2x 6-core Intel CPUs. As an additional test, I installed and tried to run the current “testing” kernel, 4.9.16, with exactly the same results.

Does anyone have an idea? The 3.18.x series runs without issue, of course.

Thanks

PJ

38 thoughts on - Xen C6 Kernel 4.9.13 And Testing 4.9.15 Only Reboots.

  • Try this kernel (the noarch kernel-doc package is not built yet, but it is not a required package):

    https://people.CentOS.org/hughesjr/4.9.16/x86_64/

    Let me know if that works or not .. we can try adjusting some other config settings.

    Don’t worry about the CentOS.plus dist tag .. that will change when we submit it via the regular process.

    Thanks, Johnny Hughes

  • I think the APIC ID MISMATCH is an expected and ignorable error .. see:

    https://patchwork.kernel.org/patch/9539933/

    I applied that patch and I am building a 4.9.16-23 right now; I’ll publish it when it finishes. Maybe with that error gone we can get a more useful error on the console.

  • Try the new 4.9.16-24 packages there now (I reworked the config based on a Fedora kernel).

  • The “noreboot” grub.conf option still produced nothing other than a flashing cursor in the top left. Also, neither Num Lock nor Caps Lock responds at that point… I seem no closer to helpful information other than “it’s broken” :(
    Here is the grub.conf stanza for the kernel:
    title CentOS (4.9.16-24.el6.CentOS.plus.x86_64)
        root (hd0,1)
        kernel /boot/xen.gz dom0_mem=3G,max:3G cpuinfo com1=115200,8n1 console=com1,tty loglvl=all guest_loglvl=all noreboot
        module /boot/vmlinuz-4.9.16-24.el6.CentOS.plus.x86_64 ro root=UUID=bc0727e1-882c-4fbc-a4d9-e4cf754d72b7 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet reboot=pci max_loop=64
        module /boot/initramfs-4.9.16-24.el6.CentOS.plus.x86_64.img

    Thanks PJ

  • The last few lines are:
    NMI watchdog: disabled CPU0 hardware events not enabled
    NMI watchdog: shutting down hard lockup detector on all CPUS
    installing Xen timer for CPU1
    installing Xen timer for CPU2
    installing Xen timer for CPU3
    installing Xen timer for CPU4
    installing Xen timer for CPU5
    installing Xen timer for CPU6

    Here is the screen shot:
    https://goo.gl/photos/yNQqaQY9bJBWQ84X8
    It stops at CPU6. This is a dual-socket server with 2x 6-core L5639 CPUs (HT disabled), so I’m surprised to see it stop at 6.

    Thanks PJ

  • As a follow-up, I was able to test fresh CentOS 7.3 installs on a Dell R710 and a Dell R620, and both ran the new kernel without issue. My new plan is to just move this C6 system to one of the C7 systems I just created.

  • That sounds like a compiler problem, since I think the C6 and C7 kernels are built from the same source.

    –Sarah

  • The mystery gets more interesting… I now have a CentOS 7.3 Dell R710 server doing exactly the same thing: rebooting immediately after the Xen kernel loads. Just to note, this is a second system, not the first system after an update. I hope I’m not introducing something odd. The only “interesting” thing I have done, for historical reasons, is to change the following /etc/sysconfig/grub line:
    GRUB_CMDLINE_XEN_DEFAULT="dom0_mem=6G,max:8G cpuinfo com1=115200,8n1 console=com1,tty loglvl=all guest_loglvl=all"
    But I’ve done that on other servers without issue. In fact I have a Dell R710 that DOES work with CentOS 7 and the new kernel… so confused.

  • I ran into this also.

    Back up to an older kernel. At least, that was my solution until a kernel came out that would boot.

    It seems that some kernel builds are not friendly to Xen.

  • If the two machines are the same model, maybe their BIOS versions differ? Different disk controllers or controller modes? Different NICs or other add-on cards?

  • PJ,

    Thanks for your testing and report. Would you mind reporting this on xen-devel? If there’s actually a bug in the Linux 4.9.x boot path under Xen on your box, I don’t think Johnny or I are going to be able to help you debug it. :-)

    -George

  • OK, I have a new CentOS-6 4.9.20-26 kernel here for testing:

    https://people.CentOS.org/hughesjr/4.9.16/6/x86_64/

    I am building the el7 one right now as well, it will be at:

    https://people.CentOS.org/hughesjr/4.9.16/7/x86_64/

    George and I found some issues with the 4.9.x config files for the Xen kernel. Hopefully this one is much more stable, as it now has many changes from the Fedora/RHEL-type configs (what is built into the kernel, what is loaded as a kernel module, etc.).

    Please test these kernels so we can get them released.

    Thanks, Johnny Hughes

  • CentOS-6 4.9.20-26 kernel exhibits the same constant kernel-start-then-reboot issue when booting under the “CentOS Linux, with Xen hypervisor” grub2 menu option. However, it *does* properly boot under the “CentOS Linux (4.9.20-25.el7.x86_64) 7 (Core)” grub2 menu option!

    A fairly close look at /etc/grub2.cfg shows no discernible difference between the properly functioning Dell R620 and the failing Dell R710.

    Sorry, I had been distracted with other issues and have not yet submitted information to the xen-devel group.

    Thanks PJ

  • So interesting, and challenging too. It seems Xen compatibility is related to the Dell board BIOS.

    I have a Dell R710 and an R720 with the same Xen version, but the outcomes are completely different: the R720 performs slowly and reboots too.

    xlord

  • I am not having similar issues with several HP ProLiant DL160p’s and CentOS 6. They are either G5 or G6, I don’t recall which.

    –Sarah

  • List moderator: feel free to delete my previous large message with attachments that’s in the moderation queue…it’s now obsolete anyway.

    I have found a fix/workaround for my reboot issues with Xen 4.6.3-12 + Kernel 4.9.13:

    Once I finally got serial output all the way through the boot process (Xen + dom0), I discovered the stack trace:

    [Firmware Bug]: CPU7: APIC id mismatch. Firmware: 0 APIC: 7
    installing Xen timer for CPU 8
    [Firmware Bug]: CPU8: APIC id mismatch. Firmware: 0 APIC: 20
    smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.
    ------------[ cut here ]------------

  • Dave,

    Take a look at this kernel, as it is the one I think we are going to release (or a slightly newer 4.9.2x from the kernel.org LTS series). This version has some newer settings that are closer to the Red Hat/Fedora/CentOS base kernel with respect to what is a module and what is built into the kernel, etc.

    https://people.CentOS.org/hughesjr/4.9.x/

    Thanks, Johnny Hughes

  • Sad to say, I already tested 4.9.20-26 from your repo yesterday… it does look a little cleaner before it dies, but it still dies. I have not tested it with the vcpu=4 workaround, but I can tonight if you would like. Relevant bits below:

    Loading Xen 4.6.3-12.el7 ...
    Loading Linux 4.9.20-26.el7.x86_64 ...
    Loading initial ramdisk ...
    [ 0.000000] Linux version 4.9.20-26.el7.x86_64 (mockbuild@) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Tue Apr 4 11:19:26 CDT 2017

    [ 6.195089] smpboot: Max logical packages: 1
    [ 6.199549] VPMU disabled by hypervisor.
    [ 6.203663] Performance Events: SandyBridge events, PMU not available due to virtualization, using software events only.
    [ 6.215436] NMI watchdog: disabled (cpu0): hardware events not enabled
    [ 6.222139] NMI watchdog: Shutting down hard lockup detector on all cpus
    [ 6.229165] installing Xen timer for CPU 1
    [ 6.233849] installing Xen timer for CPU 2
    [ 6.238504] installing Xen timer for CPU 3
    [ 6.243139] installing Xen timer for CPU 4
    [ 6.247836] installing Xen timer for CPU 5
    [ 6.252478] installing Xen timer for CPU 6
    [ 6.257155] installing Xen timer for CPU 7
    [ 6.261795] installing Xen timer for CPU 8
    [ 6.266358] smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.
    [ 6.272736] ------------[ cut here ]------------
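    When comparing serial captures from several machines, it can help to pull out just the topology-related lines. The following is a minimal sketch of a hypothetical helper (the function name and dictionary keys are made up for illustration). It assumes the console output has been saved as plain text and only matches the message formats quoted in this thread, accepting both “CPU 8” and “CPU6” spacing.

```python
import re

# Message formats copied from the serial-console captures quoted in this thread.
TIMER_RE = re.compile(r"installing Xen timer for CPU\s*(\d+)")
MAX_PKG_RE = re.compile(r"smpboot: Max logical packages: (\d+)")
EXCEEDS_RE = re.compile(
    r"smpboot: Package (\d+) of CPU (\d+) exceeds BIOS package data (\d+)"
)


def summarize_boot_log(log_text):
    """Collect the CPU-topology lines from a captured dom0 boot log (hypothetical helper)."""
    summary = {"max_logical_packages": None, "last_timer_cpu": None, "exceeds": None}
    for line in log_text.splitlines():
        m = MAX_PKG_RE.search(line)
        if m:
            summary["max_logical_packages"] = int(m.group(1))
        m = TIMER_RE.search(line)
        if m:
            # Remember the last CPU that got a Xen timer before the log stops.
            summary["last_timer_cpu"] = int(m.group(1))
        m = EXCEEDS_RE.search(line)
        if m:
            package, cpu, bios_packages = (int(g) for g in m.groups())
            summary["exceeds"] = {
                "package": package,
                "cpu": cpu,
                "bios_packages": bios_packages,
            }
    return summary
```

    Run against the 4.9.20-26 capture above, it would report one logical package, CPU 8 as the last CPU to get a timer, and the package 1 / CPU 8 warning, which puts the two machines' logs side by side in a single line each.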

  • So, strangely, I have two _identical_ dual-processor Xeon motherboards (same BIOS/IPMI versions; they even share an enclosure, one on the right side and the other on the left), each with a different CPU/memory configuration:

    Using 4.9.13 with vcpu limited to 4, early in the boot process, the one that _was_ booting before setting the xen vcpu args says:
    “[ 7.060720] smpboot: Max logical packages: 2”,

    and the other one says
    “[ 6.195089] smpboot: Max logical packages: 1”

    They both have dual procs, known working/good.

    The first (the one that worked unmodified) has dual 8 core (16 HT/ea) and correctly detects “[ 0.000000] smpboot: Allowing 32 CPUs, 0 hotplug CPUs”. It’s a Xeon E5-2665v1.

    The second machine (didn’t work without the xen vcpu args) has dual 4 core (8ht/ea) and also correctly detects “[ 0.000000] smpboot: Allowing 16 CPUs, 0 hotplug CPUs”. It’s a Xeon E5-2643v1…so it seems like this one does ok until it decides there’s only one cpu package?

    Thanks,
    -Dave
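    The failing boxes share a pattern worth checking: Dave’s 2x E5-2643 v1 has 16 logical CPUs (8 per package) and dies at CPU 8 after reporting “Max logical packages: 1”, while PJ’s 2x L5639 with HT disabled has 12 CPUs (6 per package) and stops at CPU 6. The arithmetic sketch below assumes logical CPUs are numbered one package after another (CPUs 0 to n-1 on package 0, n upward on package 1). That ordering is consistent with the timer messages in the logs but is an assumption, not the kernel’s actual enumeration code, and PJ’s capture never shows a “Max logical packages” value, so the 1 used for his machine is assumed.

```python
# Simplified model for illustration only; NOT the kernel's smpboot code.
# Assumption: logical CPUs are numbered contiguously, package by package.


def package_of(cpu, cpus_per_package):
    """Package index of a logical CPU under the contiguous-numbering assumption."""
    return cpu // cpus_per_package


def first_rejected_cpu(total_cpus, cpus_per_package, max_logical_packages):
    """First CPU whose package index is not below max_logical_packages, or None."""
    for cpu in range(total_cpus):
        if package_of(cpu, cpus_per_package) >= max_logical_packages:
            return cpu
    return None


# (total logical CPUs, logical CPUs per package, "Max logical packages" value)
machines = {
    "2x E5-2665, HT on (boots)": (32, 16, 2),
    "2x E5-2643 v1, HT on (fails)": (16, 8, 1),
    "2x L5639, HT off (fails; packages value assumed)": (12, 6, 1),
}

for name, (total, per_package, max_packages) in machines.items():
    rejected = first_rejected_cpu(total, per_package, max_packages)
    print(f"{name}: first rejected CPU = {rejected}")
```

    Under these assumptions the model lands on CPU 8 and CPU 6, the same points where the two failing boots stop, and rejects nothing when the reported package count matches the two sockets. That fits Dave’s reading that the problem is the kernel deciding there is only one CPU package, rather than anything specific to an individual CPU.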

  • I also just realized the C6 portion of the title/subject line here refers to CentOS 6, so I’d like to clarify that all my testing/issues/etc. were under CentOS 7.3 with all patches applied.

    Thanks,
    -Dave

  • Dave,

    Just for testing purposes, can you try booting the kernel in the normal way on the machine that does not work (a normal grub entry for the kernel with no xen.gz line)?

    That way, we can hopefully narrow the issue down to a hypervisor issue or a kernel config issue.

    Thanks, Johnny Hughes

  • There was a note that the non-Xen kernel at the same kernel version did indeed boot:
    “CentOS-6 4.9.20-26 kernel exhibits the same constant kernel-start-then-reboot issue when booting under the “CentOS Linux, with Xen hypervisor” grub2 menu option. However, it *does* properly boot under the “CentOS Linux (4.9.20-25.el7.x86_64) 7 (Core)” grub2 menu option!”

    Trying to get back into being able to test this more.

    Thanks PJ

  • Just to note, the same pattern happens on C7:
    “CentOS Linux, with Xen hypervisor” = reboot
    “CentOS Linux (4.9.20-26.el7.x86_64) 7 (Core)” = boot

    [root@XXX ~]# uname -a
    Linux XXX 4.9.20-25.el7.x86_64 #1 SMP Fri Mar 31 08:53:28 CDT 2017 x86_64 x86_64 x86_64

  • Apologies: I had installed the newer -26 kernel but had not rebooted into it. The grub2 menu item should have been “CentOS Linux (4.9.20-25.el7.x86_64) 7 (Core)”. I am currently restarting that remote affected system (unmodified grub2 entry first).

    Thanks PJ

  • Here is something interesting… I went through the BIOS options and found that the R710 that *is* functioning differed in only one way: “Logical Processor” (Hyperthreading) was *enabled*, while the one that is *not* functioning had HT *disabled*. After enabling Logical Processor, the system starts without issue! I’ve rebooted 3 times now without issue.
    Dell R710 BIOS version 6.4.0
    2x Intel(R) Xeon(R) CPU L5639 @ 2.13GHz
    4.9.20-26.el7.x86_64 #1 SMP Tue Apr 4 11:19:26 CDT 2017 x86_64 x86_64 x86_64 GNU/Linux
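    Since the deciding difference turned out to be the BIOS “Logical Processor” setting, one quick way to compare hosts without entering the BIOS is to read the x86 /proc/cpuinfo fields “siblings” (logical CPUs per package) and “cpu cores” (physical cores per package); siblings greater than cpu cores usually means HT is on. The sketch below is a hypothetical helper built on that heuristic. It is probably best run while booted into the plain (non-Xen) kernel entry, since under a Xen dom0 /proc/cpuinfo may describe dom0’s virtual CPUs rather than the physical topology.

```python
from pathlib import Path


def hyperthreading_enabled(cpuinfo_text):
    """Heuristic: True if 'siblings' exceeds 'cpu cores', False if it does not,
    None if either field is missing (e.g. non-x86 or some virtualized setups)."""
    siblings = cores = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        # Use the first occurrence of each field; every CPU block repeats them.
        if key == "siblings" and siblings is None:
            siblings = int(value)
        elif key == "cpu cores" and cores is None:
            cores = int(value)
        if siblings is not None and cores is not None:
            return siblings > cores
    return None


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print("HT enabled:", hyperthreading_enabled(cpuinfo.read_text()))
```

    On an R710 with two 6-core L5639s, HT enabled should show up as siblings 12 against cpu cores 6, and HT disabled as 6 against 6.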

  • Outstanding .. I have now released a 4.9.23-26.el6 and .el7 to the system as normal updates. It should be available later today.

  • I’ve verified with a second Dell R710 that disabling Hyperthreading/Logical Processor causes the primary Xen-booting kernel to fail and reboot. Conversely, enabling it allows the system to start as expected and without any issue. The kernel tested was:
    4.9.13-22.el7.x86_64 #1 SMP Sun Feb 26 22:15:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

    I just attempted an update and the 4.9.23-26 is not yet up. Does this update address the Hyperthreading issue in any way?

    Thanks PJ

  • I don’t think so .. at least I did not specifically add anything to do so.

    You can get it here for testing:

    https://buildlogs.CentOS.org/CentOS/7/virt/x86_64/xen/

    (or from /6/ as well for CentOS-6)

    Not sure why it did not go out on the signing run .. will check that server.
