Panic / EL6 / KVM / Kernel-2.6.32-754.2.1.el6.x86_64


Since the update from kernel-2.6.32-754.2.1.el6.x86_64
to kernel-2.6.32-754.3.5.el6.x86_64 I cannot boot my KVM guests anymore. The workstation panics immediately!

I would not have expected this behavior now (in the last phase of this OS's life). The machine (an OptiPlex workstation) has been very robust until now. I see some KVM-related
lines in the changelog diff. Before swimming upstream:

Does anyone have problems related to KVM with kernel-2.6.32-754.3.5.el6.x86_64?

16 thoughts on - Panic / EL6 / KVM / Kernel-2.6.32-754.2.1.el6.x86_64

  • Not that I know of.
    * Does the problem go away if you back off to 2.1? (See the sketch below.)
    * And what does the panic message say?
    * What kind of OptiPlex workstation is it (memory / CPU type / cores)?
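
    A minimal sketch of backing off to the previous kernel on EL6, assuming
    kernel-2.6.32-754.2.1.el6.x86_64 is still installed (grubby ships with EL6,
    and the vmlinuz path below is the usual default):

    rpm -q kernel                  # confirm the older kernel is still installed
    grubby --set-default=/boot/vmlinuz-2.6.32-754.2.1.el6.x86_64
    reboot                         # comes back up on 754.2.1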

  • On 29.08.2018 at 23:46, Stephen John Smoogen wrote:

    Yes

    I will try to grab some lines from the console tomorrow.
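
    (If the panic never makes it into the logs, netconsole can stream it to another box.
    A rough sketch, assuming a second machine at 192.168.1.10 -- a hypothetical address --
    reachable via eth0:

    # on the receiving machine (traditional netcat wants: nc -u -l -p 6666)
    nc -u -l 6666

    # on the crashing host, before starting the guest
    modprobe netconsole netconsole=@/eth0,6666@192.168.1.10/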

    # virsh sysinfo
    <sysinfo type='smbios'>
      <bios>
        <entry name='vendor'>Dell Inc.</entry>
        <entry name='version'>A19</entry>
        <entry name='date'>05/31/2011</entry>
        <entry name='release'>18.0</entry>
      </bios>
      <system>
        <entry name='manufacturer'>Dell Inc.</entry>
        <entry name='product'>OptiPlex 755</entry>
        <entry name='version'>Not Specified</entry>
        <entry name='serial'>Not Specified</entry>
        <entry name='uuid'>XXXX-3700-1058-8047-XXXX</entry>
        <entry name='sku'>Not Specified</entry>
        <entry name='family'>Not Specified</entry>
      </system>
      <processor>
        <entry name='socket_destination'>CPU</entry>
        <entry name='type'>Central Processor</entry>
        <entry name='family'>Core 2 Duo</entry>
        <entry name='manufacturer'>Intel</entry>
        <entry name='signature'>Type 0, Family 6, Model 15, Stepping 11</entry>
        <entry name='version'>Not Specified</entry>
        <entry name='external_clock'>1333 MHz</entry>
        <entry name='max_speed'>5200 MHz</entry>
        <entry name='status'>Populated, Enabled</entry>
        <entry name='serial_number'>Not Specified</entry>
        <entry name='part_number'>Not Specified</entry>
      </processor>
      <memory_device>
        <entry name='size'>2048 MB</entry>
        <entry name='form_factor'>DIMM</entry>
        <entry name='locator'>DIMM_1</entry>
        <entry name='bank_locator'>Not Specified</entry>
        <entry name='type'>DDR2</entry>
        <entry name='type_detail'>Synchronous</entry>
        <entry name='speed'>800 MHz</entry>
        <entry name='manufacturer'>CE00000000000000</entry>
        <entry name='serial_number'>DELETED</entry>
        <entry name='part_number'>M3 78T5663DZ3-CF7</entry>
      </memory_device>
      <memory_device>
        <entry name='size'>2048 MB</entry>
        <entry name='form_factor'>DIMM</entry>
        <entry name='locator'>DIMM_3</entry>
        <entry name='bank_locator'>Not Specified</entry>
        <entry name='type'>DDR2</entry>
        <entry name='type_detail'>Synchronous</entry>
        <entry name='speed'>800 MHz</entry>
        <entry name='manufacturer'>7F98000000000000</entry>
        <entry name='serial_number'>DELETED</entry>
      </memory_device>
      <memory_device>
        <entry name='size'>2048 MB</entry>
        <entry name='form_factor'>DIMM</entry>
        <entry name='locator'>DIMM_2</entry>
        <entry name='bank_locator'>Not Specified</entry>
        <entry name='type'>DDR2</entry>
        <entry name='type_detail'>Synchronous</entry>
        <entry name='speed'>800 MHz</entry>
        <entry name='manufacturer'>CE00000000000000</entry>
        <entry name='serial_number'>DELETED</entry>
        <entry name='part_number'>M3 78T5663DZ3-CF7</entry>
      </memory_device>
      <memory_device>
        <entry name='size'>2048 MB</entry>
        <entry name='form_factor'>DIMM</entry>
        <entry name='locator'>DIMM_4</entry>
        <entry name='bank_locator'>Not Specified</entry>
        <entry name='type'>DDR2</entry>
        <entry name='type_detail'>Synchronous</entry>
        <entry name='speed'>800 MHz</entry>
        <entry name='manufacturer'>7F98000000000000</entry>
        <entry name='serial_number'>DELETED</entry>
      </memory_device>
    </sysinfo>

  • So looking at the kernel changelog, there are a lot of KVM changes which look related to Spectre and the associated CVEs. All of them seem to have landed in a non-released kernel, so I am going to guess you are tickling one of them; hopefully the oops will help figure out which. The only other thing I would wonder about is whether a BIOS update is needed again because of Spectre, but that would be the last thing to try.

    * Tue Jul 31 2018 Phillip Lougher [2.6.32-754.3.2.el6]
    - [kvm] VMX: Fix host GDT.LIMIT corruption (CVE-2018-10901) (Paolo Bonzini) [1601851] {CVE-2018-10901}
    ..
    - [x86] KVM/VMX: Initialize the vmx_l1d_flush_pages' content (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] kvm: Don't flush L1D cache if VMENTER_L1D_FLUSH_NEVER (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] kvm: Take out the unused nosmt module parameter (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] bugs, kvm: Introduce boot-time control of L1TF mitigations (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] kvm: Allow runtime control of L1D flush (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] kvm: Serialize L1D flush parameter setter (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] kvm: Move l1tf setup function (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] kvm: Drop L1TF MSR list approach (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM/VMX: Use MSR save list for IA32_FLUSH_CMD if required (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM/VMX: Extend add_atomic_switch_msr() to allow VMENTER only MSRs (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM/VMX: Separate the VMX AUTOLOAD guest/host number accounting (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM/VMX: Add find_msr() helper function (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM/VMX: Split the VMX MSR LOAD structures to have an host/guest numbers (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM/VMX: Add L1D flush logic (Waiman Long) [1593376] {CVE-2018-3620}
    - [kvm] VMX: Make indirect call speculation safe (Waiman Long) [1593376] {CVE-2018-3620}
    - [kvm] VMX: Enable acknowledge interupt on vmexit (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM/VMX: Add L1D MSR based flush (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM/VMX: Add L1D flush algorithm (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM/VMX: Add module argument for L1TF mitigation (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM: Warn user if KVM is loaded SMT and L1TF CPU bug being present (Waiman Long) [1593376] {CVE-2018-3620}
    - [kvm] x86: Introducing kvm_x86_ops VM init/destroy hooks (Waiman Long) [1593376] {CVE-2018-3620}
    ... it keeps going and going. rpm -q --changelog kernel-2.6.32-754.3.5.el6 will give you the gory details.
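
    To pull out just the KVM-related entries instead of the full changelog, something like
    this works (a sketch; adjust the NVR to whatever kernel is actually installed):

    rpm -q --changelog kernel-2.6.32-754.3.5.el6 | grep -iE 'kvm|vmx|l1tf'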

  • Leon Fauster via CentOS writes:

    Yes, the exact same thing happened here, and I suspect it is related to older cpus that don’t get any Spectre/Meltdown updates.

    IBM x3250
    Intel(R) Xeon(R) CPU E3110 @ 3.00GHz

    This is a dual-core CPU of similar vintage to yours (can we have a model #?), pre-2010.

    There goes a cheap and reliable VM dev machine :-/

  • On 30.08.2018 at 10:54, isdtor wrote:

    Thanks for the feedback. I was assuming that some kind of Spectre/Meltdown fix is causing this.

    processor       : 1
    vendor_id       : GenuineIntel
    cpu family      : 6
    model           : 15
    model name      : Intel(R) Core(TM)2 Duo CPU E6850 @ 3.00GHz
    stepping        : 11
    microcode       : 186
    cpu MHz         : 2000.000
    cache size      : 4096 KB
    physical id     : 0
    siblings        : 2
    core id         : 1
    cpu cores       : 2
    apicid          : 1
    initial apicid  : 1
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 10
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf eagerfpu pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm pti retpoline tpr_shadow vnmi flexpriority
    bogomips        : 5984.84
    clflush size    : 64
    cache_alignment : 64
    address sizes   : 36 bits physical, 48 bits virtual
    power management:

    No way. Should all IT departments trash a big percentage of their hardware now?
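
    On kernels that carry the backports, the kernel's own view of the mitigations can be
    read from sysfs; a minimal sketch (the l1tf entry only shows up on kernels that include
    the August 2018 L1TF work, such as 754.3.5):

    # one line per known vulnerability, with the active mitigation state
    grep . /sys/devices/system/cpu/vulnerabilities/*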

  • Does downgrading qemu, as I proposed in the other mail, fix it in your case?

    I’m interested because in my case I’m having the issue on two older AMD
    CPUs, not Intel.

    I second that, I really hope this will be fixed.

    Regards, Simon
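
    For reference, the downgrade proposed above can be done with yum, assuming the older
    el6_9 packages are still available in a configured repo (a sketch, not a guaranteed fix):

    yum downgrade qemu-kvm-0.12.1.2-2.503.el6_9.6 qemu-img-0.12.1.2-2.503.el6_9.6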

  • On 30.08.2018 at 06:37, Simon Matter wrote:

    Thanks for pointing to that bug report.

    I downgraded from

    qemu-kvm-0.12.1.2-2.506.el6_10.1.x86_64
    qemu-img-0.12.1.2-2.506.el6_10.1.x86_64

    to

    qemu-img-0.12.1.2-2.503.el6_9.6.x86_64
    qemu-kvm-0.12.1.2-2.503.el6_9.6.x86_64

    and booted into

    kernel-2.6.32-754.3.5.el6.x86_64

    unfortunately with no positive effect.

    The system/host/workstation crashes/panics immediately after starting a guest system (EL6) with virsh start guest-name.

    BTW upstream bug report:

    https://bugzilla.redhat.com/show_bug.cgi?id23692

  • Simon, downgrading does not fix the problem in my case. I upgraded to the 3.5 kernel, disabled libvirtd and libvirt-guests, and rebooted. Once the machine was up, I started the libvirtd service and the machine crashed immediately.

    Old: qemu-img-0.12.1.2-2.506.el6_10.1, qemu-kvm-0.12.1.2-2.506.el6_10.1 (crash on 2.6.32-754.3.5)
    New: qemu-img-0.12.1.2-2.503.el6_9.6, qemu-kvm-0.12.1.2-2.503.el6_9.6 (crash on 2.6.32-754.3.5)

    In Leon's and my case, the culprit is kernel 2.6.32-754.3.5. In CentOS bug 0015067, the bad kernel is 2.6.32-754.2.1, which works fine here. The difference might be how different CPUs are handled.
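
    For anyone wanting to reproduce that test, the EL6 SysV steps are roughly as follows
    (a sketch; libvirtd and libvirt-guests are the service names libvirt ships on EL6):

    chkconfig libvirtd off
    chkconfig libvirt-guests off
    reboot
    # after the box is back up:
    service libvirtd start    # on the bad kernel the host panics around this point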

  • I am going to say, from the chip and OEM manufacturers' point of view: yes. For at least the last 15 years, they have priced their hardware to have a 4-6 year lifetime. Consumers of said hardware are supposed to plan around that with magical budget money and replace their hardware regularly. Since people rarely have said money, we have mostly gotten away with not having to do so, because big things like this don't happen very often.
    [The last big one was all the working hardware people had to get rid of for Y2K.]

    The fixes for the old hardware are going to be problematic for a lot of different reasons (Intel isn't fixing its microcode, backporting deep kernel rewrites to very old kernels tends to crash a lot, etc.). I would recommend one of the following strategies:

    1. Let your budget know that there will be a lot of replacements coming up. Replace hardware as you can.
    2. Make a decision about what your security risk is for this problem, stick to an old kernel (see the sketch after this list), and put virtual systems which match your security risk on the old hardware.
    3. Test a newer kernel/release on the hardware and see whether the problem still occurs. If it does, it is doubtful that a fix can be backported until it is fixed in the newer version. If it doesn't, that might help narrow down where the breakage is.
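
    For option 2, a minimal sketch of pinning a known-good kernel on EL6, assuming the
    yum-plugin-versionlock package is available (an exclude=kernel* line in /etc/yum.conf
    is a cruder alternative):

    yum install yum-plugin-versionlock
    yum versionlock kernel-2.6.32-754.2.1.el6
    grubby --set-default=/boot/vmlinuz-2.6.32-754.2.1.el6.x86_64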

  • Stephen John Smoogen wrote:

    Heh, heh. We’re starting to replace the servers that we got when I was first here… in ’09 and ’10 and ’11. But then, I’m a contractor at a US
    federal gov’t agency in the civilian sector, and budgets, um, right, LOL. Next time someone complains about “waste of tax dollars”, why, just last year, or was it earlier this year, we finally retired a few servers that had actual SCSI drives….

    The only problems we’ve had on the latest C 7 kernels *seem* to be related to a specific Intel chip or two. Otherwise, the older servers work just fine.

    mark

  • On 30.08.2018 at 20:28, Simon Matter wrote:

    It seems that the default bugzilla classification doesn't let bug reports associated with the kernel be read publicly, or some such ... so, to summarize the status briefly: fortunately, they can reproduce the problem and are now trying to find the cause. So, +1!

  • Follow-up: The current kernel (kernel-2.6.32-754.6.3.el6.x86_64) seems to be usable again. No issues so far …
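
    For anyone following along, a quick sanity check before re-enabling guests (a sketch,
    reusing the guest name from earlier in the thread):

    uname -r                  # expect 2.6.32-754.6.3.el6.x86_64
    virsh start guest-name    # should come up without a host panic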