Panic / EL6 / KVM / Kernel-2.6.32-754.2.1.el6.x86_64


Since the update from kernel-2.6.32-754.2.1.el6.x86_64
to kernel-2.6.32-754.3.5.el6.x86_64 I cannot boot my KVM guests anymore. The workstation panics immediately!

I would not have expected this behavior now (in the last phase of this OS's life). The machine (an OptiPlex workstation) has been very robust until now. I see some KVM-related
lines in the changelog diff. Before swimming upstream:

Does anyone have problems related to KVM with kernel-2.6.32-754.3.5.el6.x86_64?

16 thoughts on - Panic / EL6 / KVM / Kernel-2.6.32-754.2.1.el6.x86_64

  • Not that I know of.
    * Does the problem go away if you back off to 2.1? (See the sketch below.)
    * And what does the panic message say?
    * What kind of OptiPlex workstation is it (memory / CPU type / cores)?
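
    A minimal sketch of backing off to the previous kernel on EL6, assuming
    kernel-2.6.32-754.2.1.el6.x86_64 is still installed (grubby ships with EL6,
    and the vmlinuz path below is the usual default):

    rpm -q kernel                  # confirm the older kernel is still installed
    grubby --set-default=/boot/vmlinuz-2.6.32-754.2.1.el6.x86_64
    reboot                         # comes back up on 754.2.1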

  • On 29.08.2018 at 23:46, Stephen John Smoogen wrote:

    Yes

    I will try to grab some lines from the console tomorrow.
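
    (If the panic never makes it into the logs, netconsole can stream it to another box.
    A rough sketch, assuming a second machine at 192.168.1.10 -- a hypothetical address --
    reachable via eth0:

    # on the receiving machine (traditional netcat wants: nc -u -l -p 6666)
    nc -u -l 6666

    # on the crashing host, before starting the guest
    modprobe netconsole netconsole=@/eth0,6666@192.168.1.10/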

    # virsh sysinfo
    <sysinfo type='smbios'>
      <bios>
        <entry name='vendor'>Dell Inc.</entry>
        <entry name='version'>A19</entry>
        <entry name='date'>05/31/2011</entry>
        <entry name='release'>18.0</entry>
      </bios>
      <system>
        <entry name='manufacturer'>Dell Inc.</entry>
        <entry name='product'>OptiPlex 755</entry>
        <entry name='version'>Not Specified</entry>
        <entry name='serial'>Not Specified</entry>
        <entry name='uuid'>XXXX-3700-1058-8047-XXXX</entry>
        <entry name='sku'>Not Specified</entry>
        <entry name='family'>Not Specified</entry>
      </system>
      <processor>
        <entry name='socket_destination'>CPU</entry>
        <entry name='type'>Central Processor</entry>
        <entry name='family'>Core 2 Duo</entry>
        <entry name='manufacturer'>Intel</entry>
        <entry name='signature'>Type 0, Family 6, Model 15, Stepping 11</entry>
        <entry name='version'>Not Specified</entry>
        <entry name='external_clock'>1333 MHz</entry>
        <entry name='max_speed'>5200 MHz</entry>
        <entry name='status'>Populated, Enabled</entry>
        <entry name='serial_number'>Not Specified</entry>
        <entry name='part_number'>Not Specified</entry>
      </processor>
      <memory_device>
        <entry name='size'>2048 MB</entry>
        <entry name='form_factor'>DIMM</entry>
        <entry name='locator'>DIMM_1</entry>
        <entry name='bank_locator'>Not Specified</entry>
        <entry name='type'>DDR2</entry>
        <entry name='type_detail'>Synchronous</entry>
        <entry name='speed'>800 MHz</entry>
        <entry name='manufacturer'>CE00000000000000</entry>
        <entry name='serial_number'>DELETED</entry>
        <entry name='part_number'>M3 78T5663DZ3-CF7</entry>
      </memory_device>
      <memory_device>
        <entry name='size'>2048 MB</entry>
        <entry name='form_factor'>DIMM</entry>
        <entry name='locator'>DIMM_3</entry>
        <entry name='bank_locator'>Not Specified</entry>
        <entry name='type'>DDR2</entry>
        <entry name='type_detail'>Synchronous</entry>
        <entry name='speed'>800 MHz</entry>
        <entry name='manufacturer'>7F98000000000000</entry>
        <entry name='serial_number'>DELETED</entry>
      </memory_device>
      <memory_device>
        <entry name='size'>2048 MB</entry>
        <entry name='form_factor'>DIMM</entry>
        <entry name='locator'>DIMM_2</entry>
        <entry name='bank_locator'>Not Specified</entry>
        <entry name='type'>DDR2</entry>
        <entry name='type_detail'>Synchronous</entry>
        <entry name='speed'>800 MHz</entry>
        <entry name='manufacturer'>CE00000000000000</entry>
        <entry name='serial_number'>DELETED</entry>
        <entry name='part_number'>M3 78T5663DZ3-CF7</entry>
      </memory_device>
      <memory_device>
        <entry name='size'>2048 MB</entry>
        <entry name='form_factor'>DIMM</entry>
        <entry name='locator'>DIMM_4</entry>
        <entry name='bank_locator'>Not Specified</entry>
        <entry name='type'>DDR2</entry>
        <entry name='type_detail'>Synchronous</entry>
        <entry name='speed'>800 MHz</entry>
        <entry name='manufacturer'>7F98000000000000</entry>
        <entry name='serial_number'>DELETED</entry>
      </memory_device>
    </sysinfo>

  • So looking at the kernel changelog, there are a lot of KVM changes which look related to Spectre and the associated CVEs. All of them seem to have landed in a non-released kernel, so I am going to guess you are tickling one of them; hopefully the oops will help figure out which. The only other thing I would wonder about is whether a BIOS update is needed again because of Spectre, but that would be the last thing to try.

    * Tue Jul 31 2018 Phillip Lougher [2.6.32-754.3.2.el6]
    - [kvm] VMX: Fix host GDT.LIMIT corruption (CVE-2018-10901) (Paolo Bonzini) [1601851] {CVE-2018-10901}
    ..
    - [x86] KVM/VMX: Initialize the vmx_l1d_flush_pages' content (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] kvm: Don't flush L1D cache if VMENTER_L1D_FLUSH_NEVER (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] kvm: Take out the unused nosmt module parameter (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] bugs, kvm: Introduce boot-time control of L1TF mitigations (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] kvm: Allow runtime control of L1D flush (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] kvm: Serialize L1D flush parameter setter (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] kvm: Move l1tf setup function (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] kvm: Drop L1TF MSR list approach (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM/VMX: Use MSR save list for IA32_FLUSH_CMD if required (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM/VMX: Extend add_atomic_switch_msr() to allow VMENTER only MSRs (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM/VMX: Separate the VMX AUTOLOAD guest/host number accounting (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM/VMX: Add find_msr() helper function (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM/VMX: Split the VMX MSR LOAD structures to have an host/guest numbers (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM/VMX: Add L1D flush logic (Waiman Long) [1593376] {CVE-2018-3620}
    - [kvm] VMX: Make indirect call speculation safe (Waiman Long) [1593376] {CVE-2018-3620}
    - [kvm] VMX: Enable acknowledge interupt on vmexit (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM/VMX: Add L1D MSR based flush (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM/VMX: Add L1D flush algorithm (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM/VMX: Add module argument for L1TF mitigation (Waiman Long) [1593376] {CVE-2018-3620}
    - [x86] KVM: Warn user if KVM is loaded SMT and L1TF CPU bug being present (Waiman Long) [1593376] {CVE-2018-3620}
    - [kvm] x86: Introducing kvm_x86_ops VM init/destroy hooks (Waiman Long) [1593376] {CVE-2018-3620}
    ... it keeps going and going. rpm -q --changelog kernel-2.6.32-754.3.5.el6 will give you the gory details.
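
    To pull out just the KVM-related entries instead of the full changelog, something like
    this works (a sketch; adjust the NVR to whatever kernel is actually installed):

    rpm -q --changelog kernel-2.6.32-754.3.5.el6 | grep -iE 'kvm|vmx|l1tf'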

  • Leon Fauster via CentOS writes:

    Yes, the exact same thing happened here, and I suspect it is related to older cpus that don’t get any Spectre/Meltdown updates.

    IBM x3250
    Intel(R) Xeon(R) CPU E3110 @ 3.00GHz

    This is a dual-core CPU of similar vintage to yours (can we have a model #?), pre-2010.

    There goes a cheap and reliable VM dev machine :-/

  • On 30.08.2018 at 10:54, isdtor wrote:

    Thanks for the feedback. I was assuming that some kind of Spectre/Meltdown fix is causing this.

    processor       : 1
    vendor_id       : GenuineIntel
    cpu family      : 6
    model           : 15
    model name      : Intel(R) Core(TM)2 Duo CPU E6850 @ 3.00GHz
    stepping        : 11
    microcode       : 186
    cpu MHz         : 2000.000
    cache size      : 4096 KB
    physical id     : 0
    siblings        : 2
    core id         : 1
    cpu cores       : 2
    apicid          : 1
    initial apicid  : 1
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 10
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf eagerfpu pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm pti retpoline tpr_shadow vnmi flexpriority
    bogomips        : 5984.84
    clflush size    : 64
    cache_alignment : 64
    address sizes   : 36 bits physical, 48 bits virtual
    power management:

    No way. Should all IT departments trash a big percentage of their hardware now?
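
    On kernels that carry the backports, the kernel's own view of the mitigations can be
    read from sysfs; a minimal sketch (the l1tf entry only shows up on kernels that include
    the August 2018 L1TF work, such as 754.3.5):

    # one line per known vulnerability, with the active mitigation state
    grep . /sys/devices/system/cpu/vulnerabilities/*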

  • Does downgrading qemu, as I proposed in the other mail, fix it in your case?

    I’m interested because in my case I’m having the issue on two older AMD
    CPUs, not Intel.

    I second that, I really hope this will be fixed.

    Regards, Simon
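
    For reference, the downgrade proposed above can be done with yum, assuming the older
    el6_9 packages are still available in a configured repo (a sketch, not a guaranteed fix):

    yum downgrade qemu-kvm-0.12.1.2-2.503.el6_9.6 qemu-img-0.12.1.2-2.503.el6_9.6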

  • On 30.08.2018 at 06:37, Simon Matter wrote:

    Thanks for pointing to that bug report.

    I downgraded from

    qemu-kvm-0.12.1.2-2.506.el6_10.1.x86_64
    qemu-img-0.12.1.2-2.506.el6_10.1.x86_64

    to

    qemu-img-0.12.1.2-2.503.el6_9.6.x86_64
    qemu-kvm-0.12.1.2-2.503.el6_9.6.x86_64

    and booted into

    kernel-2.6.32-754.3.5.el6.x86_64

    unfortunately with no positive effect.

    The system/host/workstation crashes/panics immediately after starting a guest system (EL6) with virsh start guest-name.

    BTW upstream bug report:

    https://bugzilla.redhat.com/show_bug.cgi?id23692

  • Simon, downgrading does not fix the problem in my case. I upgraded to the 3.5 kernel, disabled libvirtd and libvirt-guests, and rebooted. Once the machine was up, I started the libvirtd service and the machine crashed immediately.

    Old: qemu-img-0.12.1.2-2.506.el6_10.1, qemu-kvm-0.12.1.2-2.506.el6_10.1 (crash on 2.6.32-754.3.5)
    New: qemu-img-0.12.1.2-2.503.el6_9.6, qemu-kvm-0.12.1.2-2.503.el6_9.6 (crash on 2.6.32-754.3.5)

    In Leon's and my case, the culprit is kernel 2.6.32-754.3.5. In CentOS bug 0015067, the bad kernel is 2.6.32-754.2.1, which works fine here. The difference might be how different CPUs are handled.
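
    For anyone wanting to reproduce that test, the EL6 SysV steps are roughly as follows
    (a sketch; libvirtd and libvirt-guests are the service names libvirt ships on EL6):

    chkconfig libvirtd off
    chkconfig libvirt-guests off
    reboot
    # after the box is back up:
    service libvirtd start    # on the bad kernel the host panics around this point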

  • I am going to say, from the chip and OEM manufacturers' point of view: yes. For at least the last 15 years, they have priced their hardware to have a 4-6 year lifetime. Consumers of said hardware are supposed to plan around that with magical budget money and replace their hardware regularly. Since people rarely have said money, we have mostly gotten away with not having to do so, because big things like this don't happen very often.
    [The last big one was all the working hardware people had to get rid of for Y2K.]

    The fixes for the old hardware are going to be problematic for a lot of different reasons (Intel isn't fixing its microcode, backporting deep kernel rewrites to very old kernels tends to crash a lot, etc.). I would recommend one of the following strategies:

    1. Let your budget know that there will be a lot of replacements coming up. Replace hardware as you can.
    2. Make a decision about what your security risk is for this problem, stick to an old kernel (see the sketch after this list), and put virtual systems which match your security risk on the old hardware.
    3. Test a newer kernel/release on the hardware and see whether the problem still occurs. If it does, it is doubtful that a fix can be backported until it is fixed in the newer version. If it doesn't, that might help narrow down where the breakage is.
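
    For option 2, a minimal sketch of pinning a known-good kernel on EL6, assuming the
    yum-plugin-versionlock package is available (an exclude=kernel* line in /etc/yum.conf
    is a cruder alternative):

    yum install yum-plugin-versionlock
    yum versionlock kernel-2.6.32-754.2.1.el6
    grubby --set-default=/boot/vmlinuz-2.6.32-754.2.1.el6.x86_64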

  • Stephen John Smoogen wrote:

    Heh, heh. We’re starting to replace the servers that we got when I was first here… in ’09 and ’10 and ’11. But then, I’m a contractor at a US
    federal gov’t agency in the civilian sector, and budgets, um, right, LOL. Next time someone complains about “waste of tax dollars”, why, just last year, or was it earlier this year, we finally retired a few servers that had actual SCSI drives….

    The only problems we’ve had on the latest C 7 kernels *seem* to be related to a specific Intel chip or two. Otherwise, the older servers work just fine.

    mark

  • On 30.08.2018 at 20:28, Simon Matter wrote:

    It seems that the default bugzilla classification doesn't let bug reports associated with the kernel be read publicly, or some such ... so, to summarize the status briefly: fortunately, they can reproduce the problem and are now trying to find the cause. So, +1!

  • Follow-up: The current kernel (kernel-2.6.32-754.6.3.el6.x86_64) seems to be usable again. No issues so far …
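
    For anyone following along, a quick sanity check before re-enabling guests (a sketch,
    reusing the guest name from earlier in the thread):

    uname -r                  # expect 2.6.32-754.6.3.el6.x86_64
    virsh start guest-name    # should come up without a host panic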