Fixing Grub/shim Issue CentOS 7


After trying several paths, some suggested on this list, here are my results.

1) Fixing an unbootable system wasn’t practical in my case. Fortunately, all my systems can be rebuilt from scratch.

2) When I was lucky enough to catch an updated system before reboot, backing out the defective updates wasn’t possible. Yum said there were no prior versions.

3) The most reliable method I found for CentOS 7 was:
– Re-install from scratch (luckily, my data files were safe and restorable)
– Before running any updates, apply the fix suggested by Red Hat and exclude updates to grub2, shim and mokutil (see the sketch after this list).
– Without the above ‘exclude’, the system became unbootable after a yum update, even though the corrected versions of shim should have been loaded.
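
A minimal sketch of that exclusion, assuming the stock yum setup and the package names above, is an exclude line in the [main] section of /etc/yum.conf:

    [main]
    exclude=grub2* shim* mokutil

With this in place, yum update skips those packages; once the fixed builds are confirmed good, delete or comment out the line and update again.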

The system I’m dealing with is CentOS 7. I can easily rebuild it from scratch and test stuff without losing crucial data, if that would be helpful.

4) I haven’t experimented yet with CentOS 8 because the hardware is remote and requires me to get a friend involved to help. My local hardware is not supported by CentOS 8, so it will remain on CentOS 7
until I replace the hardware or switch to a different Linux.

David

35 thoughts on - Fixing Grub/shim Issue CentOS 7

  • On 03/08/2020 at 19:24, david wrote:

    Hi,

    Just back from a hiking trip. One of my clients sent me a message that his CentOS server refuses to boot. So tomorrow I have to drive there to figure out what’s going on. I guess there’s a high probability it’s the issue discussed in this thread.

    Simple question: besides a tsunami of mailing list and forum messages, is there some to-the-point, reliable information about this mess, and about how to fix it?

    Thanks,

    Niki


    Microlinux – Solutions informatiques durables
    7, place de l’église – 30730 Montpezat
    Site : https://www.microlinux.fr
    Blog : https://blog.microlinux.fr
    Mail : info@microlinux.fr
    Tél. : 04 66 63 10 32
    Mob. : 06 51 80 12 12

  • Hi all,

    I had the same problem on my UEFI machine, and here is how I fixed it for CentOS 7:

    1) Boot from a rescue Linux USB stick

    2) When the rescue system is running:

        2.1) # chroot /mnt/sysimage

    3) Configure the network:

        3.1) # ip addr add X.X.X.X/X dev X

        3.2) # ip route add default via X.X.X.X    <--- default router

    4) And finally:

        # yum downgrade shim\* grub2\* mokutil
        # exit
        # reboot

    I hope you can fix it with these steps.

    On 4/8/20 at 0:56, Nicolas Kovacs wrote:

  • As there are updated and working packages available now, downgrading is no longer needed; another update will also work.

    # yum makecache
    # yum upgrade

    You should see a shim-x64 package with version 15.8, which is the working version (15.7 caused the problem).
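
    A quick way to confirm what actually got installed (a minimal check, assuming the x86_64 package name used above):

        # rpm -q shim-x64

    If the output still shows the 15.7 build, repeat the upgrade before rebooting.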

  • While I appreciate the thought behind this step in the instructions, and I thank you for the post, which will be useful to those running fairly traditional servers, there are numerous cases where this simply will not bring up a network while booted into the rescue-mode chroot.

  • The issues should now be resolved.

    If you just mount /mnt/sysimage, set an IP address and upgrade (to get the new shim), then:

    yum reinstall

    Everything should just work.
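
    A minimal sketch of that sequence from the rescue environment, assuming the installed system is mounted at /mnt/sysimage, networking is already configured (as in the earlier post), and the package to reinstall is the kernel (as the follow-up replies indicate); a full "yum upgrade" works too:

        # chroot /mnt/sysimage
        # yum upgrade shim\* grub2\* mokutil
        # yum reinstall kernel
        # exit
        # reboot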

  • Once upon a time, Johnny Hughes said:

    I’m curious – why does the kernel need to be reinstalled? The shim-x64
    package installs its files directly to the EFI partition where they are needed.

  • That is the easiest way for the initrd to be rebuilt .. which is what created the unbootable issue in the first place. At least in some circumstances.

    You can also regenerate your initrd manually after installing the shim (see the sketch after this reply).

    This is IF you are already in a failed boot condition from the bad install on Friday.

    If you are doing the upgrade/install now from a bootable system, all you need to do is a normal update.
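
    A minimal sketch of regenerating the initrd manually with dracut, assuming you are on the (now bootable) installed system; from a rescue chroot, substitute the installed kernel version for "uname -r":

        # dracut -f /boot/initramfs-$(uname -r).img $(uname -r)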

  • On 04/08/2020 at 08:31, lpeci wrote:

    Thanks for the detailed suggestion.

    Unfortunately I couldn’t recover the installation, and I had to redo everything from scratch, which cost me the first two days of my holidays.

    One thought regularly kept crossing my mind:

    “How on earth could this have passed Q & A ?”

    :o)

    Cheers,

    Niki



  • Now you know how the Window$ admins have suffered all these years :-)

    Quite simple, I guess. It’s one of those cases you cannot test as easily as other updates. Here you have to test on real hardware, different hardware of all kinds, and you cannot do it in the cloud, a VM or whatever.

    The only real solution I can think of to prevent this would be to make preview versions of updates available to the public so that a lot of people can test them on their hardware, hopefully spare hardware, and give feedback.

    I think current business models do not support such an approach these days.

    However, one can find strategies to survive. What I do is:

    * Never update any system directly from what is provided online. Sync to local repositories first to control what is fed to your systems (see the sketch after this message).

    * Never blindly apply updates. Always test on less important systems or dedicated test systems first.

    * If all goes well, update the important systems. If you have multiple systems, update only one first as another test. Then update the others.

    I have learned my lessons over the past decades, but this was a good wake-up call to follow the above rules even more strictly. Better safe than sorry.

    Regards, Simon
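
    A minimal sketch of the "sync to local repositories first" approach on CentOS 7, assuming the reposync and createrepo tools (from the yum-utils and createrepo packages) and a hypothetical local path /srv/repos served to the clients:

        # reposync -r updates -p /srv/repos
        # createrepo /srv/repos/updates

    The clients then point a .repo file at the local copy (for example baseurl=file:///srv/repos/updates) and only see new packages once they have been synced and tested.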

  • A practical equivalent is simply to avoid applying updates for a week to see if someone else gets burned by them. I already wait for a weekend so I don’t disrupt work in case a catastrophe happens, and I wait at least a week and watch this list for any reports of disaster. So I haven’t experienced this one. Let the impatient do your testing for you.

  • While this worked for me, it might not work for you…

    My “solution” was to boot the previous kernel, which came up just fine, then:

        # yum remove kernel.xx.yy.zz
        # yum install kernel.xx.yy.zz

    which rebuilds the initrd, and voilà!

    Fred

  • Well, I mean that would be a valid point if it happened for every install. The issue did not happen on every install. There is no way to test every single hardware and firmware combination for every single computer ever built :)

    It would be great if things like this did not happen, but with the universe of possible combinations, I am surprised it does not happen more often.

    We do run boot tests of every single kernel for CentOS. The RHEL team runs many more tests for RHEL. But every possible combination from every vendor can’t possibly be tested. Right?

  • On 07/08/20 at 08:22, Johnny Hughes wrote:

    Hi Johnny, Niki’s question reflects a widespread and legitimate thought among many, many users, so don’t see this as an attack. Many users really did wonder “was this tested before release?”, and I think many of us are incredulous at what happened in CentOS and upstream (especially upstream), but as you said, CentOS inherits RHEL bugs. I’m reading about many users who lost their trust in RH after the last two problems (microcode and shim). This is bad for CentOS.

    Probably many users did not update their machines between the release of the bug and its resolution (thanks for your fast fix over the weekend), and many update their CentOS machines every two months (if not less often). I also think that many users in the CentOS user base have not voiced their disappointment or reported the issue on this list or in other channels. For example, I simply updated at the wrong time.

    You are right, but isn’t UEFI a standard, and shouldn’t it work the same across vendors? I ask this because this patch broke all my UEFI workstations.

    While the CentOS team may not have the resources to run this type of testing, it would be great to know what happened to RHEL QA (RH being a giant) for this release and, given the partnership between CentOS and RH, whether you know something more about this…

    Thank you.

  • On 07/08/2020 at 09:40, Alessandro Baggi wrote:

    I’m using yum-cron to keep all my servers updated on a daily basis (see the sketch after this message).

    And my question “How could this have passed Q & A” was obviously directed at Red Hat… and *not* at Johnny Hughes and the CentOS team who do their best to deliver the best possible downstream system. I raise my morning coffee mug to your health, guys.

    Cheers,

    Niki


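    For reference, a minimal sketch of the yum-cron settings that make this work on CentOS 7, assuming the stock /etc/yum/yum-cron.conf (the option names come from that file; the shipped defaults may differ):

        [commands]
        update_cmd = default
        download_updates = yes
        apply_updates = yes

    with the service enabled via:

        # systemctl enable yum-cron
        # systemctl start yum-cron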

  • It would be nice if it did .. however, this worked on many UEFI/Secureboot machines. It did not work on a small subset of machines.

    I have not seen the full post-event account of what actually happened. I do know that many Red Hatters worked many hours over the last weekend to fix it. I am sure a public post will be made (if not already there) .. if someone knows where it is, post a link.

    If I don’t see it posted soon, I’ll look for it and post here.

  • I can assure you .. a BUNCH of testing was done. Because of the scope of this update, the CentOS team was looped in during the embargo stage (we normally are not .. Red Hat Engineering got permission to make this happen for this issue). Normally we see things that are open source only .. not embargoed content. Once the embargo gets lifted, the items become open source. Kudos to the RH team for making this happen.

    The CentOS team worked with the RHEL team on this update for several days (more than a week, for sure, maybe 2 weeks)

    I gained MUCH respect for all those guys .. especially Peter Jones. He is Mr. Secure Boot.

    I personally tested both the c8 and c7 solutions on several machines (all I have access to, actually, including several personal machines that have secureboot). I saw some of the testing that happened on the RHEL side. It was extensive.

    Microsoft, Debian, Ubuntu and others also had issues with this .. so if you are losing trust, you are losing it with all OS vendors WRT this issue.

    All I can say is .. this issue was the hardest thing I have been involved with since starting with the CentOS Project 17 years ago.

    Obviously, everyone involved in this build would have prevented this from happening if they could have. Secureboot is complicated.

  • I’ll just add to Johnny’s already comprehensive reply. As a member of the CentOS QA team, I personally tested the update on 3 physical machines and all worked fine. Moreover, the QA team was not able to replicate the issue on a single physical machine available to them – the first indication of a problem came from public reports. We give up a huge amount of our personal time and resources to ensure CentOS (and RHEL) are the very best products they can be. I’m unsure what more could have been done.

  • Thanks Phil,

    I very much appreciate all you and the rest of the QA team do.

    I know it is a knee-jerk reaction to say .. how did that not get caught? I actually said it MYSELF for this very issue. But looking back, I am not sure how we could have caught it.

    “Stuff Happens” :)

    There are just a huge number of possible combinations.

  • Once upon a time, Alessandro Baggi said:

    The great thing about standards is there’s so many to choose from! Also relevant: https://xkcd.com/927/

    UEFI has gone through a number of revisions over the years, and has optional bits like Secure Boot (which itself has gone through revisions). Almost any set of standards has undefined corners where vendors interpret things differently. Vendors also have bugs in weird places sometimes.

    The firmware and boot loaders are arguably the least “exercised” parts of a system – both change rarely and there are few implementations. There are not many combinations, and they don’t change a lot.

    I’m interested to read about the cause of this issue – something like this can be a lesson on “hmm, hadn’t thought of that before” type things to watch for in other areas.

  • I go with the lines from the Pirates of the Caribbean movie .. it is less of a rigid code and more a set of guidelines. Computer programmers are a surly lot, and most take any MUST/SHALL in a standard as a personal challenge to make it pass a test, but do so in an interesting way.

  • If you ask me, I think the real root of the problem is that the UEFI/Secure Boot developers didn’t know KISS – or they forgot about it. Once such a beast is born, you cannot handle it correctly no matter how hard you try.

    Regards, Simon

  • Crowd testing? Feed the green bananas to the crowd and let them ripen. It works well for some of the biggest software companies :-)

    At least it could make sense for directly hardware-related stuff like the kernel, boot loader, firmware/microcode and similar.

    Regards, Simon

  • On 07/08/2020 at 11:01, Johnny Hughes wrote:

    In my head I’ve filed this under the “sh*t happens” category. Bad luck this happened on the first day of my holiday, so I had to cancel a hiking trip. :o)

    This being said, rest assured my confidence in the CentOS project is still 100% intact. On a side note, I’ve just published my third book about CentOS here in France.

    Keep up the good work,

    Niki



  • On 07/08/20 at 14:53, Johnny Hughes wrote:

    Hi Johnny,

    What is the current status of the notification tool for security updates on C8? Is there any chance we will soon get announcements on the mailing list for EL8?

    It would be great to have the tool working.

    Thank you.

  • Hi Alessandro,

    Compared to Microsoft, both RH and SuSE are awesome. You always need a patch management strategy with locked repos (Spacewalk/Pulp) that can be tested on less important systems prior to deployment to production. Keep in mind that Secure Boot is hard to deploy in virtual environments, and thus testing is not so easy.

    Of course, contributing to the community is always welcome.

    Best Regards, Strahil Nikolov

    On 7 August 2020 at 10:40:01 GMT+03:00, Alessandro Baggi wrote:

  • On 07/08/20 at 17:39, Leon Fauster via CentOS wrote:
    Hi Leon,

    So we won’t have announcements soon. Until that happens, why not push them to the list manually?