Server Fails To Boot

First, some history. This is an Intel motherboard and processor, some six years old, initially running CentOS 6. It has 4 x 1TB SATA drives set up as two mdraid RAID 1 mirrors. It has performed really well in a rural setting with frequent power cuts, which the UPS has dealt with by shutting the server down automatically after a few minutes and restarting it when power is restored.

The clients needed a Windoze server for a proprietary accounting package they use, so I recently installed two 500GB SSDs, also in a RAID 1 mirror, and installed CentOS 7 as the host with VirtualBox running Windoze 10. The hard drives continue to hold their data files.

This appeared to work just fine until a few days ago. After a power cut the server would not reboot.

It took a while to get in front of the machine and attach a monitor, keyboard and mouse, only to find:

Warning: /dev/disk/by-id/md-uuid-xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx does not exist

repeated three times, one for each of the /, /boot and swap raid member sets, along with a

Warning: /dev/disk/by-uuid/xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx does not exist

for /dev/md125, which is the actual RAID 1 / device.

The system is sitting in a root shell of some sort, as it has not made the transition from the initramfs to the mdraid root device.
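
From that shell I can at least poke around; something along these lines (a sketch, assuming mdadm and the usual tools were built into the initramfs):

    cat /proc/mdstat                        # were any md arrays assembled at all?
    ls -l /dev/disk/by-id/ | grep md-uuid   # do the links dracut is waiting for exist?
    mdadm --assemble --scan                 # try assembling the arrays by hand
    cat /run/initramfs/rdsosreport.txt      # the long log dracut writes in its emergency shell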

There are some other lines of information and a text file with hundreds of lines of boot information (presumably dracut's /run/initramfs/rdsosreport.txt), ending with the above messages, as I recall.

I tried a reboot: same result. I rebooted and tried an earlier kernel: same result. I then tried a reboot to the recovery kernel and all went well. The system comes up, all RAID sets are up and in sync, and there are no errors.

So, no apparent hardware issues and apparently no mdraid issues, yet none of the regular kernels will now boot.

Running blkid shows all the expected mdraid devices, with the UUIDs from the error messages all in place as expected.
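
For reference, that check was along these lines (a sketch; the mdadm lines are additional cross-checks I can also run from the working system):

    blkid | grep -i raid      # raid member partitions and their UUIDs
    blkid /dev/md125          # UUID of the assembled / array
    mdadm --detail --scan     # array UUIDs as mdadm itself reports them
    cat /etc/mdadm.conf       # whatever (if anything) is recorded for the arrays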

I did a yum reinstall of the most recent kernel, thinking that might repair any problems in the /boot file system, particularly the initramfs, but it made no difference: it will not boot and gives exactly the same error messages.
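
For the record, the reinstall was along these lines; the lsinitrd check is something I can still run from the working system to confirm that mdadm and the raid1 module actually made it into the rebuilt image (a sketch, with <kernel-version> standing in for the real version string):

    yum reinstall kernel-<kernel-version>
    lsinitrd /boot/initramfs-<kernel-version>.img | grep -iE 'mdadm|raid1'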

Thus I now have it running on the recovery kernel, with all the required server functions being performed, albeit on an out-of-date kernel.

Googling turned up one solved problem similar to mine, but the solution there was to change the BIOS from AHCI to IDE. That does not seem right, as I have not changed the BIOS, although I have not checked it at this time.

Another suggestion talks about a race condition, with the md raid not being ready when required during the boot process, and proposes adding a delay to the kernel boot line in grub2, although no one indicated that this actually worked.
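
If I understand that suggestion correctly, it amounts to something like the following (a sketch I have not tried; rd.retry is dracut's device-wait timeout, and I am assuming that is the parameter the suggestion meant):

    vi /etc/default/grub                      # append e.g. rd.retry=60 to GRUB_CMDLINE_LINUX
    grub2-mkconfig -o /boot/grub2/grub.cfg    # regenerate the config so all entries pick it up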

Another proposed solution is to mount the failed devices from a recovery boot and rebuild the initramfs. Before I do this, I would like to ask those who know a little more about the boot process: what is going wrong? I can believe the most recent initramfs being a problem, but all three other kernels too? Yet the recovery kernel works just fine.
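
If the answer does turn out to be damaged initramfs images, I assume the rebuild from the currently working boot would look roughly like this (a sketch, untested here; the rescue image is separate and is left alone):

    # rebuild the initramfs for every installed kernel from the working boot
    for kver in $(rpm -q kernel --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n'); do
        dracut -f /boot/initramfs-${kver}.img ${kver}
    done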

As the system is remote, I would like some understanding of what is going on before I make any changes; if a reboot fails, it will mean another trip.

Oh, one other thing: it seems the UPS is not working correctly, so the server may not have shut down cleanly. I am working on replacing the UPS batteries.

TIA for your insight.

2 thoughts on - Server Fails To Boot

  • https://bugzilla.redhat.com/show_bug.cgi?id=1451660

    It sounds like your kernels aren’t assembling the RAID device on boot, which *might* be related to the above bug if one of the devices is broken.  It’s hard to tell from your description.  You mentioned that the rescue kernel boots, but I wonder if the array is degraded at that point.

    Otherwise, you might remove “rhgb” and “quiet” from the kernel boot parameters and see if there’s any useful information printed to the console while booting a recent kernel.
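
    One way to strip those arguments persistently from all the installed kernel entries might be grubby (a sketch; a one-off edit of the entry at the GRUB menu works for a single test boot too):

      grubby --update-kernel=ALL --remove-args="rhgb quiet"   # remove rhgb and quiet from every entry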

  • It does appear that the later kernels are not assembling the raid devices (/, /boot, swap), or at least are not setting the links /dev/md/root,
    /dev/md/boot and /dev/md/swap, which then causes dracut to fail.

    I have no idea why the rescue kernel boots just fine, although it does not establish the above links either; rather, it sets up the links
    /dev/md/:{boot,root,swap} pointing to the assembled /dev/md125 and so on.
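
    The comparison I am looking at is roughly this (a sketch; the link names are as I described them above):

      ls -l /dev/md/              # which links the rescue kernel created, and where they point
      readlink -f /dev/md/:root   # resolves to the assembled array, /dev/md125 here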

    My particular problem is: how do I get it to boot the later kernels?
    What should be my repair process?

    I have tried booting with rhgb and quiet removed and got no additional information.

    BTW, once booted, cat /proc/mdstat gives:

    Personalities : [raid1]
    md57 : active raid1 sdb7[1] sda7[0]
          554533696 blocks super 1.2 [2/2] [UU]

    md99 : active raid1 sdd[1] sdc[0]
          976631360 blocks super 1.2 [2/2] [UU]

    md121 : active raid1 sdb2[1] sda2[0]
          153500992 blocks [2/2] [UU]

    md120 : active raid1 sda3[0] sdb3[1]
          263907712 blocks [2/2] [UU]

    md125 : active raid1 sde1[0] sdf1[1]
          478813184 blocks super 1.2 [2/2] [UU]
          bitmap: 3/4 pages [12KB], 65536KB chunk

    md126 : active raid1 sde2[0] sdf2[1]
          1046528 blocks super 1.2 [2/2] [UU]
          bitmap: 0/1 pages [0KB], 65536KB chunk

    md127 : active (auto-read-only) raid1 sde3[0] sdf3[1]
          8382464 blocks super 1.2 [2/2] [UU]

    unused devices: <none>

    No degraded raid devices ...
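
    One more cross-check that seems worth making before touching anything (a sketch; it assumes the regular kernel entries carry rd.md.uuid= arguments, which I have not yet confirmed):

      mdadm --detail --scan                  # array UUIDs as the running system sees them
      grep rd.md.uuid /boot/grub2/grub.cfg   # array UUIDs the regular entries tell dracut to wait for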