CentOS 5 Grub Boot Problem


I am trying to upgrade my system from 500GB drives to 1TB. I was able to partition and sync the raid devices, but I cannot get the new drive to boot.

This is an old system with only IDE ports. There is an added Highpoint raid card which is used only for the two extra IDE ports. I have upgraded it with a 1TB SATA drive and an IDE-SATA adapter. I did not have any problems with the system recognizing the drive or adding it to the mdraid. A short SMART test shows no errors.

Partitions:
Disk /dev/hdg: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/hdg1 1 25 200781 fd Linux raid autodetect
/dev/hdg2 26 121537 976045140 fd Linux raid autodetect
/dev/hdg3 121538 121601 514080 fd Linux raid autodetect

Raid:
Personalities : [raid1]
md0 : active raid1 hdg1[1] hde1[0]
200704 blocks [2/2] [UU]

md1 : active raid1 hdg3[1] hde3[0]
513984 blocks [2/2] [UU]

md2 : active raid1 hdg2[1] hde2[0]
487644928 blocks [2/2] [UU]

fstab (unrelated lines removed):
/dev/md2 / ext3 defaults 1 1
/dev/md0 /boot ext3 defaults 1 2
/dev/md1 swap swap defaults 0 0

I installed grub on the new drive:
grub> device (hd0) /dev/hdg

grub> root (hd0,0)
Filesystem type is ext2fs, partition type 0xfd

grub> setup (hd0)
Checking if “/boot/grub/stage1” exists… no
Checking if “/grub/stage1” exists… yes
Checking if “/grub/stage2” exists… yes
Checking if “/grub/e2fs_stage1_5” exists… yes
Running “embed /grub/e2fs_stage1_5 (hd0)”… 15 sectors are embedded. succeeded
Running “install /grub/stage1 (hd0) (hd0)1+15 p (hd0,0)/grub/stage2
/grub/grub.conf”… succeeded
Done.

But when I attempt to boot from the drive (with or without the other drive connected and in either IDE connector on the Highpoint card), it fails. Grub attempts to boot, but the last thing I see after the BIOS is the line “GRUB Loading stage 1.5”, then the screen goes black, the system speaker beeps, and the machine reboots. This will continue as long as I let it. As soon as I switch the boot drive back to the original hard drive, it boots up normally.

I also tried installing grub as (hd1) with the same results.

A few Google searches haven’t turned up any hits with this particular problem and all of the similar problems have been with Ubuntu and grub2.

Any suggestions?

Thanks,


  • Bowie Bailey wrote:

    Trying to get your configuration clear in my mind – the drives are 1TB
    IDE, and they’re attached to the m/b, or to the Hpt RAID card?

    Also, did you update the system? New kernel? If so, is the RAID card
    recognized? (We’ve got a Hpt RocketRaid card in a CentOS 6 system, and
    we’re *finally* replacing it with an LSI once it comes in, because Hpt
    does not care about old cards; I had to find the source code, then hack
    it to compile for the new kernel, and have had to recompile for every
    new kernel we’ve installed….)

    mark

  • It was originally a pair of 500GB IDE drives in an mdraid mirror configuration. Right now, I have removed one 500GB drive and replaced it with a 1TB SATA drive with an IDE-SATA adapter. Both drives are connected to the Highpoint card and apparently working fine other than the boot-up problem.

    I was considering adding an SATA card to the system, but I didn’t want to deal with finding drivers for a card old enough to work with this system (32-bit PCI).

    I have not done any updates to the system in quite some time.

  • Bowie Bailey wrote:

    1. Have you, during POST, gone into the Hpt controller firmware and made sure that it sees and presents the new disk properly?
    2. If that’s good, then I’m wondering if the initrd needs a SATA driver, which it may not have, since the old version of your system was all IDE.

    mark

  • It’s possible, but why would that be the case? The only thing that has changed from the OS point of view is the partition size on one of the drives. The filesystems are still the same.

    Also, as I said, it doesn’t even get as far as attempting to boot Linux. It fails immediately after the “GRUB Loading stage 1.5” line, so it seems like a grub issue of some sort.

  • I’m going to guess that there are no IDE drives that have 4096 byte physical sectors, but it’s worth confirming you don’t have such a drive because the current partition scheme you’ve posted would be sub-optimal if it does have 4096 byte sectors.

    In the realm of totally esoteric and not likely the problem, 0xfd is for mdadm metadata v0.9 which uses kernel autodetect. If the mdadm metadata is 1.x then the type code ought to be 0xda but this is so obscure that parted doesn’t even support it. fdisk does but I don’t know when support was added. This uses initrd autodetect rather than the deprecated kernel autodetect. It’s fine to use 0.9 even though it’s deprecated.

    You can use mdadm -E on each member device (each partition) to find out what metadata version is being used.
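
    For example, with the member devices from your mdstat:

    # mdadm -E /dev/hde1 | grep -i version
    # mdadm -E /dev/hdg1 | grep -i version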

    Normally GRUB stage 1.5 is not needed, stage 1 can jump directly to stage 2 if it’s in the MBR gap. But your partition scheme doesn’t have an MBR gap, you’ve started the first partition at LBA 1. So that means it’ll have to use block lists…

    I’m confused. I don’t know why this succeeds because the setup was pointed to hd0, which means the entire disk, not a partition, and yet the disk doesn’t have an MBR gap. So there’s no room for GRUB stage 2.

    Yeah it says it’s succeeding but it really isn’t, I think. The problem is not the initrd yet, because that could be totally busted or missing, and you should still get a GRUB menu. This is all a failure of getting to stage 2, which then can read the file system and load the rest of its modules.

    I’m disinclined to believe that hd0 or hd1 translate into hdg, but I
    forget how to list devices in GRUB legacy. I’m going to bet though that device.map is stale and it probably needs to be recreated, and then find out what the proper hdX is for hdg. And then I think you’re going to need to point it at a partition using hdX,Y.
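
    One thing that might help from a grub shell is the find command, which
    prints every (hdX,Y) that contains the named file (run from the
    userspace grub shell it goes by device.map, so it is only as trustworthy
    as that file):

    grub> find /grub/stage1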

  • Oops. I just reread that this is now SATA. New versions of hdparm and smartctl can tell you if the drive is Advanced Format, and if it is, then I recommend redoing the partition scheme so it’s 4K aligned. And so that it has an MBR gap. The current way to do this is have the 1st partition start at LBA 2048.

  • The partition table was originally created by the installer.

    Version : 0.90.00

    I’m not sure. It’s been so long that I don’t remember what I did (if anything) to get grub working on the second drive of the set. The first drive was configured by the installer.

    What I’m doing now is what I found to work for my backup system which gets a new drive in the raid set every month.

    I’m willing to give that a try. The device.map looks good to me:
    (hd0) /dev/hde
    (hd1) /dev/hdg

    It is old, but the drives are still connected to the same connectors, so it should still be valid.

    How would I go about pointing it at the partition?

    What I am currently doing is this:
    device (hd0) /dev/hdg
    root (hd0,0)
    setup (hd0)

    Would I just need to change the setup line to “setup (hd0,0)”, or is there more to it than that?

    Also, the partitions are mirrored, so if I install to a partition, I
    will affect the working drive as well. I’m not sure I want to risk breaking the setup that still works. I can take this machine down for testing pretty much whenever I need to, but I can’t leave it down for an extended period of time.

  • I tried ‘smartctl -a’ and ‘hdparm -I’, but I don’t see anything about Advanced Format. What am I looking for?

    I can redo the partitions, but I’m not sure how to tell fdisk to start a partition at LBA 2048.

  • Well, the CentOS5 installer and partitioning utility (parted) predate Advanced Format drives. I’d check to see whether you have such a drive.

    Therefore 0xfd is correct.

    I think you need to confirm that the device.map is correct. I just don’t remember the command to figure out the mapping.

    setup (hd1,0)

    It’s hd1 if your device map is correct and hdg is hd1. And then ,0 is for the first partition assuming that’s an ext3 boot partition.
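
    So, assuming the device map really is correct, the whole sequence would
    look something like:

    grub> root (hd1,0)
    grub> setup (hd1,0)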

  • # smartctl -i /dev/hdg | grep -i sector
    Sector Size: 512 bytes logical/physical

    That’s what I get, but it’s an SSD so it’s a lie.

    Let’s figure that out only if it’s an AF disk…

  • I don’t get a “Sector Size” line.

    smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
    Home page is http://smartmontools.sourceforge.net/

    === START OF INFORMATION SECTION ===
    Device Model:     WDC WD10EZEX-60M2NA0
    Serial Number:    WD-WCC3F6AX0119
    Firmware Version: 03.01A03
    User Capacity:    1,000,204,886,016 bytes
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   9
    ATA Standard is:  Not recognized. Minor revision code: 0x1f
    Local Time is:    Wed Aug  5 13:09:16 2015 EDT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

  • Nothing about hd0 or hd1 gets baked into the bootloader code. It’s an absolute reference to a physical drive at the moment in time the command is made. If there is only one drive connected when you initiate this command, then it’s hd0. Almost invariably hd0 is the current boot drive, or at least it’s the first drive as enumerated by the BIOS.

    So long as the drive in question gets a bootloader, it’ll boot regardless of what hdX designation it takes. I’m just not totally convinced the designation is correct here because I really don’t see how ‘setup hd0’ works on a drive that has no MBR gap.

    OK that version predates sector size info. See if your version of hdparm will do it:

    # hdparm -I /dev/hdg | grep -i sector

    That spits out several lines for me, including
    Physical Sector size: 512 bytes

    Another one is:
    # parted -l /dev/hdg | grep -i sector

    I’m willing to bet that physical sector size is 4096 bytes

    I looked this up and found this:
    http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-771436.pdf

    That lists the 1TB as being advanced format. If that’s the correct spec
    sheet then the next question is what is the workload for this drive? If
    it’s just a boot drive and performance is not a consideration then you
    can leave it alone, the drive firmware will do RMW internally for the
    wrong alignment. But if performance is important (file sharing, database
    stuff, small file writes including web server), then this needs to get
    fixed…

  • – Ahh OK now I see why I was confused. The originally posted partition
    map uses cylinders as units, not LBA. I missed that. Cylinder 1 is the
    same as LBA 63. And that is sufficiently large for a GRUB legacy stage 2.

    – OK this is screwy. Partitions 1 and 3 on both drives have the same number of sectors, but partitions 2 differ:

    /dev/hde2 401,625 975,691,709 975,290,085 fd Linux raid autodetect
    /dev/hdg2 401,625 1,952,491,904 1,952,090,280 fd Linux raid autodetect

    That can’t work as these are two partitions meant to form /dev/md2 and need to be the same size.

    – Also, 401625 is not 8 sector aligned. So it’s a double whammy and since it has to be repartitioned anyway you might as well fix the alignment also.

    First off fail+remove hdg2 (you need to confirm I’ve got the devices and commands right here):
    mdadm --manage /dev/md2 -f /dev/hdg2 -r /dev/hdg2
    mdadm --zero-superblock /dev/hdg2

    Using fdisk, delete hdg2, then make a new primary partition (partition
    2) and hopefully figure out how to get it to do LBA rather than CHS
    entry; or use parted, which can, but its UI is totally unlike fdisk. The
    start sector for hdg2 should be 401624, which is 8 sector aligned, and
    the end sector should be 975691708 in order to make it the same size as
    hde2. And change the type to 0xfd.

    Now you probably have to reboot because the partition map has changed.
    I’m not sure if partprobe exists on CentOS 5, but it could be worth a
    shot to see if the kernel picks up the new partition map. Check with
    blkid.
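
    Roughly (both of these may refuse if other partitions on the disk are
    still in use, in which case you are back to rebooting):

    partprobe /dev/hdg
    blockdev --rereadpt /dev/hdg
    cat /proc/partitions | grep hdg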

    And then finally add the “new” device: mdadm --manage /dev/md2 -a /dev/hdg2

    And now it should be resyncing… cat /proc/mdstat

    Something like that. Proofread it!

  • Dumb thought: I don’t remember how, other than from a grub menu, but I’m pretty sure there’s a way to default boot into a grub shell. Once there, you can see, using file completion, the drives, and where your initrd is.

    mark

  • It’s definitely not an initrd problem: a) the failure happens before the
    GRUB menu appears, so it hasn’t even gone looking for an initrd; b) the
    initrd is technically on an array, not a device, and as long as the
    array is sync’d on both devices it’s the same, so since it works on one
    device it should work on the other; and c) it’s v0.9 mdadm metadata,
    which is kernel autodetect, so the initrd doesn’t do the assembly.

    I think once the partition stuff is fixed, and synced, then it will be more reliable to do this because GRUB is after all being pointed to member devices, not the array.

    There might be more luck using this command at command prompt:

    grub-install --recheck /dev/hdg

    See if that repopulates the device.map correctly. It should use /boot
    (/dev/md0) automatically for stage2.

  • That’s because I’m intending to increase the size of that filesystem.
    The raid should work as long as the new partition is at least as big as the old one. Once I get this working, I will remove the original drive and add another 1TB drive so both partitions are the same (larger size)
    and extend the filesystem into the new space.

    But if both hde and hdg are using 401625, then wouldn’t I have to repartition both drives so the sizes match?

    I’m still not sure that this is a partitioning problem. I did not have
    any problems creating the partitions or syncing the three raid devices.

  • Good thought. I went into the grub.conf, commented out the “hiddenmenu”
    option and increased the timeout to 10 seconds. This works if I boot from the original drive, but it doesn’t help with the new drive. It’s not getting that far.

  • Can’t risk killing the system at the moment. I’ll give it a try tomorrow.

    However, I do note that the man page for grub-install has a comment
    about --recheck stating “This option is unreliable and its use is
    strongly discouraged.”

  • Bowie Bailey wrote:
    I *think* what you may have to do is:
    1. use mdadm to remove the new drive from the RAID.
    2. use it to create a new md drive with *just* the new drive.
    3. copy from the remaining old RAID drive to the new.
    4. remove the old RAID drive, then put in a new large drive.
    5. add the new drive to the new array.
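
    Very roughly, for md2, something like this (device names and mount point
    are illustrative, double-check everything against your system, and the
    bootloader, fstab, and grub.conf would need updating for the new array
    too):

    mdadm --manage /dev/md2 --fail /dev/hdg2 --remove /dev/hdg2
    mdadm --zero-superblock /dev/hdg2
    mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/hdg2 missing
    mkfs.ext3 /dev/md3
    mkdir -p /mnt/newroot && mount /dev/md3 /mnt/newroot
    rsync -aHx / /mnt/newroot/
    # later, once the second 1TB drive is in and partitioned:
    # mdadm --manage /dev/md3 --add /dev/<second 1TB drive>2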

    mark

  • Got it. OK then it all comes down to the workload and whether 4KiB
    alignment is worth changing the partitioning.

    If it is, quite honestly I’d just start over with this 1TB drive:
    i.e. fail and remove all three partitions, and then wipe the superblock
    off each partition too. (I don’t know if CentOS 5 has wipefs, but if it
    does, use it with the -a switch, e.g. ‘wipefs -a /dev/hdg[123]’ and then
    ‘wipefs -a /dev/hdg’, which will remove the ext3, swap and mdadm
    signatures and avoid problems down the road.) Then repartition, doing
    two things: start the first partition at sector 2048, and only specify
    the sizes in whole megabytes. Sector 2048 is aligned, and by making each
    partition a whole-megabyte increment, each partition is also aligned.
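
    A rough parted sketch of that (sizes here are just illustrative, and I
    haven’t checked whether the parted shipped with CentOS 5 honors exact
    sector values or snaps to cylinder boundaries, so verify the result with
    ‘fdisk -lu /dev/hdg’ afterwards). The first partition is a ~200MB /boot
    member and the last a ~513MB swap member; only do the mklabel once all
    three members have been failed and removed:

    parted /dev/hdg
    (parted) unit s
    (parted) mklabel msdos
    (parted) mkpart primary ext3 2048 411647
    (parted) mkpart primary ext3 411648 1951430655
    (parted) mkpart primary linux-swap 1951430656 1952481279
    (parted) set 1 raid on
    (parted) set 2 raid on
    (parted) set 3 raid on
    (parted) quit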

    No because as you said, the replacement just needs to be same or larger sized. mdadm does not care if member device partitions have different start sectors.

    It’s not a partitioning problem. But there’s no point in proceeding with
    the bootloader stuff until you’ve settled on the partition layout. If
    you want, in the meantime, you could test whether ‘grub-install
    --recheck /dev/hdg’ is at least accepted, and whether that changes the
    outcome of either the bootinfoscript’s bootloader section or an actual
    boot attempt. The misalignment is a performance penalty, not a
    works-or-doesn’t-work penalty.

  • I never thought I’d say this, but I think it’s easier to do this with GRUB 2. Anyway I did an installation to raid1’s in CentOS 6’s installer, which still uses GRUB legacy. I tested removing each of the two devices and it still boots. These are the commands in its log:

    : Running… [‘/sbin/grub-install’, ‘--just-copy’]
    : Running… [‘/sbin/grub’, ‘--batch’, ‘--no-floppy’,
    ‘--device-map=/boot/grub/device.map’]
    : grub> device (hd0) /dev/vdb
    : grub> root (hd0,1)
    : grub> install --stage2=/boot/grub/stage2 /grub/stage1 d (hd0,1)
    /grub/stage2 p (hd0,1)/grub/grub.conf
    : Running… [‘/sbin/grub’, ‘--batch’, ‘--no-floppy’,
    ‘--device-map=/boot/grub/device.map’]
    : grub> root (hd0,1)
    : grub> install --stage2=/boot/grub/stage2 /grub/stage1 d (hd0,1)
    /grub/stage2 p (hd0,1)/grub/grub.conf

    I do not know why there’s a duplication of the install command. It also
    looks like the way it knows it’s supposed to install both bootloaders
    (stage1 and stage2) to two different devices is with

    device (hd0) /dev/vdb

    I don’t know why the split referencing. (hd0) is /dev/vda and (hd1) is
    /dev/vdb. Weird. But it does work.

    hd0,1 in my case is /boot, but yours is hd0,0 since it’s the first partition. So anywhere the steps above say hd0,1 you probably need hd0,0.
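
    So for your layout the adapted sequence would presumably be something
    like this (untested against your hardware, so treat it as a sketch):

    grub> device (hd0) /dev/hdg
    grub> root (hd0,0)
    grub> install --stage2=/boot/grub/stage2 /grub/stage1 d (hd0,0)
    /grub/stage2 p (hd0,0)/grub/grub.conf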

    Chris Murphy

  • Why “extra”? Are there drives connected to this system other than the two you’re discussing for the software RAID sets?

    I know you said that you can’t take the system down for an extended period of time. Do you have enough time to connect the two 1TB drives and nothing else, and do a new install? It would be useful to know if such an install booted, to exclude the possibility that there’s some fundamental incompatibility between some combination of the BIOS, the Highpoint boot ROM, and the 1TB drives.

    If it doesn’t boot, you have the option of putting the bootloader, kernel, and initrd on some other media. You could boot from an optical disc, or a USB drive, or CF.

  • To be honest, I don’t remember why the Highpoint card was used. It could be that I had originally intended to use the raid capabilities of the card, or maybe I just didn’t want the two members of the mirror to be master/slave on the same IDE channel.

    Doing a new install on the two 1TB drives is my current plan. If that works, I can connect the old drive, copy over all the data, and then try to figure out what I need to do to get all the programs running again.

  • Sounds like a pain. I would just adapt the CentOS 6 program.log commands for your case. That’s a 2 minute test. And it ought to work.

    Clearly the computer finds the drive, reads the MBR and executes stage 1.
    The missing part is that it’s not loading or not executing stage 2 for
    some reason. I’m just not convinced that an incorrectly installed
    bootloader is the source of the problem with the 2nd drive. It’s not
    like the BIOS or HBA card firmware is going to faceplant right in
    between the stage 1 and stage 2 bootloaders executing. If there were a
    problem there, the drive simply wouldn’t show up and no part of the
    bootloader would get loaded.

  • I’m not familiar with that. How would I go about adapting the CentOS 6
    program.log commands?

    Definitely a strange problem. I’m hoping that doing a new install onto these drives rather than trying to inherit the install used on the smaller drives will work better.

  • On 06.08.2015 at 22:21, Chris Murphy wrote:

    On which OS (e.g. C5, C6) was the partition created?

  • The CentOS installer, and parted, predate AF drives, so the partitioning will not be correct with a new installation. There’s no way to get the installer to do proper alignment. You can partition correctly in advance, and then have the installer reuse those partitions though.

  • For the OP, I think it was CentOS 5, but he only said it’s running CentOS 5 now.

    For my test, it was CentOS 6, but that uses the same version of GRUB
    legacy so the bootloader installation method for raid1 disks should be the same.

  • Ok. I’ll give that a try tomorrow. Just a couple of questions.

    install --stage2=/boot/grub/stage2 /grub/stage1 d (hd0,1) /grub/stage2 p
    (hd0,1)/grub/grub.conf

    It looks like this mixes paths relative to root and relative to /boot.
    Did your test system have a separate /boot partition? The --stage2
    argument is “os stage2 file” according to my man page. Should this be relative to root even with a separate /boot partition?

    Also, why are the exact same root and install commands run twice in the log you show? Is that just a duplicate, or does it need to be run twice for some reason?

    grub> root (hd0,1)
    grub> install --stage2=/boot/grub/stage2 /grub/stage1 d (hd0,1)
    /grub/stage2 p (hd0,1)/grub/grub.conf
    grub> root (hd0,1)
    grub> install --stage2=/boot/grub/stage2 /grub/stage1 d (hd0,1)
    /grub/stage2 p (hd0,1)/grub/grub.conf

  • Is that true? If I have a system with two disks, where device.map labels one as hd0 and the other as hd1, and I swap those numbers, the resulting boot sector will differ by one bit.

    My understanding was that those IDs are used to map to the BIOS disk ID. Stage 2 will be read from the partition specified to the grub installer at installation, as in:
    grub> install --stage2=/boot/grub/stage2 /grub/stage1 d
    (hd0,1)/grub/stage2 p (hd0,1)/grub/grub.conf

    Stage 1 and 1.5 will, as far as I know, always be on (hd0), but stage2
    might not be. If the BIOS device IDs specified in device.map weren’t written into the boot loader, how else would it know where to look for them?

  • On 06.08.2015 at 22:43, Chris Murphy wrote:

    C6’s grub has large inode support (and surely a couple of other patches)
    which is missing in C5’s grub.

    Not sure if that is the case here, but it caught my attention while
    reading this thread because we had problems with filesystems that were
    “premastered” on C6 systems in order to install C5 on them (KVM). The
    missing large inode support mentioned above was the problem.

  • Yes.

    I think it’s being treated as a directory because it’s going to access this stage2 file.

    I do not know. The whole thing is foreign to me. But both drives are
    bootable as hd0 (the only drive connected). So it makes sense that the
    configuration is treating this as an hd0-based installation of the
    bootloader to both drives. The part where stage 1 and 2 are directed to
    separate drives must be the ‘device (hd0) /dev/vdb’ command. Again, I
    don’t know why it isn’t either ‘device (hd0) (hd1)’ or ‘device /dev/vda
    /dev/vdb’ but that’s what the log sayeth.

  • Hrmm, I can’t reproduce this in a VM with identical drives. Are you sure
    stage 2 is in an identical location on both drives? That would account
    for a one-bit (or more) difference, since GRUB’s stage 1 contains an LBA
    to jump to, rather than depending on an MBR partition active bit (boot
    flag) to know where to go next.

    Stage 1 cannot point to another drive at all. Its sole purpose is to
    find stage 2, which must be on the same drive. Stage 1.5 is optional,
    and I’ve never seen it get used on Linux, mainly because in the time of
    GRUB legacy, I never encountered an installer that supported XFS for
    /boot.

    https://www.gnu.org/software/grub/manual/legacy/grub.html#Images

  • “device (hd0) /dev/vdb” overrides the data in device.map and instructs the grub shell to examine vdb where subsequent commands refer to (hd0).

    “root (hd0,1)” sets the location where grub will look for the required files, probably stage1, stage2, and e2fs_stage1_5.

    “install …” checks for required files and writes a modified copy of stage1 to the first block of the device following ‘d’. In your case, that should be the first partition on vdb.

    https://www.gnu.org/software/grub/manual/legacy/grub.html#install

    This will do the same thing, except that it will operate on /dev/vda.

    In both cases, stage1’s block list will refer to BIOS device 0 for the location of stage2, so that if the BIOS boots from that drive, grub will load stage2, and then the kernel and initrd from the same drive. That’s not necessarily the case, though. Stage 2 could be on a different drive.

  • OK I did a CentOS 5 installation to raid1s, and it also boots either drive fine. But there’s no log telling me how it installed the bootloader.

    http://ur1.ca/ndhf4

    At offset 0x1B8 you’ll notice 3 bytes of difference, but this is the
    disk signature. What’s interesting is that this is not grub stage 1. It
    is simple parted bootstrap code that looks for the active bit in the MBR
    (the boot flag). When I do this:

    http://ur1.ca/ndhi3

    There’s the grub stage1. It’s embedded in /boot (vda1) and (vdb1). When
    I dd the 1st sector of vda1 and vdb1 to files and diff the files,
    they’re identical. So they’re both pointing to the same LBA on each disk
    for stage 2.
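
    (For the record, that check was roughly:)

    dd if=/dev/vda1 of=/tmp/vda1.first bs=512 count=1
    dd if=/dev/vdb1 of=/tmp/vdb1.first bs=512 count=1
    cmp /tmp/vda1.first /tmp/vdb1.first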

    It really shouldn’t matter though, whether this three step jump method is used, or grub stage 1 is written to each MBR gap and from there to stage 2.

    I might try nerfing the parted and grub stage 1 bootloaders on disk2, and see if the grub shell (which I should still get to from disk 1)
    will let me install grub directly on these two drives properly.

  • Grub’s documentation is slightly unclear about that. Here’s a system where /boot is part of a RAID1 set on /dev/vdc1 and /dev/vdd1:

    [root@localhost ~]# cat /boot/grub/device.map
    # this device map was generated by anaconda
    (hd0) /dev/vda
    (hd2) /dev/vdc
    (hd3) /dev/vdd
    [root@localhost ~]# grub
    Probing devices to guess BIOS drives. This may take a long time.

    GNU GRUB version 0.97 (640K lower / 3072K upper memory)

    [ Minimal BASH-like line editing is supported. For the first word, TAB
    lists possible command completions. Anywhere else TAB lists the possible
    completions of a device/filename. ]
    grub> root (hd2,0)
    root (hd2,0)
    Filesystem type is ext2fs, partition type 0xfd
    grub> setup (hd0)
    setup (hd0)
    Checking if “/boot/grub/stage1” exists… no
    Checking if “/grub/stage1” exists… yes
    Checking if “/grub/stage2” exists… yes
    Checking if “/grub/e2fs_stage1_5” exists… yes
    Running “embed /grub/e2fs_stage1_5 (hd0)”… 27 sectors are embedded.
    succeeded
    Running “install /grub/stage1 d (hd0) (hd0)1+27 p
    (hd2,0)/grub/stage2 /grub/grub.conf”… succeeded
    Done.
    grub>

    I believe the final line can be interpreted as:

    0: install: the install command
    1: /grub/stage1: path to the stage1 file, relative to the root
    2: d: grub will look for stage2_file at the address specified in arg 4
    3: (hd0): grub will be written to the first block of (hd0), currently mapped to /dev/vda
    4: (hd0)1+27: stage1_5 has been embedded at this location. It is being
    used as “stage2_file”
    5: p: the first block of stage2 will be modified with the value of the
    partition where stage2_file is found
    6: (hd2,0)/grub/stage2: the real stage2 image, which the embedded stage
    1.5 will go on to load
    7: /grub/grub.conf: because this arg is present and #4 is really a stage
    1.5, the stage2 config file is patched with this configuration file name.

    If I specify “root (hd3,0)” in the grub shell, the boot loader will differ at 0002032, where it will refer to BIOS device 3 instead of BIOS
    device 2 for the location of /grub/stage2.

    --- vda.2 2015-08-06 16:05:32.039999919 -0700
    +++ vda.3 2015-08-06 16:05:59.441999927 -0700
    @@ -53,7 +53,7 @@
    *
    0001760 \0 \0 \0 \0 \0 \0 \0 \0 002 \0 \0 \0 032 \0 002
    0002000 352 p ” \0 \0 \0 003 002 377 377 377 \0 \0 \0 \0 \0
    -0002020 002 \0 0 . 9 7 \0 377 377 \0 202 / g r u b
    +0002020 002 \0 0 . 9 7 \0 377 377 \0 203 / g r u b
    0002040 / s t a g e 2 / g r u b / g r
    0002060 u b . c o n f \0 \0 \0 \0 \0 \0 \0 \0 \0
    0002100 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0

    …unless I’m mistaken. :)

  • OK I did that and this works.

    ## At the GRUB boot menu, hit c to get to a shell
    grub> root (hd0,0)
    grub> setup (hd0)
    grub> setup (hd1)

    That’s it.

    What I did to test was: I zeroed the first 440 bytes of vdb and the
    first 512 bytes of vdb1. I confirmed that this disk alone does not boot
    at all. After running the above commands, either drive boots.
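
    (Specifically, something along the lines of:)

    dd if=/dev/zero of=/dev/vdb bs=440 count=1
    dd if=/dev/zero of=/dev/vdb1 bs=512 count=1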

    NOW, I get to say I’ve seen stage 1.5 get used because when it did the setup, it said it was embedding /grub/e2fs_stage1_5. In the above case hd0,0 is first disk first partition which is /boot.

    Anyway, this seems about 8 million times easier than linux grub-install CLI. Now I get to look back at OP’s first email and see if he did this exact same thing already, and whether we’ve come full circle.

  • Shit. He did.

    All I can think of is that either the GRUB/BIOS device designations are
    wrong (they should be (hd2) or (hd3); I can’t actually tell how many
    drives are connected to this system when all of this is happening), so
    the bootloader is installing to a totally different drive. Or yeah,
    there is in fact some goofy incompatibility with an HBA where it gets to
    stage 1.5 and then implosion happens. *shrug*

  • I tried the grub commands you gave and still got the same results. I
    also have a copy of the SuperGrub disc, which is supposed to be able to fix grub problems. It can boot the drive, but it can’t fix it. If nothing else, I guess I could just leave that disc in the drive and use it to boot the system.

    I’m going to do a fresh install to the new drives and see if that works.

  • I suppose it’s worth a shot. But like I mentioned earlier, keep in mind that CentOS 5 predates AF drives, so it will not correctly partition these drives such that they have proper 8 sector alignment.

    If you haven’t already, check the logic board firmware and the HBA
    firmware for current updates.

  • The fresh install will be with CentOS 6. A quick test with a minimal install booted without any problems, so it looks like this is the solution.

    I try to avoid firmware updates on established systems unless absolutely necessary. Since the CentOS 6 install produces a bootable system, I’m going to leave it as-is.