Lockups With Kernel-2.6.32-358.0.1.el6.i686


I updated my home server with the 6.4 CR packages, and I’ve experienced 3 or 4 hard lockups since. The server is a fanless VIA C7 “CentaurHauls” system with a 1GHz CPU underclocked to 800MHz and 1GB of RAM. It has a dual-port Intel 82546GB NIC in its single PCI slot. (It also has an on-board Realtek RTL-8110SC/8169SC NIC that is plugged in, but doesn’t currently have an IP address configured.)

This server provides a number of services — DNS, DHCP, routing between VLANs, DLNA media server, CUPS, etc. Most importantly, it runs Asterisk and manages all of the phones in the house.

There’s absolutely nothing in the logs related to the lockup. The system simply becomes totally unresponsive, to the point that the console cursor stops blinking. A hard reset is required to bring it back.

kernel-2.6.32-279.22.1.el6.i686 seems to be completely stable.

I don’t really expect to be able to figure this out, but I thought I’d post here to see if anyone else is experiencing anything like this with this kernel.

Thanks!

25 thoughts on - Lockups With Kernel-2.6.32-358.0.1.el6.i686

  • I’m running 2.6.32-358.0.1 on a KVM virtual machine and not seeing any issues. I haven’t run that kernel on physical hardware yet, though.

  • Wow. I’m trying to troubleshoot a very similar problem. I was convinced that it was hardware, but I’m beginning to exhaust my hardware troubleshooting skills.

    I’m running an Asus M5A99X EVO 2.0, Asus GeForce GTX 660, and AMD 8150 CPU, 32G RAM, Corsair 850W PS. Randomly I get a complete lockup. Mouse freezes, network dies, etc.

    Same here. No log messages, just a complete freeze. At first I suspected some PulseAudio glitches because of thousands of messages in the log. Then I suspected the proprietary NVIDIA graphics driver, then thought it might be the power supply. I’ve since swapped out every component with no improvement. It can sometimes go for hours without a problem; sometimes it locks up within a minute of a reboot.

    Have you enabled your thermal sensors? Do you have any messages in the kernel log?
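For anyone wanting to check the thermal sensors mentioned above, here is a minimal sketch. It assumes Python 3 and that the kernel exposes thermal zones under /sys/class/thermal; on some boards the readings only appear under hwmon after the sensor drivers are loaded, in which case this returns nothing. The function name `read_thermal_zones` is made up for illustration.

```python
import glob
import os


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_name: (sensor_type, degrees_celsius)} for each readable zone."""
    readings = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "type")) as f:
                sensor_type = f.read().strip()
            with open(os.path.join(zone, "temp")) as f:
                # The kernel reports this value in millidegrees Celsius.
                celsius = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or parsed
        readings[os.path.basename(zone)] = (sensor_type, celsius)
    return readings


if __name__ == "__main__":
    zones = read_thermal_zones()
    if not zones:
        print("no thermal zones exposed; sensor drivers may not be loaded")
    for name, (sensor_type, celsius) in zones.items():
        print(f"{name} ({sensor_type}): {celsius:.1f} C")
```

Running something like this from cron every minute and appending the output to a file would at least show whether temperatures were climbing just before a freeze.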

  • I have 2 of these motherboards (ASUS M5A99X EVO R2.0) that I am using in CentOS development and testing. I am not seeing this issue. I have “M5A99X EVO R2.0 BIOS 1503” dated “2013/01/31 update”.

    Do you have the latest BIOS?

  • Thank you for your reply.

    Yes, latest BIOS installed. I have 2 of these also with similar configurations except for the NIC. One works perfectly; the other has constant freezes. The working one has a slightly older BIOS, so I’m thinking of downgrading the glitchy one.

    As far as logging goes, any idea what sort of failures could cause such a lockup? I.e., if memory was failing, would the system still be able to log? As the mouse is frozen and kernel sysrq has no effect, I’m still leaning towards hardware but literally everything except the case has been swapped out. (Well.. let me qualify that.. Everything but the 64GB SSD drive has been swapped but it seemed unlikely that a drive failure could cause such a lockup. Incorrect assumption?)

  • Kwan Lowe wrote:

    No ideas… and I’ve had a number of systems do this, over the last couple years, where someone noted it had stopped responding; I go down, and it doesn’t respond *at* *all* when I plug in a monitor & keyboard, and power cycling’s the only answer.

    Thinking about it, I believe it’s mostly been on our Penguin servers, and that co. uses Supermicro m/b’s, and we’ve had h/w problems with them, also, and have had several m/b’s replaced under warranty.

    mark

  • Nearly every time we’ve had lockup problems it has come down to bad or failing memory.

    I’ve even had memory cause problems where it would pass a quick memtest but ultimately would fail if you left it running the tests overnight.

  • Gerry Reno wrote:

    Right, but I’ve always *seen* error messages, dmesg, and, if mcelogd is actually working (I can’t figure out why it seems to on some machines, and not on others, or why it doesn’t keep running), it’s in there. The times we’ve had lockups, there’s been nothing.

    mark

  • Thank you for your reply.

    I was leaning towards memory after swapping the power supply did not solve the problem. There are 4 8GB DDR3 sticks, so I took out two and ran with 16G. It still failed. I then swapped that out for the other 16GB. Still failed. What I haven’t tried is to downclock the memory to a slower speed but will try that tonight if the BIOS supports it.

  • That’s the frustrating thing.. Not a single error message. It also appears unrelated to system load as I went through 4 hours of the Phoronix test suite that pegged all 8 cores, Unigine Valley benchmark for several loops, memtest.. All passed. But at night it locked up when there was no load.

  • Sure sounds like a memory related lock up since you’ve ruled out the power supply.

    Are you able to boot the system with memory in the second pair of slots?

    If it’s not memory related (test this memory in another system) then it is probably a motherboard failure. I’ve seen weird symptoms where the system will boot fine, but once the Linux kernel begins to build its cache it triggers a lock up/throws an exception.

    In that case the memory controller was probably going bad, so that ancient system got thrown out (it was not in production). The system previously ran a proprietary Linux 2.2 kernel, and a 2.4 or 2.6 kernel would cause it to wig out. Differences in how 2.2 and 2.4/2.6 kernels allocate memory really brought out the problem in that system!

    But to be sure, run a memtest overnight on the original 4x8GB RAM as has been recommended by others.

  • SilverTip257 wrote:

    I lean towards the m/b failing. Btw, the Penguins I’ve mentioned that had m/b’s replaced – most of them, we can run a *user* program (parallel processing using torque, very heavy duty scientific computing), and it will crash the system, through reboot, repeatably. We’ve shipped them back, and they wind up replacing the m/b.

    mark

  • :) I’ve also swapped the motherboard. *Every* component except for the case and the SSD boot drive has been swapped. This is going on now for almost two weeks.

    I will try your suggestion of trying a separate set of banks in the off chance that those slots are faulty.

  • A diagnostic board should, at least, limit the search space. Characterizing the type(s)/point(s) of failure may make it possible to handle them more gracefully.

  • Kwan Lowe wrote:

    Oh, that’s *bad*. We had a server like that, a Dell (fortunately): three replacement m/b’s (one was DoA), and we *still* couldn’t get it to boot, so they offered us a newer server as a replacement (all within two weeks, and that *includes* the three days that FE’s showed – that’s why I’m glad it was Dell).

    mark

  • I had one a few years ago where it took about 3 days for memtest to catch the bad RAM but even after fixing that there were random crashes. Turned out that the bad RAM had caused some disk corruption which was partly hidden by raid1 mirroring. Once in a while a program block read would hit the bad copy, but when you look for it everything looks OK…

  • Let me tell you about one very stable system that was not stable the other day. It was locking up about every half hour after running stable for years. It turned out that the temperature was not monitored on this system, the CPU fan got angry about this fact, stopped working, and the system was getting hot. After replacing the fan you might think *problem solved*, but nah. It kept locking up. It turned out that an adapter for the power supply had a loose contact. Do you think that loose contact could have been introduced while fixing the fan?

  • The board has a couple of buttons on it to find the best memory timings, etc.

    The button is labeled Memory OK! .. and is on the top right corner of the board.

  • Just a wild idea: is the NIC in the system that freezes a Broadcom and in the other system something else? If so, disable_msi=1 may help.

    Steve

  • I’m running on the second bank now. I ran into a snag running mcelogd however (processor might not be supported). It appears that the CPU is not supported even after enabling the CONFIG_EDAC_MCE and CONFIG_EDAC_AMD64 in the /boot/config-xxx.. The error sometimes takes a few hours to occur so will use this system throughout the night to try to catch the failure.

    Starting mcelog daemon [FAILED]
    AMD Processor family 21: Please load edac_mce_amd module. CPU is unsupported
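As a rough way to see what the running kernel actually provides, the sketch below reads /boot/config-<release> (which records how the kernel was built) and /proc/modules (which lists loaded modules). It assumes Python 3; the helper names `parse_kernel_config` and `module_loaded` are made up for illustration, and the option and module names are the ones quoted in this comment and in the mcelog message.

```python
import os
import platform


def parse_kernel_config(text):
    """Map CONFIG_* names to their values ('y', 'm', ...); unset options are absent."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment lines include "# CONFIG_FOO is not set"
        name, sep, value = line.partition("=")
        if sep:
            options[name] = value
    return options


def module_loaded(name, proc_modules_text):
    """True if `name` is the first field of any line in /proc/modules output."""
    return any(line.split()[0] == name
               for line in proc_modules_text.splitlines() if line.strip())


if __name__ == "__main__":
    config_path = "/boot/config-" + platform.release()
    if os.path.exists(config_path):
        with open(config_path) as f:
            opts = parse_kernel_config(f.read())
        for opt in ("CONFIG_EDAC_MCE", "CONFIG_EDAC_AMD64"):
            print(opt, "=", opts.get(opt, "not set"))
    if os.path.exists("/proc/modules"):
        with open("/proc/modules") as f:
            print("edac_mce_amd loaded:", module_loaded("edac_mce_amd", f.read()))
```

If an option shows as not set, or the module is built but not loaded, that would at least narrow down why mcelog refuses to start.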

  • NICs are now both ThinkPenguin cards with an Atheros chipset.. At this point, the systems are identical except that the failing one has an even bigger PSU than is needed (I calculated 650W required and had an 850W in there… Now it’s a 1200W :D ).

  • Well.. Looks like my hardware problems were only superficially the same as yours. After fighting it for two weeks, I got the second replacement motherboard in on Tuesday. Swapped it out and it has been rock solid stable since then. At some point I may try bringing up the BIOS to the same version as on the failed board if someone has a similar problem, but for now it’s staying at the back rev version.
