Problem With Machine “freezing” For Short Periods


I have two HP dc7800 convertible minitowers that are exhibiting the following issue: every 5-10 minutes, they will “freeze” for about 30 seconds, and then pick right back up again. During the freeze, it seems that nothing at all happens on the system; the clock doesn’t even advance (it just picks up again with the next second, and that 30-or-so seconds are lost).

I’ve tried both CentOS 5.8 and 5.7, thinking it was a kernel incompatibility, but the problem happened with both versions. I have tried different hard drives, different memory, even swapped the entire machine, and the problem exists everywhere. I have tried adding “pci=nommconf” to the kernel line, as that was reported as being necessary back with 5.2 on these machines, but that made no difference (and shouldn’t be necessary now, anyway, as I believe the issue has either been fixed or worked around).

I am stuck, and can’t figure out where to even suspect the problem might actually be. There are no errors getting logged anywhere that I can find, probably because everything just “stops” temporarily, so there’s nothing for the system to log.

Does anyone have any idea where I could look to fix this? I think I am next going to go back to 5.2, where the pci=nommconf is necessary, because at least back that far it appears to have been working for other people. However, I really would like to have this running 5.8.

Thanks!


  • Vanhorn, Mike wrote:

    When you say “swapped the entire machine”, what did you do? Also, what’s running on them? Have you tried running top -d 10, or with a smaller value? (That will update the screen every 10 seconds; I only recently found that current top allows tenths of a second.)

    If you don’t see anything, I’d suggest you call HP, assuming they’re still under warranty.

    mark

  • I’ve several HP dc7x00 machines, and I’ve never seen that problem with CentOS 5 or 6.

    Do you also see the problem if you boot in runlevel 3, i.e. without X?

    Mogens

  • From: “m.roth@5-cent.us”

    They apparently do not support Linux on these models… So you might not get any help from HP support. Do you have the latest BIOS?
    Did you get a CD to run tests (like Insight Diagnostics Offline)?

    JD

  • *snip*

    Hi Mike. Are you on 32 or 64 bits ?

    If 32 bit, you might like to take a look at this, which I compiled and packaged for CentOS 5.5 32 bit. It works on 5.8 OK as well:

    Package Signing Key:
    http://www.karsites.net/CentOS/downloads/5.6/karsites-GPG-public-key-2011-03-18.asc

    32 bit binary RPM:
    http://www.karsites.net/CentOS/downloads/5.6/qps-1.9.18.6-1.i386.rpm

    Fedora 6 source code I rebuilt qps from:
    http://www.karsites.net/CentOS/downloads/5.6/qps-1.9.18.6-1.fc6.src.rpm

    If you click on the %MEM or %CPU headings, this will toggle the sort order of the running processes from highest to lowest and vice versa for those headings. The same applies to the other headings.

    Kind Regards,

    Keith Roberts

    ———————————————————

  • I have two of them, and thinking it was the hardware on the one, I moved the hard drive to the second, but the problem existed there, too. That points to something with the software, but, well, I haven’t found anything yet.

    I haven’t tried top, but that’s a good idea. I usually have one window open that is running uptime every second in a continuous loop, mainly to tell me when exactly it happens. Originally, when the problem was first noticed, we had VLSI software being run on it, but at this point, the only thing I have on the machine is the operating system, and I’m going through my step-by-step configuration until I notice the problem occurring.
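A minimal sketch of the same once-a-second check done in Python rather than with an uptime loop. It samples the monotonic clock and reports any interval much longer than expected. The names `find_gaps` and `record_ticks`, the 5-second tolerance, and the optional log path are all assumptions made for illustration. One caveat: if the loop is only watched over a remote session, a network stall and a machine stall look identical on screen, so writing the ticks to a local file is one way to tell them apart.

```python
import time


def find_gaps(timestamps, expected_interval=1.0, tolerance=5.0):
    """Return (previous, current, delta) for every pair of consecutive
    timestamps that are further apart than expected_interval + tolerance."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        if delta > expected_interval + tolerance:
            gaps.append((prev, cur, delta))
    return gaps


def record_ticks(count, interval=1.0, path=None):
    """Sample the monotonic clock `count` times, `interval` seconds apart.
    If `path` is given (hypothetical local log file), append each tick to it
    so the record survives a stalled terminal or network session."""
    ticks = []
    for _ in range(count):
        ticks.append(time.monotonic())
        if path is not None:
            with open(path, "a") as log:
                log.write(f"{ticks[-1]:.3f}\n")
        time.sleep(interval)
    return ticks


# Example: ticks one second apart, with one 31-second hole in the middle.
sample = [0.0, 1.0, 2.0, 33.0, 34.0]
for prev, cur, delta in find_gaps(sample):
    print(f"gap of {delta:.1f}s between tick {prev} and tick {cur}")
```

If the local log shows no gaps while the remote window appears frozen, that would point away from the machine itself and toward the path between it and the observer.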

  • I do, too. Things are fine on our 7900s, and the 8000-series machines we have. I’m only seeing it on these two 7800s.

    Yes. I was thinking it maybe had something to do with the graphics card, so I left it in runlevel 3, but the problem still persisted. It still may be the graphics card, though, come to think of it, so I may need to try taking it out.

  • 64. I have thought of trying 32 bit, just to see if it made a difference, but if it does, that won’t help me because we need 64 bits for the software we’re running, anyway.

  • As a followup, I’ve determined that it is network related, but I’m still not sure what the problem is. I did go back to CentOS 5.2, but the problem still exists with that version, too.

    Basically, what seems to be happening is that the network freezes for around 30 seconds, and then picks right back up. There are no errors in any logs that I can find, and processes that are running locally and that only depend on local resources keep right on going and don’t have a problem.

    I have tried using a different network card (as opposed to the one on the motherboard), but the problem happens with that, too. It almost has to be a configuration issue, or a BIOS setting, but I don’t get it.

    That sounds like a timeout of some kind. Do you have many (thousands per minute) transient network connections in normal operation? If so, you might be running into the open file limits if you haven’t bumped them up.

    Look at /etc/security/limits.conf and try adding

    * - nofile 64000
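Before changing limits.conf it may help to see what the limit currently is for a given process. A small sketch using Python's standard `resource` module; the helper names `nofile_limits` and `is_below` are made up for this example, and 64000 is simply the value suggested above. Note that limits.conf changes normally only apply to sessions started after the edit.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def is_below(limit, wanted):
    """True if `limit` is a finite value smaller than `wanted`."""
    if limit == resource.RLIM_INFINITY:
        return False  # unlimited, so never below the target
    return limit < wanted


soft, hard = nofile_limits()
print(f"soft nofile limit: {soft}, hard nofile limit: {hard}")
if is_below(soft, 64000):
    print("soft limit is below 64000")
```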

  • It’s not necessarily network hw or sw that’s at fault. I once had a similar problem caused by the (3rd party) driver of the onboard “RAID” controller. Newer driver version fixed it.

    It turned out to be something very simple, but which wasn’t obvious to check to begin with. There was another computer (a Windows machine) that was supposed to have been taken out of service a long time ago, but someone recently put it back on the network. Because it was supposed to be no longer in use, its IP address was re-allocated (a year and a half ago!) to the machine that I have been agonizing over all week.

    On someone’s suggestion, I decided to put the problem PC on a different subnet, because we thought it might be something amiss with the new networking hardware that was installed a month or so ago, and suddenly the problem went away. Some more investigation, and we discovered that the IP address was still being used, and, thus, stumbled across the actual problem.

    Thank you to all who responded!

    It’s always the simplest things, in the last place you look…
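On the wire, the conflict described in this post looks like one IP address being claimed by more than one MAC address. A rough sketch of that check follows. It assumes the (IP, MAC) pairs have already been collected somehow, for example from ARP replies seen in a packet capture; the function name `find_ip_conflicts` and the sample addresses (from the 192.0.2.0/24 documentation range) are invented for illustration and are not from the thread.

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Given an iterable of (ip, mac) pairs, return a dict mapping each IP
    that was seen with more than one MAC address to its sorted MAC list."""
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalise case so the same MAC is not counted twice.
        macs_by_ip[ip].add(mac.lower())
    return {
        ip: sorted(macs)
        for ip, macs in macs_by_ip.items()
        if len(macs) > 1
    }


# Hypothetical observations: two hosts answering for 192.0.2.10.
seen = [
    ("192.0.2.10", "00:11:22:33:44:55"),
    ("192.0.2.11", "00:11:22:33:44:66"),
    ("192.0.2.10", "AA:BB:CC:DD:EE:FF"),
    ("192.0.2.10", "00:11:22:33:44:55"),
]
for ip, macs in find_ip_conflicts(seen).items():
    print(f"{ip} is claimed by {len(macs)} MAC addresses: {', '.join(macs)}")
```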

  • Vanhorn, Mike wrote:

    Glad you got it. I’ve been *really* busy all morning – shutdown of chilled water at 0-dark-30 meant shutting all the servers down yesterday, then bringing them up once the chiller water came on – but when I saw that you’d found it was a network issue, I was literally about to respond that it might not be your system’s problem, but something on the network, when I saw your subject of SOLVED.

    Congrats.

    mark

  • Hi Mike.

    I’m pleased you got this figured out now OK.

    As you mentioned earlier it could be a network problem, I was going to suggest using Wireshark, which *could* have identified this problem for you pretty quick.

    http://en.wikipedia.org/wiki/Wireshark

    Wireshark is a free and open-source packet analyzer. It is used for network troubleshooting, analysis, software and communications protocol development, and education. Originally named Ethereal, in May 2006 the project was renamed Wireshark due to trademark issues.

    Name       : wireshark
    Arch       : i386
    Version    : 1.0.15
    Release    : 1.el5_6.4
    Size       : 40 M
    Repo       : installed (in updates repo)
    Summary    : Network traffic analyzer
    URL        : http://www.wireshark.org/
    License    : GPL

    Name       : wireshark-gnome
    Arch       : i386
    Version    : 1.0.15
    Release    : 1.el5_6.4
    Size       : 1.6 M
    Repo       : installed
    Summary    : Gnome desktop integration for wireshark and
               : wireshark-usermode
    URL        : http://www.wireshark.org/
    License    : GPL
    Description: Contains wireshark for Gnome 2 and desktop
               : integration file

    Maybe you could recreate this problem (two machines using the same IP address on the same network), then start the Wireshark GUI and see if it spots this and complains with a very informative error message?

    Kind Regards,

    Keith Roberts

    ———————————————————
