CentOS 6 Spontaneous Reboots


May 29, 2016 Bill Gee CentOS 4 Comments

Hello everyone –

My CentOS 6.8 server has been rebooting itself every 2 to 4 hours for the last several days. I do not know where to look for logs that might give a clue as to what the problem is. There are no unusual entries in /var/log/messages. I looked over other log files in /var/log and found nothing suggestive. Where else can I look?

By luck I saw the beginning of a reboot on the server console. Normally I have other systems up on the KVM switch. It appears to have dumped core. I don’t know where to look for the core dump files. They are not in /root.

The problem started while the server was still 6.7. It had almost 290 days of uptime when the problem started. I have tried the following, none of which made any difference.

I ran the upgrade to 6.8.

I tried stopping non-essential services two at a time.

I ran MemTest 86+. No memory errors were found.

I turned off swap.

I unplugged the USB hard drive that I use to hold daily and weekly backups. It was being recognized as /dev/sda for some reason.

lm_sensors shows the processor running between 45 and 50C. hddtemp shows the hard drives running between 35 and 40C. lm_sensors does not produce valid data on fan speed, but a visual check shows all fans running normally and no build-up of dust.

The system is behind a big UPS that runs several other systems. The UPS log file does not record any power failures and none of the other systems are rebooting at random.

What else can I look at?

Thanks – Bill Gee

4 thoughts on - CentOS 6 Spontaneous Reboots

Keith Keller says:

May 29, 2016 at 7:50 pm

Hi Bill,

One place you might check is under /var/lib. I think there may be a /var/lib/crash directory which contains core dumps.

Another option is to try Advanced Clustering's Breakin, which runs other tests besides memory.

http://www.advancedclustering.com/products/software/breakin/

I’ve had it find problems that memtest hasn’t (and vice-versa).

If the system supports IPMI, check those sensors and logs; there may be something useful there. If you don’t have IPMI, there may still be something in the BIOS logs (how you get to those varies wildly; you may need to boot into the BIOS to do it).

I hope that helps!

–keith
Francis Mendoza says:

May 29, 2016 at 7:59 pm

Check the system's hardware health. A faulty component could be triggering the reboots, or the processor may be overheating. Check that the hardware fans are still working.

Anthony K says:

May 29, 2016 at 10:20 pm

TL;DR

sar -m TEMP | less
(sar can be found in the sysstat package)

Bill Gee says:

May 30, 2016 at 2:47 pm

Hello everyone –

I found the core dumps. They are in /var/crash. This directory contains a directory for each crash, named by IP address-date-time. Each directory contains a vmcore and a vmcore-dmesg.txt file.

The vmcore-dmesg.txt files are mostly the kernel initialization stuff, same as you would see in dmesg. At the end, though, is some information about the process that was executing when the crash happened.
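For anyone else hunting through these, a rough sketch of how to view just the end of each vmcore-dmesg.txt in one pass follows. The function name `show_crash_tails` and the default of 40 lines are my own choices for illustration; the /var/crash layout is the one described above.

```shell
# Print the last lines of every vmcore-dmesg.txt under a crash directory.
# show_crash_tails is a hypothetical helper name, not part of kdump.
show_crash_tails() {
    crash_dir="${1:-/var/crash}"   # location reported in this thread
    tail_lines="${2:-40}"          # arbitrary default; adjust as needed
    for f in "$crash_dir"/*/vmcore-dmesg.txt; do
        # Skip the unexpanded glob when no dumps exist
        [ -f "$f" ] || continue
        echo "== $f"
        tail -n "$tail_lines" "$f"
    done
}

show_crash_tails /var/crash 40
```

Reading /var/crash normally requires root, so run it from a root shell or through sudo.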

I reviewed several of those and found a common process – aiccu! That seems very odd since I have been running aiccu and Sixxs for over five years. It has never given me any trouble before. The package I have on this server came from the EPEL repository and has not changed for several years. The Sixxs web site also shows no change in aiccu for many years.
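Rather than reading each file by hand, the comparison can probably be automated along these lines. This is only a sketch: it assumes the crash report contains a line of the form "Pid: N, comm: NAME ...", which is what I believe the CentOS 6 kernel prints, and `count_crash_comms` is a made-up helper name.

```shell
# Count which process name shows up in each crash dump.
# Assumption: the oops text contains "comm: NAME"; the last such
# line in each file is taken as the process running at crash time.
count_crash_comms() {
    crash_dir="${1:-/var/crash}"
    for f in "$crash_dir"/*/vmcore-dmesg.txt; do
        [ -f "$f" ] || continue
        grep 'comm: ' "$f" | tail -n 1 | sed 's/.*comm: \([^ ]*\).*/\1/'
    done | sort | uniq -c | sort -rn
}

count_crash_comms /var/crash
```

The output is one line per process name with the number of dumps it appeared in, most frequent first.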

I also found, by chance, an operation that seems to always trigger the crash. If I go to my main workstation (Fedora 23) and tell Akregator to “refresh all feeds”, that is guaranteed to produce a crash. There are probably other operations that can force a crash, but I have not found them.

For now I have turned off ipv6 forwarding and stopped the radvd service. That should keep aiccu from handling anything.
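In command form, the workaround looks roughly like the sketch below. The `run` wrapper and the `APPLY` variable are hypothetical additions so the script only prints what it would do by default; the sysctl key and the radvd service name are the standard ones on CentOS 6 as far as I know.

```shell
# Dry run by default: prints the commands instead of executing them.
# Set APPLY=1 and run as root to actually apply the changes.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

disable_v6_routing() {
    # Stop forwarding IPv6 packets between interfaces
    run sysctl -w net.ipv6.conf.all.forwarding=0
    # Stop sending router advertisements to the LAN
    run service radvd stop
}

disable_v6_routing
```

Neither change survives a reboot as written; making it permanent would need the matching entries in /etc/sysctl.conf and chkconfig.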

It is nice to know it is not some funky hardware problem. Still, it would be nice to have it working. Any thoughts?

Thanks – Bill Gee
