Another Bizarre Thing…


Hi all!

I’m stuck on something really bizarre that is happening to a product I “own” at work. It’s a C program, built on CentOS, runs on CentOS or RHEL, has been in circulation since the early 00’s, is in use at hundreds of sites.

Recently, at multiple customer sites, it has started just going away. No core file (yes, ulimit is configured), nothing in any of its (several) log files. It’s just gone.

Running it under strace until it dies reveals that every thread has been given a SIGKILL.

How does one figure out who delivered a SIGKILL? For other, non-fatal, signals it is possible to glean the PID of the sending process in a signal handler, but obviously you can’t do that for SIGKILL because the app doesn’t survive the signal.
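
(For reference, here’s roughly what I mean for catchable signals; a minimal sketch using SA_SIGINFO, with made-up names, and of course SIGKILL never reaches a handler at all:)

    #define _POSIX_C_SOURCE 200809L
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Sketch: report who sent a catchable signal, using the siginfo_t data.
     * fprintf() isn't async-signal-safe, but it's fine for a debugging hack. */
    static void report_sender(int sig, siginfo_t *info, void *ctx)
    {
        (void)ctx;
        fprintf(stderr, "signal %d from pid %ld uid %ld\n",
                sig, (long)info->si_pid, (long)info->si_uid);
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = report_sender;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGTERM, &sa, NULL);   /* works for SIGTERM, SIGUSR1, etc. */

        for (;;)
            pause();                     /* wait around to be signalled */
    }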

I’m grasping at straws here, and am open to almost any kind of suggestion that can be followed up (as compared to “beats me”, which is where I am now).

I’m even wondering if systemd has something to do with it.

Thanks in advance!

22 thoughts on - Another Bizarre Thing…

  • Try checking your /var/log/messages for OOM killer log lines. If your machine is running low on memory, the OOM killer will start killing high-memory-usage programs.

    Grant

  • “has been in circulation since the early 00’s”  I assume it is not the same binary since ’00?

    SIGKILL usually comes from the kernel. Is SELinux enabled? Does the application start “automatically”, or is it started by a user?

    Ron

  • That’s nowhere near sufficient. To restore classic core file dumps on CentOS 7, you must:

    1. Remove Red Hat’s ABRT system, which wants to catch all of this and handle it directly. Say something like “sudo yum remove abrt*”

    2. Override the default sysctl telling where core dumps land by writing this file, /etc/sysctl.d/10-core.conf:

    kernel.core_pattern = /tmp/core-%e-%p
    kernel.core_uses_pid = 1
    fs.suid_dumpable = 2

    Then apply those settings with “sudo sysctl --system”.

    I don’t remember what the default is, which this overrides, but I definitely didn’t want it.

    You can choose any pattern you like, just remember what permissions the service runs under, because that’s the permission needed by the process that actually dumps the core to make the file hit the disk. That’s why I chose /tmp in this example: anyone can write there.

    3. Raise the limits by writing the following to /etc/security/limits.d/10-core.conf:

    * hard core unlimited
    * soft core unlimited

    If this is what you meant by “ulimit,” then great, but I suspect you actually meant “ulimit -c unlimited”; until you do the above, I believe the ulimit CLI app can have no effect. You have to log out and back in to make this take effect.

    Once the above is done, “ulimit -c unlimited” can take effect, but it’s of no value at all in conjunction with systemd services, for example, since those don’t run under a standard shell, so your .bash_profile and such aren’t even exec’d.

    4. If your program is launched via systemd, then you must edit /etc/systemd/system.conf and set

    DefaultLimitCORE=infinity

    then say “sudo systemctl daemon-reexec”

    Case matters; “Core” won’t work. Ask me how I know. :)

    5. If you have a systemd unit file for your service, you have to set a related value in there as well:

    LimitCORE=infinity

    You need both because #4 sets the system-wide cap, while this sets the per-service value, which can go no higher than the system cap.

    6. Restart the service to apply the above two changes.

    Yes, it really is that difficult to enable classic core dumps on CentOS 7. You’re welcome. :)

  • I was under the impression that a SIGKILL doesn’t trigger a core dump anyway. It just kills the process.

    P.

  • True; you need SIGABRT to force a core to drop.

    I posted that because if all he did was set the shell’s ulimit value, the lack of core files proves nothing; there are half a dozen other things that could be preventing them from dropping.

  • Wow, thanks for the detailed recipe!

    How did we deserve this when it was so easy in the past :-)

  • Yeah, I meant that “ulimit -c unlimited” is in effect.

    I had no idea systemd had made such a drastic change. Or is it that someone at RH decided to make it (nearly) impossible to do? I fail to see how it is beneficial to anyone to make it so hard to get core dump files.

    But thanks for the details!

    Fred



  • Fred Smith wrote:

    I had an issue a few years ago where ‘something’ was killing processes. I found it by writing a simple LD_PRELOAD hack that intercepted kill(2) and logged what it was doing via syslog before doing the actual kill, and used /etc/ld.so.preload to get it loaded by every process. (A rough sketch of the idea follows below.)

    James Pearson
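
    A minimal sketch of such a kill(2) shim, assuming an illustrative file name and build line (note it only catches callers that go through the libc wrapper, so kernel-originated SIGKILLs such as the OOM killer won’t show up):

    /* killshim.c - illustrative LD_PRELOAD shim: log kill(2) calls via
     * syslog, then forward to the real libc kill().
     * Build: gcc -shared -fPIC -o /usr/local/lib/killshim.so killshim.c -ldl
     * Load: add that path to /etc/ld.so.preload (or set LD_PRELOAD). */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <signal.h>
    #include <syslog.h>
    #include <sys/types.h>
    #include <unistd.h>

    int kill(pid_t pid, int sig)
    {
        static int (*real_kill)(pid_t, int);

        if (!real_kill)
            real_kill = (int (*)(pid_t, int))dlsym(RTLD_NEXT, "kill");

        /* Log sender, target and signal before delivering it for real. */
        syslog(LOG_INFO, "kill: pid %ld sent signal %d to pid %ld",
               (long)getpid(), sig, (long)pid);

        return real_kill(pid, sig);
    }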

  • James:

    After posting my original mail, I found this URL:

    https://www.ibm.com/developerworks/community/blogs/aimsupport/entry/Finding_the_source_of_signals_on_Linux_with_strace_auditd_or_Systemtap?lang=en

    which shows a very simple recipe for programming systemtap to report SIGKILLs, the UID that sends them, and the target process. We’ve asked the customer who is helping troubleshoot to implement that and get back to us with the result.

    I suspect systemd has something to do with it, but I have absolutely no evidence, just a nagging feeling that since it has its little fingers in all the pies, it could be doing anything and I’d have no way of knowing. :(

    I try not to be one of the systemd bashers, but I seem to be losing that battle.

    Fred

  • That only affects the shell it’s set for, which isn’t generally important for a service, since we no longer start services via shell scripts in the systemd world.

    This isn’t a systemd change, it’s a *system* change. The only reason systemd is involved is that it also has its own defaults, just as your shell does, overridden by the ulimit command. Steps 1-3 remove the system limits, then 4 & 5 remove the systemd limits under that, which can affect your service, if it’s being started via systemd.

    Core dumps are a security risk. They’re memory images of running processes. If you configure your server as I described in my recipe, every process that drops core will create a world-readable file in /tmp showing that process’s memory state, which means you can recover everything it was doing at the time of the crash.

    So, if you can find a way to make, say, PAM or sshd drop core, you’ll get live login details in debuggable form, available to anyone who can log into that box.

    You definitely want core dumps off by default.

    Enabling core dumps by default is about as sensible as enabling rsh by default.

    https://en.wikipedia.org/wiki/Remote_Shell

    We stopped doing that on production servers about 20-30 years ago, for more or less the same reason.

  • Oh, of course. Duh!

    What we’ve always done with this program is to put “ulimit -c unlimited” in the script that sets its environment and then starts the program itself. That minimizes the attack surface.

    Setting up as you described earlier, is there a way to allow only a single program to drop core?


    Fred

  • Of course.

    The * in the limits.d file is a “domain” value you can adjust to suit:

    https://www.thegeekdiary.com/understanding-etc-security-limits-conf-file-to-set-ulimit/

    You’d have to read the systemd docs to figure out the defaults for LimitCORE, but I suspect you don’t get cores until you set this on a per-service basis.

    You can also adjust the sysctl pattern path to put cores somewhere secure. That’s the normal use of absolute paths: put the cores into a dropbox directory that only root can read but anyone can write to.

    Also, I should point out that my first step, removing ABRT, is a heavy-handed method. Maybe what you *actually* want to do is learn to cooperate with ABRT rather than rip it out entirely.

  • How about “simply” disabling and stopping it?


    Fred

  • OK, more information.

    Found a recipe to cause systemtap to emit a line of text identifying the sender of the SIGKILL.

    probe signal.send {
        if (sig_name == "SIGKILL")
            printf("%s was sent to %s (pid:%d) by %s uid:%d\n",
                   sig_name, pid_name, sig_pid, execname(), uid())
    }
    Unfortunately, it says the program is killing itself:

    SIGKILL was sent to myprog (pid:12269) by myprog uid:1000

    So… now I’m wondering how one figures that out. Nowhere in my source code do I explicitly raise any signal, much less SIGKILL. So there must be some underlying library or system call or something doing it.

  • Since it’s your program you have the source code.

    printf is your friend.

    Start adding printf statements (to console and/or to a file at your option) with status reports (“widget counting executing”, “addition function executing”, “huge explosion executing”) and use that to find out where it quits. Add more printf’s as needed to narrow it down.
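
    A hypothetical trace macro along these lines keeps that manageable; opening and closing the file on every call means each line leaves the process’s stdio buffers immediately, which matters because a SIGKILL gives stdio no chance to flush them:

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical progress-trace macro: append a timestamped line to a
     * file (illustrative path) and close it right away, so the last line
     * survives even if the process is killed outright. */
    #define TRACE(msg) do {                                            \
            FILE *tf_ = fopen("/tmp/myprog-trace.log", "a");           \
            if (tf_) {                                                 \
                fprintf(tf_, "%ld %s:%d %s\n",                         \
                        (long)time(NULL), __FILE__, __LINE__, (msg));  \
                fclose(tf_);                                           \
            }                                                          \
        } while (0)

    int main(void)
    {
        TRACE("widget counting executing");
        TRACE("addition function executing");
        TRACE("huge explosion executing");
        return 0;
    }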

  • Is this on both EL6 and EL7? If only EL7, it could be control groups causing the issue. The idea of cgroups is to prevent zombie processes, but if you need your program to spawn another process then restart itself while the other process continues to run, you need to launch it in a different control group, or the shutdown of the parent process will also kill the child. In my case, we have an upgrade script which needs to get called, then shut down the calling process in order to upgrade it. For example:

    # Clear any errors in the upgrade control group.
    /bin/systemctl reset-failed upgrade-trigger

    # Launch the upgrader in its own control group.
    /bin/systemd-run --unit=upgrade-trigger --slice=upgrade-trigger /bin/bash /opt/myapp/Upgrade.sh "$1" "$2"

    If we don’t do this, the upgrade fails as the upgrader gets terminated when the parent application is shut down.

    Gregory Young


  • Well, we aren’t INTENTIONALLY using control groups. Do we get put into one by the very act of launching a program which then creates threads, and they then all coexist until they’re told to stop?

    I think it’s not the scenario you describe: the main program launches from an init script, does some sanity checks, loads some config files, then spawns the number of threads defined by its configuration. Then all the threads, including the main program, hang around doing stuff until they’re told to stop, which happens all at once for all of them. On a good day, anyway. What is happening now is they will all run fine for some time (an hour or twelve), then they all receive a SIGKILL.

    According to a systemtap script I found online, it thinks the program is killing itself, but as the guy who wrote it, I don’t think so. The script can be seen in my earlier mail.

    As for whether it also fails on C6, I don’t know. I’ve asked our support team to see if they have a C6/EL6 customer who will let them install the latest version for 6 and see what happens, but so far, no joy.

    Fred

  • Hi Fred,

    Yep, that’s exactly how control groups work in CentOS 7. You don’t need to define them (normally), they get assigned when the init script or systemd service launches it. As I mentioned, the idea is to ensure none of those child threads become zombies if the parent dies/crashes/gets killed. For troubleshooting, you could try moving the child threads into their own cgroup, which might help reduce the noise when the parent process gets killed. Of course, you will have to manually kill the child processes during this testing, but it might clear enough of the strace logging for you to see where the parent process is getting killed. Don’t forget to undo this debugging step when done, or you will end up with zombies when you legitimately want to shut down the process.

    Also, if you haven’t already, you may want to convert it to launch via a systemd “.service” file. That gives you a lot of control over startup timeouts, restarts, shutdown commands, process branching, etc. If nothing else, it might help you identify when the process dies, and restart it without intervention…

    Gregory Young


  • Late to the thread but since it has not been suggested: Have you tried to statically link all libs?

    Then use Frank Cox’s suggestion to add printf’s at locations throughout the source code.

    I know it will be big (depending on the number of libs), but this way you are sure that the compile is against a known set of libs (yours)!

    Also have you recompiled it and given the new binaries to the customers?

    Just an idea ..

  • I doubt modern Linux systems will produce a fully-static binary, since many of the system libs come only as .so files.

    Yes, every time a new RHEL/CentOS version is released, it gets completely rebuilt on that new release. I don’t depend on compatibility between releases. Not to mention rebuilds as maintenance and feeping-creaturism* strike.

    * for those not in the know: feeping-creaturism ==> creeping featurism

  • I know that. It’s just a question of how keen you are to find the reason… especially if you have no control over what libraries (even i686) are installed on the other machines.

    Depending on how many libraries the binary uses, you could download them and use those as the source for inclusion. You could omit the obvious libs for starters… and then even include those if it’s still crashing. You only need to distribute those binaries to the people who have problems…

    If a couple of those customers (with failing programs) are helpful, get a “yum list installed” from them and scan the list of libs to see whether something might raise eyebrows.

    For example, I had one of my machines failing on one program because it had “glibc.i686” installed due to ftdi. I changed the program using the ftdi libs to use full x86_64 (took me a few hours), uninstalled “glibc.i686”, and suddenly the other program had no problems!

    I know you can’t tell people to uninstall, but static linking MIGHT help.