Is anyone on the list using kerberized-nfs on any kind of scale?
I’ve been fighting with this for years. In general, when we have issues with this system, they are random and/or not repeatable. I’ve had very little luck with community support. I hope I don’t offend by saying that! Rather, my belief is that these problems are very niche/esoteric, and so beyond the scope of typical community support. But I’d be delighted to be proven wrong!
So this is more of a “meta” question: anyone out there have any general recommendations for how to get support on what I presume are niche problems specific to our environment? How is paid upstream support?
Just to give a little insight into our issues: we have an in-house-developed compute job dispatching system. Say a user has hundreds of analysis jobs he wants to run: he submits them to a central master process, which in turn dispatches them to a “farm” of >100 compute nodes. All these nodes have two different krb5p NFS mounts, which the jobs read from and write to. So while the users can technically log in directly to the compute nodes, in practice they never do. The logins are only “implicit”: the job dispatching system does a behind-the-scenes SSH to kick off these processes.
Just to give some “flavor” of the kinds of issues we’re facing, what tends to crop up is one of three things:
(1) Random crashes. These are full-on kernel trace dumps followed by an automatic reboot. This was really bad under CentOS 5. A random kernel upgrade magically fixed it. It almost never happens under CentOS 6, but it happens fairly frequently under CentOS 7. (We’re completely off CentOS 5 now, BTW.)
(2) Permission denied issues. I have user Kerberos tickets configured for 70 days. But there is clearly some kind of undocumented kernel caching going on. Looking at the Kerberos server logs, it looks like it “could” be a performance issue, as I see hundreds of ticket requests within the same second when someone tries to launch a lot of jobs. Many of these will fail with “permission denied”, but if they immediately retry, it works. Related to this, I have been unable to figure out what creates and deletes the
(3) Kerberized NFS shares getting “stuck” for one or more users. We have another monitoring app (in-house developed) that, among other things, makes periodic checks of these NFS mounts. It does so by forking and running a simple “ls” command, to ensure that these mounts are alive and well. Sometimes the “ls” command gets stuck to the point where it can’t even be killed via “kill -9”. Only a reboot fixes it. But the mount is only stuck for the user running the monitoring app. Or sometimes the monitoring app is fine, but an actual user’s processes will get stuck in “D” state (which in top means uninterruptible sleep, usually waiting on IO), while everyone else’s jobs (and access to the Kerberized NFS shares) are OK.
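To make the check in (3) concrete, here is a minimal Python sketch of a probe along those lines. It is not the actual in-house monitoring app: the names probe_mount and parse_proc_state, the 10-second default timeout, and the "ok"/"error"/"stuck" labels are all made up for illustration. The one point it tries to show is that the parent stops waiting after the timeout, so a child that cannot be killed does not hang the monitor as well.

```python
import subprocess


def parse_proc_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line.

    Hypothetical helper: "D" here is the same state that top reports.
    """
    # The command name sits in parentheses and may itself contain spaces or
    # ")" characters, so split on the last ")" and take the next field.
    return stat_line.rsplit(")", 1)[1].split()[0]


def probe_mount(path, timeout=10.0):
    """Run `ls` on path in a child process and classify the outcome.

    Returns "ok" (ls exited 0), "error" (ls exited non-zero) or
    "stuck" (ls did not exit within `timeout` seconds).
    """
    child = subprocess.Popen(
        ["ls", "--", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # SIGKILL only takes effect once the child leaves uninterruptible
        # sleep, so send it but do not wait for the child a second time.
        child.kill()
        return "stuck"
    return "ok" if returncode == 0 else "error"
```

A monitor could call probe_mount once per mount and per user of interest, and log "stuck" results with a timestamp so they can be lined up against the Kerberos server logs. parse_proc_state can be applied to the contents of /proc/<pid>/stat on Linux to confirm whether a suspect process really is in "D" state.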
This is actually blocking us from upgrading to CentOS 7. But my colleagues and I are at a loss as to how to solve this. So this post is really more of a semi-desperate plea for any kind of advice. What other resources might we consider? Paid support is not out of the question (within reason). Are there any “super specialist” consultants out there who deal in Kerberized NFS?