NFS Help

We have one system running CentOS7 that is the NFS server. There are 50
external machines that FTP files to this server fairly continuously.

We have another system running CentOS6 that mounts the partition the files are FTP-ed to using NFS.

There is a python script running on the NFS client machine that is reading these files and moving them to a new dir on the same file system (a mv not a cp).

Almost daily this script hangs while reading a file – sometimes it never comes back and cannot be killed, even with -9. Other times it hangs for half an hour and then proceeds.

Coinciding with the hanging I see this message on the NFS server host:

nfsd: peername failed (error 107)

And on the NFS client host I see this:

nfs: V4 server returned a bad sequence-id
nfs state manager – check lease failed on NFSv4 server with error 5

The first client message is always at the same time as the hanging starts. The second client message comes 20 minutes later. The server message comes 4 minutes after that. Then 3 minutes later the script un-hangs (if it’s going to).

Can anyone shed any light on to what could be happening here and/or what I
could do to alleviate these issues and stop the script from hanging?
Perhaps some NFS config settings? We do not have any, so we are using the defaults.

Thanks!

34 thoughts on - NFS Help

  • Sorry for being dense, but I am not a sys admin, I am a programmer and we have no sys admin. I don’t know what you mean by your question. I am NFS mounting to whatever the default filesystem would be on a CentOS6 system.

  • Larry Martell wrote:

    This *is* a sysadmin issue. Each partition is formatted as a specific type of filesystem. The standard Linux filesystems for Upstream-descended distributions have been ext3, then ext4, and now xfs. Tools to manipulate xfs will not work with extx, and vice versa.

    cat /etc/fstab on the systems, and see what they are. If either is xfs, and assuming that the systems are on UPSes, then the fstab entry, which controls drive mounting on a system, should have nobarrier,inode64 instead of “defaults”.

    Note that the inode64 is relevant if the filesystem is > 2TB.
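
    For example, a hypothetical fstab entry (device and mount point here are placeholders, adjust for your layout) would look something like:

    /dev/sdb1   /data   xfs   nobarrier,inode64   0 0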

    The reason I say this is that when we started rolling out CentOS 7, we tried to put one of our users’ home directories on one, and it was a disaster.
    100% repeatably, untarring a 100M tarfile onto an nfs-mounted drive took seven minutes, where before it had taken 30 seconds. Timed. It took us months to discover that NFS 4 tries to make transactions atomic, which is fine if you’re worrying about losing power or connectivity. We’re on a UPS, and hardwired, so adding nobarrier immediately brought it down to 40
    seconds or so.

    mark

  • I know it’s a sysadmin issue. I wish we had one, but we don’t and I am the one being asked to fix things.

    I have no remote access to this system, only on site, so it will have to wait until Monday for me to check. (The system is in Japan and I
    traveled from NY to Japan, where I am now, just to fix this issue.)

    Thanks for the reply.

  • To be clear: the python script is moving files on the same NFS file system? E.g., something like

    mv /mnt/nfs-server/dir1/file /mnt/nfs-server/dir2/file

    where /mnt/nfs-server is the mount point of the NFS server on the client machine?

    Or are you moving files from the CentOS 7 NFS server to the CentOS 6 NFS client?

    If the former, i.e., you are moving files to and from the same system, is it possible to completely eliminate the C6 client system, and just set up a local script on the C7 server that does the file moves? That would cut out a lot of complexity, and also improve performance dramatically.

    Also, what is the size range of these files? Are they fairly small
    (e.g. 10s of MB or less), medium-ish (100s of MB) or large (>1GB)?

    Timeouts relating to NFS are the worst.

    I’ve been wrangling with NFS for years, but unfortunately those particular messages don’t ring a bell.

    The first thing that came to my mind is: how does the Python script running on the C6 client know that the FTP upload to the C7 server is complete? In other words, if someone is uploading “fileA”, and the Python script starts to move “fileA” before the upload is complete, then at best you’re setting yourself up for all kinds of confusion, and at worst file truncation and/or corruption.

    Making a pure guess about those particular errors: is there any chance there is a network issue between the C7 server and the C6 client?
    What is the connection between those two servers? Are they physically adjacent to each other and on the same subnet? Or are they on opposite ends of the globe connected through the Internet?

    Clearly two machines on the same subnet, separated only by one switch is the simplest case (i.e. the kind of simple LAN one might have in his home). But once you start crossing subnets, then routing configs come into play. And maybe you’re using hostnames rather than IP
    addresses directly, so then name resolution comes into play (DNS or
    /etc/hosts). And each switch hop you add requires that not only your server network config be correct, but your switch config as well. And if you’re going over the Internet, well… I’d probably try really hard to not use NFS in that case! :)

    Do you know if your NFS mount is using TCP or UDP? On the client you can do something like this:

    grep nfs /proc/mounts | less -S

    And then look at what the “proto=XXX” says. I expect it will be either “tcp” or “udp”. If it’s UDP, modify your /etc/fstab so that the options for that mountpoint include “proto=tcp”. I *think* the default is now TCP, so this may be a non-starter. But the point is, based purely on the conjecture that you might have an unreliable network, TCP would be a better fit.
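
    For example, a hypothetical fstab entry (server name, export path and mount point are placeholders, and the other options are just common values, not a tested recommendation) might look like:

    nfs-server:/export   /mnt/nfs-server   nfs4   proto=tcp,hard,timeo=600,retrans=2   0 0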

    I hate to simply say “RTFM”, but NFS is complex, and I still go back and re-read the NFS man page (“man nfs”). This document is long and very dense, but it’s worth at least being familiar with its content.

    In my experience, delays that happen on consistent time intervals that are on the order of minutes tend to smell of some kind of timeout scenario. So the question is, what triggers the timeout state?

    My general rule of thumb is “defaults are generally good enough; make changes only if you understand their implications and you know you need them (or temporarily as a diagnostic tool)”.

    But anyway, my hunch is that there might be a network issue. So I’d actually start with basic network troubleshooting. Do an “ifconfig”
    on both machines: do you see any drops or interface errors? Do
    “ethtool <interface>” on both machines to make sure both are linked up at the correct speed and duplex. Use a tool like netperf to check bandwidth between both hosts. Look at the actual detailed stats: do you see huge outliers or timeouts? Do the test with both TCP and UDP; performance should be similar, with a (typically) slight gain for UDP. Do you see drops with UDP?
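
    For example (eth0 is just a placeholder for whatever your interfaces are actually called):

    ip -s link show eth0                           # per-interface stats: look for errors/dropped
    ethtool eth0 | grep -iE 'speed|duplex|link'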

    What’s the state of the hardware? Are they ancient machines cobbled together from spare parts, or reasonable decent machines? Do they have adequate cooling and power? Is there any chance they are overheating (even briefly) or possibly being fed unclean power (e.g. small voltage aberrations)?

    Oh, also, look at the load on the two machines… are these purpose-built servers, or are they used for other numerous tasks?
    Perhaps one or both is overloaded. top is the tool we use instinctively, but also take a look at vmstat and iostat. Oh, also check “free”, make sure neither machine is swapping (thrashing). If you’re not already doing this, I would recommend setting up “sar”
    (from the package “sysstat”) and setting up more granular logging than the default. sar is kind of like a continuous iostat+free+top+vmstat+other system load tools rolled into one that continually writes this information to a database. So for example, next time this thing happens, you can look at the sar logs to see if any particular metric went significantly out-of-whack.
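
    For example, on a stock install the collection interval is driven by a cron entry, something along these lines (exact path may differ):

    # /etc/cron.d/sysstat -- default collects once every 10 minutes
    */10 * * * * root /usr/lib64/sa/sa1 1 1
    # change to */1 to collect every minute for more granular history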

    That’s all I can think of for now. Best of luck. You have my sympathy… I’ve been administering Linux both as a hobby and professionally for longer than I care to admit, and NFS still scares me. Just be thankful you’re not using Kerberized NFS. ;)

  • I’ve done a lot of ‘lite’ sysadmin in my time, but it was way back in the Unix days on SunOS and even before that. I actually used to have that book, as well as the ‘Devil’ book. But I’ve moved many times, and got rid of all my books as I was googling everything anyway. I have worked with many really great sysadmins in my time, and I know how valuable they are. So I would never pass myself off as one.

    A client of mine (who will remain nameless) has a client (who will also remain nameless), and they are so concerned about corporate espionage and loss of trade secrets, they do not permit any remote access to their systems. Their site is so locked down, you cannot bring in any device of any kind.

    Thanks!

  • Larry Martell wrote:
    “We have no sys admin … I am the one being asked to fix things.”

    Seriously, get a really good book on doing sysadmin work. 10 years ago, I was recommending O’Reilly’s “Essential System Administration”, by Aeleen Frisch. It saved my butt, I assure you, 21 years ago, when I had the same thing happen.

    wait until Monday for me to check. (The system is in Japan and I
    traveled from NY to Japan, where I am now, just to fix this issue.)

    Oh, geez, they want you to fix this… but you can’t SSH in? This is *so*
    stupid. Here, fix this, and before you start, let me put an eyepatch over one eye, tie one hand behind your back, and prevent you from using manpages….

    Best of luck, guy.

    mark

  • Hi Matt-

    Thank you for this very detailed and thoughtful reply.

    Correct.

    No, the files are FTP-ed to the CentOS 7 NFS server and then processed and moved on the CentOS 6 NFS client.

    The problem doing that is the files are processed and loaded to MySQL
    and then moved by a script that uses the Django ORM, and neither django, nor any of the other python packages needed are installed on the server. And since the server does not have an external internet connection (as I mentioned in my reply to Mark) getting it set up would require a large amount of effort.

    Also, we have this exact same setup on over 10 other systems, and it is only this one that is having a problem. The one difference with this one is that the server is CentOS7 – on all the other systems both the NFS server and client are CentOS6.

    Small – They range in size from about 100K to 6M.

    The python script checks the modification time of the file, and only if it has not been modified in more than 2 minutes does it process it. Otherwise it skips it and waits for the next run to potentially process it. Also, the script can tell if the file is incomplete in a few different ways. So if it has not been modified in more than 2
    minutes, the script starts to process it, but if it finds that it’s incomplete it aborts the processing and leaves it for next time.
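
    From the shell, that rule is equivalent to something like this (the path is just an example):

    # list only files that have not been modified in the last 2 minutes
    find /mnt/nfs-server/dir1 -type f -mmin +2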

    Actually both the client and server are virtual machines running on one physical machine. The physical machine is running CentOS7. There is nothing else running on the physical machine other than the 2 VMs.

    I assume TCP, but I will check tomorrow when I am on site.

    Yes, I agree. I skimmed it last week, but I will look at it in detail tomorrow.

    I would like to try increasing the timeout.

    The hardware is new, and is in a rack in a server room with adequate and monitored cooling and power. But I just found out from someone on site that there is a disk failure, which happened back on Sept 3. The system uses RAID, but I don’t know what level. I was told it can tolerate 3 disk failures and still keep working, but personally, I
    think all bets are off until the disk has been replaced. That should happen in the next day or 2, so we shall see.

    I’ve been watching and monitoring the machines for 2 days and neither one has had a large CPU load, nor has either been using much memory.

    That is a good idea, I will check those logs, and set up better logging.

    Thanks!
    Larry

  • I misspoke here – the CentOS7 NFS server is running on the physical hardware, it’s not a VM. The CentOS6 client is a VM.

  • The server’s filesystem is xfs (on the client it is mounted as nfs). The server does have inode64
    specified, but not nobarrier.

    The file system is 51TB.

    We are not seeing a performance issue – do you think nobarrier would help with our lock up issue? I wanted to try it but my client did not want me to make any changes until we got the bad disk replaced. Unfortunately that will not happen until Wednesday.

  • On the client the settings are:

    nfs4 rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.2.200,minorversion=0,local_lock=none,addr=192.168.2.102
    0 0

    On the server the settings are:

    xfs rw,relatime,inode64,noquota 0 0

    Do these look OK?

    None on the client. On the server it has 1 dropped Rx packet.

    That reports only “Link detected: yes” for both client and server.

    sar seems to be running, but I can only get it to report on the current day. The man page shows start and end time options, but is there a way to specify the start and end date?

  • Using “nobarrier” might increase overall write throughput, but it removes an important integrity feature, increasing the risk of filesystem corruption on power loss. I wouldn’t recommend doing that unless your system is on a UPS, and you’ve tested and verified that it will perform an orderly shutdown when the UPS is on battery power and its charge is low.

  • Gordon Messmer wrote:
    As I noted in my original post, it needs to be on a UPS. And to repeat myself: untarring a 107MB tarfile on an xfs filesystem mounted over NFS took ->seven minutes<-, 100% repeatable, while after we added nobarrier and remounted it, it was about ->40 seconds<-. That’s *hugely* significant.

    mark

  • I apologize if I’m being dense here, but I’m more confused on this data flow now. Your use of “correct” and “no” seems to be inconsistent with your explanation. Sorry!

    At any rate, what I was looking at was seeing if there was any way to simplify this process, and cut NFS out of the picture. If you need only to push these files around, what about rsync?

    …right, but I’m pretty sure rsync should be installed on the server;
    I believe it’s default in all except the “minimal” setup profiles. Either way, it’s trivial to install, as I don’t think it has any dependencies. You can download the rsync rpm from mirror.CentOS.org, then scp it to the server, then install via yum. And Python is definitely installed (requirement for yum) and Perl is probably installed as well, so with rsync plus some basic Perl/Python scripting you can create your own mover script.

    Actually, rsync may not even be necessary, scp may be sufficient for your purposes. And scp should definitely be installed.
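
    As a rough sketch (host name and paths are made up, not a tested recommendation), a cron-driven mover on the C6 box could be as simple as:

    # pull finished uploads from the C7 server, deleting them from the source as they arrive
    rsync -av --remove-source-files c7server:/data/ftp_incoming/ /data/to_process/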

    From what you’ve described so far, with what appears to be a relatively simple config, C6 or C7 “shouldn’t” matter. However, under the hood, C6 and C7 are quite different.

    This script runs on C7 or C6?

    OK, depending on the RAID scheme and how it’s implemented, there could be disk timeouts causing things to hang.

    How about iostat? Also, good old “dmesg” can suggest if the system with the failed drive is causing timeouts to occur.

    OK, but ethtool should also say something like:

    …
    Speed: 1000Mb/s
    Duplex: Full

    For a 1gbps network. If Duplex is reported as “half”, then that is definitely a problem. Using netperf is further confirmation of whether or not your network is functioning as expected.

    If you want to report on a day in the past, you have to pass the file argument, something like this:

    sar -A -f /var/log/sa/sa23 -s 07:00:00 -e 08:00:00

    That would show you yesterday’s data between 7am and 8am. The files in /var/log/sa/saXX are the files that correspond to the day. By default, XX will be the day of the month.

  • I thought you were asking, “Are you doing A: moving files on the same NFS filesystem, or B: moving them across filesystems?”

    And I replied, “Correct, I am doing A; no, I am not doing B.”

    The script moves the files from /mnt/nfs-server/dir1/file to
    /mnt/nfs-server/dir2/file.

    It’s not just moving files around. The files are read, and their contents are loaded into a MySQL database.

    This site is not in any way connected to the internet, and you cannot bring in any computers, phones, or media of any kind. There is a process to get machines or files in, but it is onerous and time consuming. This system was set up and configured off site and then brought on site.

    To run the script on the C7 NFS server instead of the C6 NFS client, many python libs will have to be installed. I do have someone off site working on setting up a local yum repo with what I need, and then we are going to see if we can zip and email the repo and get it on site. But none of us are sys admins and we don’t really know what we’re doing, so we may not succeed, and it may take longer than I will be here in Japan (I am scheduled to leave Saturday).

    C6

    Yes, that’s why when I found out about the disk failure I wanted to hold off doing anything until the disk gets replaced. But as that is not happening until Wednesday afternoon, I think I want to try Mark’s nobarrier config option today.

    Nothing in dmesg or /var/log/messages about the failed disk at all. I
    only saw that when I got on the Integrated Management Module console. But the logs only go back to Sep 21 and the disk failed on Sep 3. The logs only have the NFS errors, no other errors.

    No, it outputs just the one line:

    Link detected: yes

    OK, Thanks.

  • On what server does the MySQL database live?

    But clearly you have a means to log in to both the C6 and C7 servers, right? Otherwise, how would you be able to see these errors, check top/sar/free/iostat/etc?

    And if you are logging in to both of these boxes, I assume you are doing so via ssh?

    Or are you actually physically sitting in front of these machines?

    If you have SSH access to these machines, then you can trivially copy files to/from them. If SSH is installed and working, then scp should also be installed and working. Even if you don’t have scp, you can use tar over SSH to the same effect. It’s ugly, but doable, and there are examples online for how to do it.
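
    For example (user, host and paths are placeholders):

    # copy a whole directory tree over SSH without scp or rsync
    tar czf - /data/dir1 | ssh user@c6host 'tar xzf - -C /data'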

    Also: you made a couple of comments about these machines; it looks like the C7 box (FTP server + NFS server) is running bare metal (i.e. not a virtual machine). The C6 instance (NFS client) is virtualized. What host is the C6 instance running on?

    Is the C6 instance running under the C7 instance? I.e., are both machines on the same physical hardware? If that is true, then your
    “network” (at least the one between C7 and C6) is basically virtual, and to have issues like this on the same physical box is certainly indicative of a mis-configuration.

    Right, but my point is you can write your own custom script(s) to copy files from C7 to C6 (based on rsync or ssh), do the processing on C6
    (DB loading, whatever other processing), then move back to C7 if necessary. You said yourself you are a programmer not a sysadmin, so change the nature of the problem from a sysadmin problem to a programming problem.

    I’m certain I’m missing something, but the fundamental architecture doesn’t make sense to me given what I understand of the process flow.

    Were you able to run some basic network testing tools between the C6
    and C7 machines? I’m interested specifically in netperf, which does round trip packet testing, both TCP and UDP. I would look for packet drops with UDP, and/or major performance outliers with TCP, and/or any kind of timeouts with either protocol.

    How is name resolution working on both machines? Do you address machines by hostname (e.g., “my_c6_server”), or explicitly by IP
    address? Are you using DNS or are the IPs hard-coded in /etc/hosts?

    To me it still “smells” like a networking issue…

    -Matt

  • Ah. I see that now. Still, may I suggest that whenever we recommend remedies that eliminate reliability measures, such as mounting with
    “nobarrier”, we also repeat caveats so that users who find these conversations in search results later don’t miss them? I think it’s important to note that the system should be on a UPS, *and* that it has been verified that the system will perform an orderly shut-down before the UPS loses charge. “nobarrier” shouldn’t be used without performing such a test.

  • Another alternative idea: you probably won’t be comfortable with this, but check out systemd-nspawn. There are lots of examples online, and I even wrote about how I use it:
    http://raw-sewage.net/articles/fedora-under-CentOS/

    This is unfortunately another “sysadmin” solution to your problem. nspawn is the successor to chroot, if you are at all familiar with that. It’s kinda-sorta like running a system-within-a-system, but much more lightweight. The “slave” systems share the running kernel with the “master” system. (I could say the “guest” and “host”
    systems, but those are virtual machine terms, and this is not a virtual machine.) For your particular case, the main benefit is that you can natively share filesystems, rather than use NFS to share files.

    So, it’s clear you have network capability between the C6 and C7
    systems. And surely you must have SSH installed on both systems. Therefore, you can transfer files between C6 and C7. So here’s a way you can use systemd-nspawn to get around trying to install all the extra libs you need on C7:

    1. On the C7 machine, create a systemd-nspawn container. This container will “run” C6.
    2. You can source everything you need from the running C6 system directly. Heck, if you have enough disk space on the C7 system, you could just replicate the whole C6 tree to a sub-directory on C7.
    3. When you configure the C6 nspawn container, make sure you pass through the directory structure with these FTP’ed files. Basically you are substituting systemd-nspawn’s bind/filesystem pass-through mechanism in place of NFS.
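
    A very rough sketch of those steps (host name, paths and the bound directory are all placeholders, not a tested recipe):

    # 1/2: replicate the running C6 root into a directory on the C7 host
    mkdir -p /var/lib/machines/c6
    rsync -aAX --exclude={"/proc/*","/sys/*","/dev/*","/run/*","/tmp/*"} root@c6host:/ /var/lib/machines/c6/
    # 3: start the container, bind-mounting the directory the FTP uploads land in
    systemd-nspawn -D /var/lib/machines/c6 --bind=/data/ftp_incoming /bin/bash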

    With that setup, you can “probably” run all the C6 native stuff under C7. This isn’t guaranteed to work, e.g. if your C6 programs require hooks into the kernel, it could fail, because now you’re running on a different kernel… but if you only use userspace libraries, you’ll probably be OK. But I was actually able to get HandBrake, compiled for bleeding-edge Ubuntu, to work within a C7 nspawn container.

    That probably trades one bit of complexity (NFS) for another
    (systemd-nspawn). But just throwing it out there if you’re completely stuck.

  • The C6 host, same one that the script runs on. We can of course access the MySQL server from the C7 host, assuming the needed packages are there.

    The machines are on a local network. I access them with putty from a windows machine, but I have to be at the site to do that.

    Correct.

    Yes, the C6 instance is running on the C7 machine. What could be mis-configured? What would I check to find out?

    Yes, that is a potential solution I had not thought of. The issue with this is that we have the same system installed at many, many sites, and they all work fine. It is only this site that is having an issue. We really do not want to have different SW running at just this one site. Running the script on the C7 host is a change, but at least it will be the same software as every place else.

    netperf is not installed.

    Everything is by ip address.

  • So that means when you are offsite there is no way to access either machine? Does anyone have a means to access these machines from offsite?

    OK, so these two machines are actually the same physical hardware, correct?

    Do you know, is the networking between the two machines “soft”, as in done locally on the machine (typically through NAT or bridging)? Or is it “hard”, in that you have a dedicated NIC for the host and a separate dedicated NIC for the guest, and actual cables going out of each interface and connected to a switch/hub/router? I would expect the former…

    If it truly is a “soft” network between the machines, then that is more evidence of a configuration error. Now, unfortunately, as to what to look for: I have virtually no experience setting up C6 guests on a C7 host; at least not enough to help you troubleshoot the issue. But in general, you should be able to hit up a web search and look for howtos and other documents on setting up networking between a C7 host and its guests. That will allow you to (1) understand how it’s currently set up, (2) verify if there is any misconfig, and (3) correct or change if needed.

    IIRC, you said this is the only C7 instance? That would mean it is already not the same as every other site. It may be conceptually the same, but “under the hood”, there are a tremendous number of changes between C6 and C7. Effectively every single package is different, from the kernel all the way to trivial userspace tools.

    Again, if you can use putty (which is ssh) to access these systems, you implicitly have the ability to upload files (i.e. packages) to the systems. A simple tool like netperf should have few (if any)
    dependencies, so you don’t have to mess with mirroring the whole CentOS repo. Just grab the netperf rpm file from wherever, then use scp (I believe it’s called pscp when part of the Putty package) to copy to your servers, yum install and start testing.
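
    For example (package file name, user and host are placeholders):

    pscp netperf.rpm root@c7server:/tmp/
    # then, logged in on the server:
    yum localinstall /tmp/netperf.rpm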

  • Yes.

    I don’t know, but would also guess the former.

    Yes, of course it’s different at that level. But I was talking about our application software and set up. It is that that I want to keep consistent across deployments.

    Again, no machine on the internal network that my 2 CentOS hosts are on is connected to the internet. I have no way to download anything. There is an onerous and protracted process to get files into the internal network, and I will see if I can get netperf in.

  • Right, but do you have physical access to those machines? Do you have physical access to the machine on which you use PuTTY to connect to those machines? If yes to either question, then you can use another system (that does have Internet access) to download the files you want, put them on a USB drive (or burn to a CD, etc), and bring the USB/CD to the C6/C7/PuTTY machines.

    There’s almost always a technical way to get files on to (or out of) a system. :) Now, your company might have *policies* that forbid skirting around the technical measures that are in place.

    Here’s another way you might be able to test network connectivity between C6 and C7 without installing new tools: see if both machines have “nc” (netcat) installed. I’ve seen this tool referred to as “the swiss army knife of network testing tools”, and that is indeed an apt description. So if you have that installed, you can hit up the web for various examples of its use. It’s designed to be easily scripted, so you can write your own tests, and in theory implement something similar to netperf.
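
    For example, a crude throughput test (the port number is arbitrary; you may need to Ctrl-C the listener when the transfer finishes):

    # on the receiving machine:
    nc -l 5001 > /dev/null
    # on the sending machine: push 1GB through the network and time it
    time dd if=/dev/zero bs=1M count=1024 | nc other-host 5001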

    OK, I just thought of another “poor man’s” way to at least do some sanity testing between C6 and C7: scp. First generate a huge file. General rule of thumb is at least 2x the amount of RAM in the C7 host. You could create a tarball of /usr, for example (e.g. “tar czvf
    /tmp/bigfile.tar.gz /usr” assuming your /tmp partition is big enough to hold this). Then, first do this: “time scp /tmp/bigfile.tar.gz localhost:/tmp/bigfile_copy.tar.gz”. This will literally make a copy of that big file, but will route through most of the network stack. Make a note of how long it took. And also be sure your /tmp partition is big enough for two copies of that big file.

    Now, repeat that, but instead of copying to localhost, copy to the C6
    box. Something like: “time scp /tmp/bigfile.tar.gz <c6-host>:/tmp/”. Does the time reported differ greatly from when you copied to localhost? I would expect them to be reasonably close.
    (And this is another reason why you want a fairly large file, so the transfer time is dominated by actual file transfer, rather than the overhead.)

    Lastly, do the reverse test: log in to the C6 box, and copy the file back to C7, e.g. “time scp /tmp/bigfile.tar.gz <c7-host>:/tmp/bigfile_copy2.tar.gz”. Again, the time should be approximately the same for all three transfers. If either or both of the latter two copies take dramatically longer than the first, then there’s a good chance something is askew with the network config between C6 and C7.

    Oh… all this time I’ve been jumping to fancy tests. Have you tried the simplest form of testing, that is, doing by hand what your scripts do automatically? In other words, simply try copying files between C6
    and C7 using the existing NFS config? Can you manually trigger the errors/timeouts you initially posted? Is it when copying lots of small files? Or when you copy a single huge file? Any kind of file copying “profile” you can determine that consistently triggers the error? That could be another clue.

    Good luck!

  • I am sorry, I am stepping into the conversation late and may not fully understand all aspects of the situation, but I wonder if it may make sense to set up a server process on the NFS server machine that simply listens for incoming requests to perform a file copy and then does so as requested
    – only locally. If the files in question are large – which I suspect they may be, given the timeouts becoming an issue – that may resolve the issue and help speed things up at the same time.

    Cheers,

    Boris.

  • Finally got to add nobarrier (I’ll skip why it took so long), and it looks like this just caused the problem to morph a bit.

    On the C7 NFS server, besides having 50 external machines ftp-ing files to it, we run 2 jobs: 1 that moves files around (called image_mover) and one that changes perms on some files (called chmod_job).

    And on the C6 NFS client, besides the job that was hanging (called the importer), we also run another job (called ftp_job) that ftps files to the C6 machine. The ftp_job had never hung before, but now the importer that used to hang has not (yet) hung, and the ftp_job that had not hung before now is hanging.

    But the system messages are different.

    On the C7 server there is a series of messages of the form ‘task blocked for >120 seconds’ with a stack trace. There is one for each of the following:

    nfsd, chmod_job, kworker, pure_ftpd, image_mover

    In each of the stack traces they are blocked on either nfs_write or nfs_flush

    And on the C6 client there is a similar blocked message for the ftp job, blocked on nfs_flush, then the bad sequence number message I had seen before, and at that point the ftp_job hung.

  • This site is locked down like no other I have ever seen. You cannot bring anything into the site – no computers, no media, no phone. You have to empty your pockets and go through an airport type naked body scan.

    This is my client’s client, and even if I could circumvent their policy I would not do that. They have a zero tolerance policy and if you are caught violating it you are banned for life from the company. And that would not make my client happy.

    These are all good debugging techniques, and I have tried some of them, but I think the issue is load related. There are 50 external machines ftp-ing to the C7 server, 24/7, thousands of files a day. And on the C6 client the script that processes them is running continuously. It will sometimes run for 7 hours then hang, but it has run for as long as 3 days before hanging. I have never been able to reproduce the errors/hanging situation manually.

    And again, this is only at this site. We have the same software deployed at 10 different sites all doing the same thing, and it all works fine at all of those.

  • Well I spoke too soon. The importer (the one that was initially hanging that I came here to fix) hung up after running 20 hours. There were no NFS errors or messages on either the client or the server. When I restarted it, it hung after 1 minute. I restarted it again and it hung after 20 seconds. After that, whenever I restarted it, it hung immediately. Still no NFS errors or messages. I tried running the process on the server and it worked fine. So I have to believe this is related to nobarrier. Tomorrow I will try removing that setting, but I
    am no closer to solving this and I have to leave Japan Saturday :-(

    The bad disk still has not been replaced – that is supposed to happen tomorrow, but I won’t have enough time after that to draw any conclusions.

  • Are any of these systems using jumbo frames? Check the MTU in the output of “ip link show” on every system, server and client. If any device doesn’t match the MTU of all of the others, that might cause the problem you’re describing. And if they all match, but they’re larger than 1500, a switch that doesn’t support jumbo frames would also cause the problem you’re describing.
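
    For example (replace other-host with the peer’s address):

    ip link show | grep mtu             # compare the MTU on every interface involved
    ping -M do -s 8972 other-host       # if using 9000-byte jumbo frames, verify the path passes them unfragmented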

  • I’ve seen behavior like that with disks that are on their way out… basically the system wants to read a block of data, and the disk doesn’t read it successfully, so it keeps trying. The kind of disk, what kind of controller it’s behind, raid level, and various other settings can all impact this phenomenon, and also how much detail you can see about it. You already know you have one bad disk, so that’s kind of an open wound that may or may not be contributing to your bigger, unsolved problem.

    So that makes me think, you can also do some basic disk benchmarking. iozone and bonnie++ are nice, but I’m guessing they’re not installed and you don’t have a means to install them. But you can use “dd” to do some basic benchmarking, and that’s all but guaranteed to be installed. Similar to network benchmarking, you can do something like:
    time dd if=/dev/zero of=/tmp/testfile.dat bs=1G count=256

    That will generate a 256 GB file. Adjust “bs” and “count” to whatever makes sense. General rule of thumb is you want the target file to be at least 2x the amount of RAM in the system to avoid cache effects from skewing your results. Bigger is even better if you have the space, as it increases the odds of hitting the “bad” part of the disk
    (if indeed that’s the source of your problem).
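
    You can also do a rough read test against the same file afterwards, dropping the page cache first (as root) so you measure the disk rather than RAM:

    echo 3 > /proc/sys/vm/drop_caches
    time dd if=/tmp/testfile.dat of=/dev/null bs=1M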

    Do that on C6, C7, and if you can a similar machine as a “control”
    box, it would be ideal. Again, we’re looking for outliers, hang-ups, timeouts, etc.

    +1 to Gordon’s suggestion to sanity check MTU sizes.

    Another random possibility… By somewhat funny coincidence, we have some servers in Japan as well, and were recently banging our heads against the wall with some weird networking issues. The remote hands we had helping us (none of our staff was on site) claimed one or more fiber cables were dusty, enough that it was affecting light levels. They cleaned the cables and the problems went away. Anyway, if you have access to the switches, you should be able to check that light levels are within spec.

    If you have the ability to take these systems offline temporarily, you can also run “fsck” (file system check) on the C6 and C7 file systems. IIRC, ext4 can do a very basic kind of check on a mounted filesystem. But a deeper/more comprehensive scan requires the FS to be unmounted. Not sure what the rules are for xfs. But C6 uses ext4 by default so you could probably at least run the basic check on that without taking the system offline.


  • Don’t bother with fsck on XFS filesystems. From the man page
    [fsck.xfs(8)]: “XFS is a journaling filesystem and performs recovery at mount(8) time if necessary, so fsck.xfs simply exits with a zero exit status”. If you need a deeper examination use xfs_repair(8) and note that: “the filesystem to be repaired must be unmounted, otherwise, the resulting filesystem may be inconsistent or corrupt” (from the man page).
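
    For example (device name and mount point are placeholders), a read-only check would be something like:

    umount /data
    xfs_repair -n /dev/sdb1    # -n: no-modify mode, just report problems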


  • Matt Garman wrote:


    I just had a truly unpleasant thought, speaking of disks. Years ago, we tried some WD Green drives in our servers, and that was a disaster. In somewhere between days and weeks, the drives would go offline. I finally found out what happened: consumer-grade drives are intended for desktops, and the TLER – how long the drive keeps trying to read or write to a sector before giving up, marking the sector bad, and going somewhere else
    – is two *minutes*. Our servers were expecting the TLER to be 7 *seconds*
    or under. Any chance the client cheaped out with any of the drives?

    mark

  • Just replaced the disk but I am leaving tomorrow so it was decided that we will run the process on the C7 server, at least for now. I
    will probably have to come back here early next year and revisit this. We are thinking of building a new system back in NY and shipping it here and swapping them out.

    No switches – it’s an internal, virtual network between the server and the virtualized client.

    The systems were rebooted 2 days ago and fsck was run at boot time, and was clean.

    Thanks for all your help with trying to solve this. It was a very frustrating time for me – I was here for 10 days and did not really discover anything about the problem. Hopefully running the process on the server will keep it going and keep the customer happy.

    I will update this thread when/if I revisit it.