NFS/RDMA Connection Closed


Hi, we are having a problem with NFS over the RDMA protocol on our FDR10 InfiniBand network. I previously wrote to the NFS mailing list about this, so you may find our discussion there. I have since taken some load off the server by converting a backup job that used NFS to transfer over SSH instead, but we are still having critical problems: NFS clients lose their connection to the server, hang, and need a reboot. I wanted to check in here before filing a bug with CentOS.
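
(For context, the backup that used to write to an NFS mount now pushes over SSH with rsync; roughly like the following, where the host and paths are made up for illustration:)

# nightly backup, now over SSH instead of an NFS mount (host and paths below are illustrative)
rsync -aH --delete -e ssh /export/backup/ backuphost:/backups/pac/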

Our setup is a cluster with one head node (the NFS server) and 9 compute nodes (NFS clients). All the machines run CentOS 6.9 with kernel 2.6.32-696.30.1.el6.x86_64 and use the “inbox” CentOS RDMA implementation/drivers (not Mellanox OFED). (We also have other NFS clients, but they use 1GbE for their NFS connection; while they will still hang with messages like “NFS server not responding, retrying” or “timed out”, they eventually recover and don’t need a reboot.)
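
For reference, the compute nodes mount the export with NFS/RDMA while the other clients use plain TCP over 1GbE; the fstab entries on the clients look roughly like this (the export path and exact option set here are illustrative, not copied from our config; 20049 is the standard NFS/RDMA port that appears in the logs below):

# compute nodes n001-n009: NFSv3 over RDMA
10.10.11.100:/export/home  /home  nfs  proto=rdma,port=20049,vers=3  0 0
# other clients: same export over 1GbE/TCP
10.10.10.100:/export/home  /home  nfs  proto=tcp,vers=3  0 0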

On the server (which is named pac) I will see messages like this:
Jul 30 18:19:38 pac kernel: svcrdma: failed to send reply chunks, rc=-5
Jul 30 18:19:38 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:03:05 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:09:06 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:16:09 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:23:31 pac kernel: svcrdma: Error -107 posting RDMA_READ
Jul 31 15:53:55 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 16:09:19 pac kernel: svcrdma: failed to send reply chunks, rc=-5
Jul 31 16:09:19 pac kernel: svcrdma: failed to send reply chunks, rc=-5

Previously I had also seen messages like “Jul 11 21:09:56 pac kernel: nfsd: peername failed (err 107)!”, but I have not seen that one in this latest hangup.
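
For reference when filing the bug, the numbers in these messages are kernel errno values (negated in the svcrdma lines): 5 is EIO and 107 is ENOTCONN; the -103 that shows up in the client logs below is ECONNABORTED. They can be decoded on any of the machines like this:

# decode the errno values seen in the server and client logs
python -c 'import os; print os.strerror(5)'     # EIO: "Input/output error"
python -c 'import os; print os.strerror(107)'   # ENOTCONN: "Transport endpoint is not connected"
python -c 'import os; print os.strerror(103)'   # ECONNABORTED: "Software caused connection abort"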

And on the clients (named n001-n009) I will see messages like this:
Jul 30 18:17:26 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810674024c0 (stale): WR flushed
Jul 30 18:17:26 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff88106638a640 (stale): WR flushed
Jul 30 18:19:26 n001 kernel: nfs: server 10.10.11.100 not responding, still trying
Jul 30 18:19:36 n001 kernel: nfs: server 10.10.10.100 not responding, timed out
Jul 30 18:19:38 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jul 30 18:19:38 n001 kernel: nfs: server 10.10.11.100 OK
Jul 31 14:42:08 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810671f02c0 (stale): WR flushed
Jul 31 14:42:08 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810677bda40 (stale): WR flushed
Jul 31 14:42:08 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810677bd940 (stale): WR flushed
Jul 31 14:42:08 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810671f0240 (stale): WR flushed
Jul 31 14:43:35 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jul 31 15:01:53 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff881065133140 (stale): WR flushed
Jul 31 15:01:53 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810666e3f00 (stale): WR flushed
Jul 31 15:01:53 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff881063ea0dc0 (stale): WR flushed
Jul 31 15:01:53 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810677bdb40 (stale): WR flushed
Jul 31 15:03:05 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jul 31 15:07:07 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff881060e59d40 (stale): WR flushed
Jul 31 15:07:07 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810677efac0 (stale): WR flushed
Jul 31 15:07:07 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff88106638a640 (stale): WR flushed
Jul 31 15:07:07 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810671f03c0 (stale): WR flushed
Jul 31 15:09:06 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jul 31 15:16:09 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 closed (-103)
Jul 31 15:53:32 n001 kernel: nfs: server 10.10.10.100 not responding, timed out
Jul 31 16:08:56 n001 kernel: nfs: server 10.10.10.100 not responding, timed out

Jul 30 18:17:26 n002 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff881064461500 (stale): WR flushed
Jul 30 18:17:26 n002 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810604b2600 (stale): WR flushed
Jul 30 18:19:26 n002 kernel: nfs: server 10.10.11.100 not responding, still trying
Jul 30 18:19:38 n002 kernel: rpcrdma: connection to 10.10.11.100:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jul 30 18:19:38 n002 kernel: nfs: server 10.10.11.100 OK
Jul 31 14:43:35 n002 kernel: rpcrdma: connection to 10.10.11.100:20049 closed (-103)
Jul 31 16:08:56 n002 kernel: nfs: server 10.10.10.100 not responding, timed out

Similar messages show up on the other clients, n003-n009. After these messages appear, the clients' load climbs steadily (visible in Ganglia), presumably because processes are blocked waiting for the NFS mount to come back. The clients are no longer reachable over SSH, and root cannot log in on the console via the IPMI web applet either (it just hangs after the password is entered; a prompt may eventually appear, but the load is so high the session is unusable), so they have to be rebooted through the IPMI interface.

Here is /etc/fstab on the server, UUID
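
For reference, the server-side RDMA listener is set up the standard way described in the kernel's nfs-rdma documentation; roughly the following, with a hypothetical export path:

# load the server-side RPC/RDMA transport and add an RDMA listener on port 20049
modprobe svcrdma
echo rdma 20049 > /proc/fs/nfsd/portlist
# example /etc/exports entry; "insecure" is needed because NFS/RDMA clients
# do not connect from a reserved port (path and options are illustrative)
/export/home  10.10.0.0/16(rw,async,insecure,no_root_squash)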

One thought on “NFS/RDMA Connection Closed”

  • Hi, I also forgot to add the following information, which was discussed on the NFS mailing list with Chuck Lever and leads us to believe this is a software bug in the kernel, not necessarily server overload.

    On the NFS server, we also mount some other NFS shares from other NFS
    servers, over 1GbE:
    150.x.x.116:/wing on /wing type nfs (rw,addr=150.x.x.116)
    10.10.10.201:/opt/ftproot on /opt/ftproot type nfs
    (rw,vers=4,addr=10.10.10.201,clientaddr=10.10.10.100)
    150.x.x.202:/archive on /archive type nfs
    (rw,vers=4,addr=150.x.x.202,clientaddr=1x8.x.x.2)

    This hangup/bug seems to occur when we are reading from or writing to these other shares on the NFS server while the server is also busy processing our cluster work over the RDMA exports. There used to be two other NFS mounts that backups were written to, scheduled every night at 8PM, and I noticed that the RDMA errors in my original post all showed up shortly after 8PM. So we removed those NFS mounts and converted the backup to transfer over SSH instead. The RDMA errors no longer appear around the 8PM backup run, but they still show up when we read from or write to the other NFS mounts listed above, which we still need.

    It seems we should be able to use these different mounts and exports without issue, leading us to believe there is a software bug somewhere.

    Are there any other suggested solutions to this problem? Perhaps some system, network, and/or filesystem tuning? Any comments on adding the
    “inode64,nobarrier” XFS mount options? Is there any extra information I can gather to help with a bug report, such as debug output?
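
    In case it helps, the extra data I could collect would be along these lines (standard rpcdebug and SysRq usage; which flags are most useful for this bug is part of my question):

    # on the server: turn on verbose sunrpc/nfsd debug logging (goes to syslog/dmesg)
    rpcdebug -m rpc -s all
    rpcdebug -m nfsd -s all
    # on a hung client, if a console still responds: dump blocked tasks
    echo w > /proc/sysrq-trigger
    # turn the extra logging back off afterwards
    rpcdebug -m rpc -c all
    rpcdebug -m nfsd -c all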

    Thanks