We have a “compute cluster” of about 100 machines, each with a read-only NFS mount of a big NAS filer (a NetApp FAS6280). The jobs running on these boxes are analysis/simulation jobs that constantly read data off the NAS.
We recently upgraded all these machines from CentOS 5.7 to CentOS 6.5. We did a “piecemeal” upgrade, usually upgrading five or so machines at a time, every few days. We noticed improved performance on the CentOS 6 boxes. But as the number of CentOS 6 boxes increased, we actually saw performance on the CentOS 5 boxes decrease. By the time we had only a few CentOS 5 boxes left, they were performing so badly as to be effectively worthless.
What we observed in parallel with this upgrade process was that the read latency on our NetApp device skyrocketed. This in turn caused all compute jobs to actually run slower, as it seemed to move the bottleneck from the client servers’ OS to the NetApp. This is somewhat counter-intuitive: each CentOS 6 client performs faster on its own, but the cluster as a whole ends up with a net performance loss because the faster clients create a bottleneck on our centralized storage.
All indications are that CentOS 6 seems to be much more “aggressive” in how it does NFS reads. And likewise, CentOS 5 was very “polite”, to the point that it basically got starved out by the introduction of the 6.5 boxes.
What I’m looking for is a “deep dive” list of changes to the NFS implementation between CentOS 5 and CentOS 6. Or maybe this is due to a change in the TCP stack? Or maybe the scheduler? We’ve tried a lot of sysctl TCP tunings, various NFS mount options, anything that’s obviously different between 5 and 6… But so far we’ve been unable to find the “smoking gun” that causes the obvious behavior change between the two OS versions.
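To make the mount-option comparison concrete, here is a minimal sketch of how the effective NFS options on a CentOS 5 client and a CentOS 6 client could be diffed. It assumes the usual /proc/mounts field layout. The function names and the two sample lines are made up for illustration and were not captured from our cluster.

```python
def parse_nfs_mounts(mounts_text):
    """Return {mount_point: {option: value}} for NFS entries in /proc/mounts text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = {}
        for item in fields[3].split(","):
            key, sep, value = item.partition("=")
            # Flags such as "ro" have no value; record them as True.
            options[key] = value if sep else True
        result[fields[1]] = options
    return result


def diff_options(a, b):
    """Return {option: (value_in_a, value_in_b)} for options that differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Hypothetical sample lines, NOT taken from the real machines.
EL5_SAMPLE = "filer:/vol/data /data nfs ro,vers=3,rsize=32768,wsize=32768,proto=tcp 0 0"
EL6_SAMPLE = "filer:/vol/data /data nfs ro,vers=3,rsize=65536,wsize=65536,proto=tcp 0 0"

if __name__ == "__main__":
    old = parse_nfs_mounts(EL5_SAMPLE)["/data"]
    new = parse_nfs_mounts(EL6_SAMPLE)["/data"]
    for option, (before, after) in diff_options(old, new).items():
        print(f"{option}: {before} -> {after}")
```

On a real client the input would be the contents of /proc/mounts (for example `open("/proc/mounts").read()`), collected once on a CentOS 5 box and once on a CentOS 6 box. This only surfaces differences in negotiated mount options; it says nothing about changes inside the NFS client, the TCP stack or the scheduler.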
Just hoping that maybe someone else out there has seen something like this, or can point me to some detailed documentation that might clue me in on what to look for next.