C6 Server Responding Extremely Slow On SSH Interactive


I have a C6 server acting as a kvm-host.

When I connect with SSH, the interactive console is extremely slow and hangs for minutes at a time. Establishing the connection itself is not the problem.

If I use: ssh root@host "whatever" I get an immediate response, even while interactive consoles opened with SSH are hanging.

Linux […] 2.6.32-504.3.3.el6.x86_64 #1 SMP Wed Dec 17 01:55:02 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

             total  used  free  shared  buffers  cached
Mem:            47    35    11       0        0       0
-/+ buffers/cache:   35    11
Swap:            7     0     7

Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg-lv_root
50G 6,4G 41G 14% /
tmpfs 24G 0 24G 0% /dev/shm
/dev/sda1 477M 123M 329M 28% /boot

13:33:34 up 1 day, 18:30, 2 users, load average: 3.39, 2.53, 2.36

(it’s an 8-core)

Nothing particular in log/messages.

The VMs are running normally and are not showing the same behaviour.

Can anybody give me a pointer?

Thanks Patrick

17 thoughts on - C6 Server Responding Extremely Slow On SSH Interactive

  • Sorry, is it hanging during the session, or while attempting to establish a new one? If the latter, it may be DNS, and ssh -v may help. The former is weird; I don’t think I’ve ever seen it.

    Marcelo
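    If the hang were at connection time, a common culprit on CentOS 6 is sshd doing a reverse DNS lookup on the connecting client. A sketch of how one might rule that out (stock C6 paths; the server-side part runs as root):

```shell
# Client side: verbose output shows exactly which step stalls
ssh -v root@host

# Server side: stop sshd from doing reverse DNS lookups, then reload
echo 'UseDNS no' >> /etc/ssh/sshd_config
service sshd reload
```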

  • Op 28-01-15 om 17:20 schreef Marcelo Ricardo Leitner:
    Marcelo,

    It hangs during the session. Once I’m logged in and begin to type, it displays 3-5 chars and then hangs for up to 15 minutes, then a few more chars, another wait, and so on. I checked my resolv.conf and added ‘options single-request-reopen’, though I don’t know if that is helping.

    Yes, it is weird; even more so since individual commands sent with SSH give an immediate response.

    Thanks Patrick

  • Hi Patrick, have you ever tried to find out on which side the hang is: on the client’s or on the server’s, using tcpumg or the like?
    That might help a bit further on.

    suomi
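    Assuming a packet capture (e.g. tcpdump) is what was meant here, a sketch of how one could tell which side stalls; the interface name is an assumption:

```shell
# On the server, during a hang (and similarly on the client):
tcpdump -i eth0 -nn -ttt port 22
# -ttt prefixes each packet with the gap since the previous one:
# if a keystroke arrives promptly but its echo leaves seconds later,
# the stall is server-side; if the keystroke itself arrives late,
# look at the network or the client first.
```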

  • Op 28-01-15 om 20:17 schreef Gordon Messmer:

    ARPING 192.168.1.15 from 0.0.0.0 br0
    Unicast reply from 192.168.1.15 [AC:16:2D:72:67:D4] 0.723ms
    Sent 1 probes (1 broadcast(s))
    Received 1 response(s)

    Thanks anyway Patrick

  • I’m not sure what you mean by “thanks anyway”.

    You got a response. There’s an IPv4 conflict on your network. That’s why you’re seeing those delays. If there’s no conflict, you should see 0 responses.
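    In other words, a duplicate-address probe run from the machine that owns the IP (interface and address taken from the example above) would look like:

```shell
# -D = duplicate address detection: probe without claiming the address
arping -D -I br0 -c 2 192.168.1.15
# "Received 0 response(s)" (exit status 0) means no other host answered,
# i.e. no conflict; any reply means another NIC also claims the address.
```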

  • Op 29-01-15 om 00:00 schreef Gordon Messmer:

    Gordon,

    I’m sorry, I misunderstood you (and arping -D).
    This was the result of arping on another host; I thought I should see 2 responses in case of an IP conflict.

    Arping on the troublesome server gives 0 responses.

    I just tried with a physical console on that server and there I got the same unresponsive behaviour. Does this rule out network related problems?

    Mark (m.roth) suggested the vms eating up the video bus. (2 vms with an Oracle database)
    But I’m not sure how I could test that.

    Patrick

  • Op 28-01-15 om 17:51 schreef anax:
    I’m not sure what you mean by tcpumg. But after testing with a physical console I’m experiencing the same problem, so I guess it’s the server.

    Thanks

  • Well, that’s a different story, then. :)

    I haven’t seen delays anywhere near that long before, even with heavy swapping. But I guess I’d look at that sort of thing first.

    Run “iostat -x 2” and see if your disks are being fully utilized during the pauses. Run “top” and see if there’s anything useful there. Check swap use with “free”. Try decreasing swappiness with “echo 10 > /proc/sys/vm/swappiness”.
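    Spelling that last suggestion out (vm.swappiness is a standard Linux sysctl; the write needs root and resets at reboot):

```shell
# Show the current value (CentOS 6 defaults to 60)
cat /proc/sys/vm/swappiness

# As root, make the kernel prefer reclaiming page cache over swapping
# process memory (takes effect immediately, lost at reboot):
#   echo 10 > /proc/sys/vm/swappiness
# To persist across reboots, set "vm.swappiness = 10" in /etc/sysctl.conf.
```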

  • Op 29-01-15 om 21:21 schreef Gordon Messmer:

    iostat (random sample):

    avg-cpu:  %user  %nice %system %iowait %steal  %idle
               3,77   0,00    1,45    0,00   0,00  94,78

    Device: rrqm/s wrqm/s  r/s   w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
    sda       0,00   0,50 0,00 11,00   0,00 136,00    12,36     0,00  0,00  0,00  0,00
    sdb       0,00   0,00 0,00 11,50   0,00 148,00    12,87     0,00  0,09  0,09  0,10
    sdc       0,00   0,00 0,00  0,00   0,00   0,00     0,00     0,00  0,00  0,00  0,00
    dm-0      0,00   0,00 0,00  4,00   0,00  32,00     8,00     0,00  0,00  0,00  0,00
    dm-1      0,00   0,00 0,00  0,00   0,00   0,00     0,00     0,00  0,00  0,00  0,00
    dm-2      0,00   0,00 0,00  0,00   0,00   0,00     0,00     0,00  0,00  0,00  0,00
    dm-3      0,00   0,00 0,00 11,50   0,00 148,00    12,87     0,00  0,13  0,13  0,15
    dm-4      0,00   0,00 0,00  0,00   0,00   0,00     0,00     0,00  0,00  0,00  0,00
    dm-5      0,00   0,00 0,00  7,50   0,00 104,00    13,87     0,00  0,07  0,07  0,05

    atop:

    ATOP - 2015/01/30 10:18:14 --------- 10s elapsed
    PRC | sys 3.87s | user 14.93s | #proc 197 | #zombie 0 | #exit 0 |
    CPU | sys 30% | user 119% | irq 1% | idle 533% | wait 0% |
    cpu | sys 2% | user 21% | irq 0% | idle 56% | cpu000 w 0% |
    cpu | sys 3% | user 19% | irq 0% | idle 59% | cpu001 w 0% |
    cpu | sys 8% | user 15% | irq 0% | idle 62% | cpu003 w 0% |
    cpu | sys 3% | user 13% | irq 0% | idle 73% | cpu002 w 0% |
    cpu | sys 3% | user 14% | irq 0% | idle 70% | cpu006 w 0% |
    cpu | sys 4% | user 15% | irq 0% | idle 66% | cpu005 w 0% |
    cpu | sys 2% | user 11% | irq 0% | idle 77% | cpu007 w 0% |
    cpu | sys 5% | user 11% | irq 0% | idle 73% | cpu004 w 0% |
    CPL | avg1 1.92 | avg5 1.97 | avg15 1.61 | csw 229508 | intr 191786 |
    MEM | tot 47.1G | free 15.9G | cache 519.3M | buff 109.3M | slab 353.3M |
    SWP | tot 7.8G | free 7.3G | | vmcom 31.8G | vmlim 31.3G |
    LVM | g_15k-lv_15k | busy 0% | read 1 | write 98 | avio 0.15 ms |
    LVM | to-lv_oracle | busy 0% | read 0 | write 66 | avio 0.06 ms |
    LVM | v_oracletest | busy 0% | read 0 | write 79 | avio 0.05 ms |
    LVM | uito-lv_root | busy 0% | read 0 | write 1 | avio 3.00 ms |
    DSK | sdb | busy 0% | read 1 | write 98 | avio 0.16 ms |
    DSK | sda | busy 0% | read 0 | write 146 | avio 0.08 ms |
    NET | transport | tcpi 12 | tcpo 12 | udpi 0 | udpo 0 |
    NET | network | ipi 13 | ipo 12 | ipfrw 0 | deliv 12 |
    NET | vnet0 8% | pcki 2273 | pcko 2581 | si 850 Kbps | so 458 Kbps |
    NET | vnet1 4% | pcki 2186 | pcko 2075 | si 391 Kbps | so 422 Kbps |
    NET | eth0 0% | pcki 1330 | pcko 1432 | si 159 Kbps | so 537 Kbps |
    NET | br0 —- | pcki 43 | pcko 22 | si 1 Kbps | so 4 Kbps |

    PID SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPU CMD
    1960 2.37s 9.23s 0K 0K 8K 2520K — – S 101% qemu-kvm
    1990 0.69s 5.65s 0K 0K 0K 1196K — – S 55% qemu-kvm
    1975 0.50s 0.00s 0K 0K 0K 0K — – S 4% kvm-pit-wq
    2009 0.20s 0.00s 0K 0K 0K 0K — – S 2% kvm-pit-wq
    23321 0.05s 0.02s 0K 0K 0K 0K — – R 1% atop
    18384 0.05s 0.01s 0K 0K 0K 0K — – S 1% atop
    1719 0.00s 0.01s 0K 0K 0K 0K — – S 0% hpasmlited
    1746 0.00s 0.01s 0K 0K 0K 0K — – S 0% hp-asrd
    35 0.01s 0.00s 0K 0K 0K 0K — – D 0% events/0
    10707 0.00s 0.00s 0K 0K 0K 0K — – S 0% arping
    10740 0.00s 0.00s 0K 0K 0K 0K — – S 0% arping
    58 0.00s 0.00s 0K 0K 0K 0K — – S 0% kblockd/0
    18425 0.00s 0.00s 0K 0K 0K 0K — – S 0% flush-253:0

    free:

                 total   used   free  shared  buffers  cached
    Mem:         48218  31895  16323       0      108     519
    -/+ buffers/cache:  31267  16951
    Swap:         7951    476   7475

    But I had the same pauses when free gave zero swap.

    If swap is the problem: would it matter whether a command is run with SSH (ssh @ “command”) or in a shell?

    When running atop in a shell I observed pauses between screen updates of well over 10 seconds, yet atop advanced its displayed time by only 10 seconds per update, so it was drifting behind, while a date command issued at the same time gave the correct time.

    So it seems like the screens are buffered and are being displayed with a delay.

  • That’s an unusually small amount of ‘cached’… I usually see the disk cache at 30-50% of total memory. Does this system not do much disk IO?

  • Op 30-01-15 om 10:29 schreef John R Pierce:

    It’s a kvm-host with LVM; the VMs all have their own LVs (some on a different PV). Would that explain the small cache?

  • “Random” is difficult to evaluate. Is that representative? Are sda, sdb, and sdc typically less than 1% utilized? Or are there large utilization values right after a hang?

    Let’s assume it’s not, but I would say “no” to the question. I’d expect the same delays regardless, if the system were swapping heavily.

    That’s really weird.

    Does the time displayed by “atop” eventually catch up?

    Does the problem persist across reboots?

    Is this system running ntpd?

    Does the problem persist if you turn ntpd off and reboot?

  • Op 30-01-15 om 19:40 schreef Gordon Messmer:
    All the output was from the same timeframe, captured during a hang from another shell.
    Not that I know of. But I gave up :-)
    Alas, one of the VMs is our production database; my next update/reboot window is next Saturday. I had the problem just before the last reboot (mid-January), but hadn’t closely monitored it afterwards. Before that, in December, I never experienced it. But it’s a server I tend to leave alone, so I’m rarely busy in a shell.
    Yes, I’ll check that next week.

  • Op 30-01-15 om 21:51 schreef Gordon Messmer:
    IIRC:
    before the problem: kernel.x86_64 0:2.6.32-504.el6
    problem occurred with: kernel.x86_64 0:2.6.32-504.1.3.el6
    current: kernel.x86_64 0:2.6.32-504.3.3.el6

    But since there is already a new kernel waiting, I’m not sure what to do. I think I’ll first upgrade & test. If my maintenance window permits I’ll test downgrading (but 3 updates…)

    BTW I’ve got 3 other kvm-servers without this behavior (but they are completely different machines, so there’s not much to compare).
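    For the downgrade test: CentOS 6 keeps older kernels installed side by side, so booting a previous one only means picking a different GRUB entry (GRUB legacy, as on a stock C6 install):

```shell
# List the kernel packages still installed
rpm -q kernel

# GRUB legacy numbers its entries from 0; see which is which:
grep ^title /boot/grub/grub.conf

# Then set "default=N" in /boot/grub/grub.conf to the entry to boot
# and reboot -- no package has to be removed for the test.
```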