Debugging a Slow SSH Session Start // Jack Alder

I’ve experienced very slow session loads when connecting to a server over SSH. Most recently, today, I became frustrated with a local server that suddenly began stalling for about 30 seconds before my session prompt would return.

Typically, this is some kind of DNS timeout, because it’s always DNS, right?

Here are the steps I took to troubleshoot this annoyance.

1. Is it only over SSH, or does it affect `su` sessions as well?

I wanted to determine if the issue was something in my bash profile. The verbose output of my ssh connections surfaced that the stall occurred after the connection and auth was accepted.

Command ssh -vvv user@host was helpful.

So, I initiated another session with su - user, which was also slow. Interestingly, su user did not have the 30 second timeout issue. This might suggest profile problems but we can’t be certain yet.

2. If it’s slow for local `su`, let’s strace it

Using strace for testing su is much simpler than trying to strace sshd.

Command sudo strace -tt -o /tmp/strace.txt -f su -l user was helpful.

Reviewing the /tmp/strace.txt, we find the time skip and work backwards from there.

The time skip:

1447631 16:42:06.364102 ppoll([{fd=3, events=POLLIN}], 1, {tv_sec=24, tv_nsec=999856000}, NULL, 8) = 0 (Timeout)
1447631 16:42:31.388143 ioctl(1, TCGETS, 0x7fffab230c30) = -1 ENOTTY (Inappropriate ioctl for device)

Spoiler, the ioctl message is a red herring, ignore that.

A few lines down we see:

1447630 16:42:31.388896 <... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 1447631
1447630 16:42:31.388914 rt_sigaction(SIGINT, {sa_handler=0x56264518aef0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f783e34c9a0}, {sa_handler=0x562645166510, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f783e34c9a0}, 8) = 0
1447630 16:42:31.388941 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
1447630 16:42:31.388958 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1447631, si_uid=1000, si_status=1, si_utime=0, si_stime=0} ---
1447630 16:42:31.388970 wait4(-1, 0x7ffdbd5e3510, WNOHANG, NULL) = -1 ECHILD (No child processes)
1447630 16:42:31.388986 rt_sigreturn({mask=[]}) = 0
1447630 16:42:31.389041 exit_group(1)   = ?
1447630 16:42:31.389132 +++ exited with 1 +++
1447628 16:42:31.389170 <... read resumed>"", 4096) = 0
1447628 16:42:31.389229 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1447630, si_uid=1000, si_status=1, si_utime=0, si_stime=0} ---
1447628 16:42:31.389295 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], WNOHANG, NULL) = 1447630
1447628 16:42:31.389317 wait4(-1, 0x7ffdbd5e2990, WNOHANG, NULL) = -1 ECHILD (No child processes)

These lines are more interesting and seem relevant, what was wait4 waiting for?

Let’s walk backwards search for wait4:

1447630 16:42:06.354198 wait4(-1,  <unfinished ...>
...(truncated output)
1447631 16:42:06.354726 execve("/usr/bin/hostnamectl", ["/usr/bin/hostnamectl", "--transient"], 0x562646768350 /* 9 vars */) = 0

Oh, there’s a call to hostnamectl --transient opened during this window.

Let’s try it ourselves:

$ hostnamectl --transient
Could not get property: Connection timed out

Checking with systemd:

$ sudo systemctl status systemd-hostnamed
[sudo] password for user: 
○ systemd-hostnamed.service - Hostname Service
     Loaded: loaded (/usr/lib/systemd/system/systemd-hostnamed.service; static)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: inactive (dead)
       Docs: man:systemd-hostnamed.service(8)
             man:hostname(5)
             man:machine-info(5)
             man:org.freedesktop.hostname1(5)

A restart is all it took to resolve the issue, for about a 30 seconds:

user@host:~$ sudo systemctl status systemd-hostnamed
○ systemd-hostnamed.service - Hostname Service
     Loaded: loaded (/usr/lib/systemd/system/systemd-hostnamed.service; static)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: inactive (dead)
       Docs: man:systemd-hostnamed.service(8)
             man:hostname(5)
             man:machine-info(5)
             man:org.freedesktop.hostname1(5)

Apr 18 21:01:28 host systemd[1]: Starting systemd-hostnamed.service - Hostname Service...
Apr 18 21:01:28 host systemd[1]: Started systemd-hostnamed.service - Hostname Service.
Apr 18 21:02:02 host systemd[1]: systemd-hostnamed.service: Deactivated successfully.
user@host:~$ hostnamectl
Failed to query system properties: Connection timed out

Some light research suggested that this is by design and I need to look further upstream at dbus services.

The systemctl status dbus did contain an error:

ERROR service_watch_jobs @ ../src/launch/service.c +297: Transport endpoint is not connected

I restarted sudo systemctl restart dbus and the issue was NOT resolved.

The hostnamectl call returned results immediately. Unfortunately something else is now lurking in the session start.

I restarted sudo systemctl restart systemd-logind and with a fresh dbus and systemd-logind, the issue was finally gone.

Thanks for following along on this one.

1. Is it only over SSH, or does it affect su sessions as well?

2. If it’s slow for local su, let’s strace it

1. Is it only over SSH, or does it affect `su` sessions as well?

2. If it’s slow for local `su`, let’s strace it