SP metadata cache keeps growing

Peter Schober peter.schober at univie.ac.at
Thu Feb 27 14:21:43 EST 2020


* Cantor, Scott <cantor.2 at osu.edu> [2020-02-26 22:43]:
> My guess is systemd is actively killing the process perhaps? The old
> init.d wait didn't really have that effect.

Yup, that's what's happening. On a clean CentOS 8 test system:

# systemctl start shibd
Job for shibd.service failed because a timeout was exceeded.
See "systemctl status shibd.service" and "journalctl -xe" for details.

# ps aux | fgrep shibd
shibd     2136  0.0 88.0 1227988 461504 ?      Rsl  18:15   5:37 /usr/sbin/shibd -f -F

# systemctl status shibd
* shibd.service - Shibboleth Service Provider Daemon
   Loaded: loaded (/usr/lib/systemd/system/shibd.service; disabled; vendor preset: disabled)
   Active: deactivating (stop-sigterm) (Result: timeout)
     Docs: https://wiki.shibboleth.net/confluence/display/SP3/Home
 Main PID: 2136 (shibd)
    Tasks: 4 (limit: 26213)
   Memory: 450.9M
   CGroup: /system.slice/shibd.service
           └─2136 /usr/sbin/shibd -f -F

Feb 27 18:15:29 centos8lxc systemd[1]: shibd.service: Service RestartSec=30s expired, scheduling restart.
Feb 27 18:15:29 centos8lxc systemd[1]: shibd.service: Scheduled restart job, restart counter is at 2.
Feb 27 18:15:29 centos8lxc systemd[1]: Stopped Shibboleth Service Provider Daemon.
Feb 27 18:15:29 centos8lxc systemd[1]: Starting Shibboleth Service Provider Daemon...
Feb 27 18:20:29 centos8lxc systemd[1]: shibd.service: Start operation timed out. Terminating.

Note the "Terminating" in the last log line and the "restart counter"
in the second. Not sure what the maximum number of restart attempts is,
but I stopped my tests at 6...

# systemctl show shibd | fgrep -i restarts
NRestarts=6

Even though it can sometimes seem to be done/killed for good ...

# systemctl status shibd
* shibd.service - Shibboleth Service Provider Daemon
   Loaded: loaded (/usr/lib/systemd/system/shibd.service; disabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: timeout) since Thu 2020-02-27 18:28:00 UTC; 5s ago
     Docs: https://wiki.shibboleth.net/confluence/display/SP3/Home
  Process: 2154 ExecStart=/usr/sbin/shibd -f -F (code=killed, signal=KILL)
 Main PID: 2154 (code=killed, signal=KILL)

Feb 27 18:28:00 centos8lxc systemd[1]: shibd.service: Main process exited, code=killed, status=9/KILL
Feb 27 18:28:00 centos8lxc systemd[1]: shibd.service: Failed with result 'timeout'.
Feb 27 18:28:00 centos8lxc systemd[1]: Failed to start Shibboleth Service Provider Daemon.

This can seemingly go on forever, ultimately filling the file system
with identical copies of the cached metadata.

# ls -l /var/cache/shibboleth/
-rw-r--r-- 1 shibd shibd 54404390 Feb 27 18:28 aconet-metadata.xml.3ac6
-rw-r--r-- 1 shibd shibd 54404390 Feb 27 18:09 aconet-metadata.xml.5711
-rw-r--r-- 1 shibd shibd 54404390 Feb 27 18:22 aconet-metadata.xml.6cb5
-rw-r--r-- 1 shibd shibd 54404390 Feb 27 18:15 aconet-metadata.xml.71f2
-rw-r--r-- 1 shibd shibd 54404390 Feb 27 18:02 aconet-metadata.xml.a620
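
The matching sizes only suggest that the copies are identical; comparing
checksums would confirm it. A minimal sketch in Python, where the function
name group_identical_files is made up for illustration and is not part of
anything shibd ships:

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_identical_files(directory, pattern="*"):
    """Return lists of paths under `directory` whose contents are byte-identical.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large metadata files are not
            # loaded into memory in one piece.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        groups[digest.hexdigest()].append(path)
    # Only groups with more than one member are duplicates.
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling group_identical_files("/var/cache/shibboleth", "aconet-metadata.xml.*")
would return one group per set of identical backing files, so a single group
containing every file means each restart wrote the same content again.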


As for requiring a manual override by the deployer:
I note that shibd uses systemd's notify protocol:

# systemctl show shibd | fgrep -i type
Type=notify

And that the systemd.service docs[1] mention that services can use that
channel to extend the timeout as long as needed by re-sending messages:

> If a service of Type=notify sends "EXTEND_TIMEOUT_USEC=…", this may
> cause the start time to be extended beyond TimeoutStartSec=. The first
> receipt of this message must occur before TimeoutStartSec= is
> exceeded, and once the start time has extended beyond TimeoutStartSec=,
> the service manager will allow the service to continue to start,
> provided the service repeats "EXTEND_TIMEOUT_USEC=…" within the
> interval specified until the service startup status is finished by
> "READY=1". (see sd_notify(3)).

Is that something you'd consider adding while xerces takes its time?
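
To illustrate the mechanism only: the notify protocol is plain datagrams
sent to the Unix socket named in $NOTIFY_SOCKET, so a keepalive amounts to
re-sending EXTEND_TIMEOUT_USEC= on a timer until startup finishes. A rough
sketch in Python, not shibd code (shibd is C++ and would presumably call
sd_notify(3) instead). The helper names, the 60s extension and the resend
at half that interval are all assumptions of mine:

```python
import os
import socket
import threading


def sd_notify(message, address=None):
    """Send one datagram to the systemd notification socket.

    Returns False when no socket is configured (not started by systemd).
    """
    address = address or os.environ.get("NOTIFY_SOCKET")
    if not address:
        return False
    if address.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        address = "\0" + address[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), address)
    return True


def run_with_extended_timeout(slow_startup, extend_usec=60_000_000, address=None):
    """Run slow_startup() while repeating EXTEND_TIMEOUT_USEC, then send READY=1.

    Hypothetical wrapper; extend_usec is the extension requested per message.
    """
    done = threading.Event()

    def keepalive():
        # Re-send at half the requested interval so each extension
        # arrives before the previous one runs out.
        while True:
            sd_notify(f"EXTEND_TIMEOUT_USEC={extend_usec}", address)
            if done.wait(extend_usec / 2_000_000):
                break

    worker = threading.Thread(target=keepalive, daemon=True)
    worker.start()
    try:
        result = slow_startup()
    finally:
        done.set()
        worker.join()
    # Only reached when slow_startup() returned without raising.
    sd_notify("READY=1", address)
    return result
```

Per the quoted docs, the first message has to arrive before TimeoutStartSec=
is exceeded, which is why the sketch sends one immediately rather than after
the first interval.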

In the meantime I've amended the systemd wiki page a bit:
https://wiki.shibboleth.net/confluence/display/SP3/LinuxSystemd

Thanks,
-peter

[1] https://www.freedesktop.org/software/systemd/man/systemd.service.html#TimeoutStartSec=

