
systemd and containers and chaos

We were fated to pretend


Operating system users and groups are a tool to minimize access. This protects me from myself. I try to make them as granular as possible.

useradd -m dojo
passwd dojo

There is a concept of a “system” user which has no login shell. Its only purpose is to run daemons. This actually fits most of my use cases, but I like being able to hop into a user’s shell, so I rarely use it.


I have tried a lot of tools to manage the ever-growing chaos of my homeservers. Something like ansible still feels like too much overhead for me. I have settled on just manually creating system users (e.g. bitcoin for running a full node) and keeping the complexity of the app as localized as possible (e.g. building an executable in the home directory). This utilizes the OS’s built-in user/group permissions model and makes it pretty easy to follow the principle of least privilege (e.g. only services that need access to Tor have it).

I use systemd to automate starting/stopping/triggering services and managing the service dependencies.


Systemd can trigger services on a schedule, like cron. The easiest way to do this is to create a .timer file for a service and then enable/start the timer. Do not enable the service itself; that confuses systemd.


Description=Run BOS node report



Description=Run report daily


sudo systemctl enable report.timer
sudo systemctl start report.timer
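
Only the Description= lines of the two units are shown above, so here is a minimal sketch of what such a service/timer pair could look like. The lightning user, the /usr/local/bin/bos-report path, and the write_report_units helper are all assumptions for illustration, not taken from the original units.

```shell
# Write a hypothetical report.service / report.timer pair into a directory.
write_report_units() {
    dir="$1"
    mkdir -p "$dir"

    cat > "$dir/report.service" <<'EOF'
[Unit]
Description=Run BOS node report

[Service]
Type=oneshot
# Hypothetical user and script path; substitute your own.
User=lightning
ExecStart=/usr/local/bin/bos-report
EOF

    cat > "$dir/report.timer" <<'EOF'
[Unit]
Description=Run report daily

[Timer]
OnCalendar=daily
# Catch up on a run that was missed while the timer was inactive.
Persistent=true

[Install]
WantedBy=timers.target
EOF
}
```

On a real box this would be something like write_report_units /etc/systemd/system as root, followed by systemctl daemon-reload. Note that only the timer carries an [Install] section, which matches the advice to enable the timer and not the service.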

Path triggers

Another helpful trigger is when a path changes. Think backing up a file every time it is modified. Like timers, only enable the .path and not the service itself.


Description=Backup LND channels on any changes


sudo systemctl enable report.path
sudo systemctl start report.path
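
A rough sketch of a path unit plus the service it triggers, under stated assumptions: the lnd-backup unit names, the destination /var/backups/channel.backup, and the write_backup_units helper are hypothetical, and the watched file is passed in rather than guessed. It assumes the path contains no spaces.

```shell
# Write a hypothetical lnd-backup.path / lnd-backup.service pair.
#   $1 directory to write the units into
#   $2 absolute path of the file to watch
write_backup_units() {
    dir="$1"
    watched="$2"
    mkdir -p "$dir"

    cat > "$dir/lnd-backup.path" <<EOF
[Unit]
Description=Backup LND channels on any changes

[Path]
# Fires when the watched file is closed after being written.
PathChanged=$watched
# Defaults to the same-named service; spelled out here for clarity.
Unit=lnd-backup.service

[Install]
WantedBy=paths.target
EOF

    cat > "$dir/lnd-backup.service" <<EOF
[Unit]
Description=Copy the LND channel backup file

[Service]
Type=oneshot
# Hypothetical destination; point this wherever backups should land.
ExecStart=/usr/bin/cp $watched /var/backups/channel.backup
EOF
}
```

As with timers, only the .path unit has an [Install] section, so it is the only one that gets enabled.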


Turns out the standard systemd examples are not very secure. Systemd provides a tool to see which services are in trouble: run systemd-analyze security for an overview, then take a deeper dive per service with systemd-analyze security <service>.
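
To narrow the overview down to the worst offenders, a small filter can help. This is a sketch that assumes the overview prints whitespace-separated columns with the unit name first and a numeric exposure score second; exposed_units is a made-up helper name.

```shell
# Print "unit score" for every unit at or above an exposure threshold.
# Reads `systemd-analyze security` output on stdin; default threshold 7.0.
exposed_units() {
    threshold="${1:-7.0}"
    # Skip the header (and anything else without a numeric second column).
    awk -v t="$threshold" \
        '$2 ~ /^[0-9]+(\.[0-9]+)?$/ && $2 + 0 >= t + 0 { print $1, $2 }'
}

# Usage on a systemd host:
#   systemd-analyze security --no-pager | exposed_units 9.0
```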

My standard set of hardening flags (which I’ll try to expand as I learn more about them):

# Hardening

I think RuntimeDirectory could be used to automatically create and destroy a directory for a process, but it requires coordinating with the process so that it reads and writes in that location.


I like to be emailed on service failures.

First requirement is an easy to call email script. I am jacking this straight from the Arch wiki with some slight mods for my email setup.


#!/bin/sh
# Send alert to my email

/usr/bin/msmtp nick@yonson.dev <<ERRMAIL
Subject: $1 failure
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=UTF-8

$(systemctl status --full "$1")
ERRMAIL


Description=status email for %i

ExecStart=/usr/local/bin/systemd-email %i

This is a template unit service with limited permissions to fire the email. Edit the service you want emails for and add OnFailure=status-email@%n.service to the [Unit] (not [Service]) section. %n passes the unit’s name to the template.

OnFailure= A space-separated list of one or more units that are activated when this unit enters the “failed” state. A service unit using Restart= enters the failed state only after the start limits are reached.

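If using Restart=, a service only enters the failed state once the start limit is hit, so StartLimitIntervalSec= and StartLimitBurst= need values that the restart loop can actually reach. On current systemd versions these two settings belong in the [Unit] section (older versions read StartLimitInterval= from [Service]).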

Units which are started more than burst times within an interval time span are not permitted to start any more.

The template service (e.g. status-email@lnd) does not have to be enabled.
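
Putting the pieces together, here is a sketch of a drop-in that wires OnFailure= and the start limits onto an existing service. The limit and restart values, the failure-email.conf file name, and the write_failure_dropin helper are assumptions for illustration. The start-limit settings are placed in [Unit], which is where systemd 230 and later read them.

```shell
# Write a drop-in that emails on failure for the given unit.
#   $1 unit name, e.g. lnd.service
#   $2 unit directory (defaults to /etc/systemd/system)
write_failure_dropin() {
    unit="$1"
    root="${2:-/etc/systemd/system}"
    dropin="$root/$unit.d"
    mkdir -p "$dropin"

    cat > "$dropin/failure-email.conf" <<'EOF'
[Unit]
OnFailure=status-email@%n.service
# Hypothetical limits: give up after 5 starts within 5 minutes.
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=10
EOF
}
```

With a 10 second RestartSec=, five failed starts fit comfortably inside the 300 second window, so a crash-looping service ends up in the failed state and the email fires. Run systemctl daemon-reload after writing the drop-in.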


Impedance mismatch

Systemd’s model works great for executables which follow the standard fork/exec process.

A growing portion of the services I am running do not publish an executable, but rather a container. The docker client/server (docker daemon) model takes on some of the same responsibilities as systemd (e.g. restarting a service when it fails). There is not a clear line in the sand.

Since I haven’t gone 100% containers for everything though, I need to declare all the service dependencies in systemd (including containerized services). This works just OK. The main shortcoming is that by default docker client commands are fire-and-forget. This sucks because no info is passed back to systemd: it doesn’t know if the service actually started up correctly and can’t pass that along to dependencies.

Docker commands must always be run by root (or a user in the docker group, which is pretty much the same thing from a security perspective) so we can’t utilize systemd’s ability to execute services as a system user (e.g. bitcoin).

// example systemd service file wrapping docker
Description=Matrix go neb bot

// docker's `--attach` option forwards signals and stdout/stderr helping pass some info back to systemd
ExecStart=docker start -a 80a975d2f9baff82a27edc389bfe2f5a74e597560acc63fb3dfe4a3df07c8797
ExecStop=docker stop 80a975d2f9baff82a27edc389bfe2f5a74e597560acc63fb3dfe4a3df07c8797


Lastly, there is a lot of complexity around user and group permissions between the host and containers. This is most apparent when sharing data between a host and container through a bind mount. In a perfect world from a permissions standpoint, the host’s users and groups would be mirrored into the container and respected there. However, most containers default to running as UID 0, a.k.a. root (note: there is a difference between the UID that starts the container and the UID that the container runs as internally, but they can be the same).

Here is where the complexity jumps again: enter user namespaces. User namespaces are the future fix to help the permissions complexity, but you have to either go all in on them or follow the old best practices. The two don’t really play nice together.

User and groups old best practices

I have never gone “all in” on containers, but I think this is the decision tree for best practices:

if (all in on containers) {
    use *volumes* which are fully managed by container framework and hide complexities
} else if (mixing host and containers) {
    use bind mounts and manually ensure UID/GID match between host and containers
} else {
    // no state shared between host and container
    don't even worry bout this stuff
}

Case 2 can be complicated by containers which specify a USER in their Dockerfile. On one hand, this is safer than running as root by default. On the other, this makes them less portable since all downstream systems will have to match this UID in order to work with bind mounts.

I am attempting to bridge these mismatches by switching from docker to podman.


I am not completely sold on Red Hat’s new container ecosystem of tools (podman, buildah, skopeo…not sure if I love or hate these names yet), but podman has me sold on the fact that it uses the standard fork/exec model instead of client/server allowing it to “click in” to systemd with default settings. podman also runs in rootless mode allowing the OS user permission model to be used again (although I think there are still some complexities here).

At the time of my switch (2021-03-30), the Arch wiki on podman listed three requirements for using podman.

  1. kernel.unprivileged_userns_clone kernel parameter needs to be set to 1
    • all my boxes had this set to 1 already so this was easy
  2. cgroups v2 needs to be enabled
    • this can be checked by running ls /sys/fs/cgroup and seeing a lot of cgroup.* entries
    • systemd v248 defaults to v2, and some proof that the world revolves around me, v248 was released 36 minutes before I wrote this down so we are probably good for the future
  3. Set subuid and subgid
    • these are for the user namespaces, I saw a warning from podman when I didn’t set it

Note that the values for each user must be unique and without any overlap. If there is an overlap, there is a potential for a user to use another’s namespace and they could corrupt it.
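
The no-overlap rule is easy to get wrong when adding ranges by hand, so here is a sketch of a checker. It assumes the usual name:start:count line format of /etc/subuid and /etc/subgid; check_subid_overlap is a made-up helper name.

```shell
# Report overlapping ranges in a subuid/subgid style file (name:start:count).
# Prints one line per overlap and returns non-zero if any were found.
check_subid_overlap() {
    # Sort by range start so each range only needs comparing against
    # the furthest-reaching range seen so far.
    sort -t: -k2,2n "$1" | awk -F: '
        NF == 3 {
            start = $2 + 0
            end = start + $3 - 1
            if (seen && start <= prev_end) {
                printf "overlap: %s and %s\n", prev_name, $1
                bad = 1
            }
            if (!seen || end > prev_end) {
                prev_end = end
                prev_name = $1
            }
            seen = 1
        }
        END { exit bad }
    '
}

# Usage:
#   check_subid_overlap /etc/subuid && check_subid_overlap /etc/subgid
# Ranges themselves can be assigned with usermod, for example:
#   usermod --add-subuids 100000-165535 --add-subgids 100000-165535 njohnson
```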

[njohnson@gemini ~]$ podman run docker.io/hello-world
ERRO[0000] cannot find UID/GID for user njohnson: open /etc/subuid: no such file or directory - check rootless mode in man pages.
WARN[0000] using rootless singl


It is now easier to use crun instead of runc as the OCI runtime. I don’t really know what any of these words mean, but I ran into Error: container_linux.go:380: starting container process caused: error adding seccomp filter rule for syscall bdflush: permission denied: OCI permission denied and this was the fix. Apparently the “lighter” crun is the future. Set it in /etc/containers/containers.conf with runtime = "crun" (under the [engine] table).

User namespace permissions

With user namespaces enabled, the container’s default root UID is mapped to the host UID that started the process (when rootless). Nifty tables here show the differences.

I believe this is the best of both worlds, but it does require that images don’t specify a USER (the old best practice to avoid running as root). If USER is used, it maps to one of the sub UIDs of the host user’s namespace instead of to the host user (which is root in the container).
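
The mapping is just arithmetic, so a worked sketch may help. It assumes the default rootless layout as I understand it: container UID 0 maps to the host user, and container UIDs 1 and up map one-to-one onto that user's subordinate range from /etc/subuid. host_uid is a made-up helper name.

```shell
# Map a container UID to the host UID under a rootless user namespace.
#   $1 container UID
#   $2 host UID of the user who started the container
#   $3 first subordinate UID (start field in /etc/subuid)
#   $4 number of subordinate UIDs (count field in /etc/subuid)
host_uid() {
    cuid="$1"; user_uid="$2"; sub_start="$3"; sub_count="$4"
    if [ "$cuid" -eq 0 ]; then
        # Container root is the invoking host user.
        echo "$user_uid"
    elif [ "$cuid" -le "$sub_count" ]; then
        # Container UID 1 is the first sub UID, UID 2 the second, and so on.
        echo $((sub_start + cuid - 1))
    else
        echo "container UID $cuid is outside the mapped range" >&2
        return 1
    fi
}
```

For a host user with UID 1000 and a subuid entry of 100000:65536, host_uid 0 1000 100000 65536 gives 1000, while an image with USER 1000 lands on host_uid 1000 1000 100000 65536, which is 100999. That second number is why files written to a bind mount by a USER image show up on the host owned by some strange high UID.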

A bit more is needed to get group permissions to work correctly. Here is Red Hat’s group permissions deep dive.

POSIX vs User namespaces

There are interesting security tradeoffs when it comes to user namespaces; more on that here and here.

Positive: adds a 2nd layer of defense. If an attacker gains root in a container, they are still unprivileged (mapped to non-root) on the host.

Negative: capabilities that a user does not have on the host they can all of a sudden get, in a limited fashion, in the container.


The container’s root user does have a bit of extra privileges, but nothing that could affect the host. There is a setting to run as a user that matches the host, --userns=keep-id, which would give up these extra privileges. This might just be cosmetic to not see uid 0 in the container…

Running a command

Running podman commands rootless requires systemd login variables to be set correctly. In the past I have had success using a login shell with su, like su - lightning (notice the dash), but even that doesn’t seem to hook in with the logind/pam stuff, and env vars like $XDG_RUNTIME_DIR are not set. The replacement for su is machinectl shell --uid=lightning.

For interactive processes (like a shell), you must use -i -t together in order to allocate a tty for the container process. -i -t is often written -it as you’ll see in later examples. Specifying -t is forbidden when the client is receiving its standard input from a pipe, as in:

# attaching a volume and running a command expecting output
podman run -it -v $HOME/.bos:/home/node/.bos docker.io/alexbosworth/balanceofsatoshis --version

Long running

If a user logs out, all processes of that user are killed. This might not be ideal if you have long running processes (like a web server) that you want to keep running. Systemd’s logind has a “linger” setting to allow this, but to be honest, I am not quite sure of all the side effects yet.

loginctl enable-linger lightning


Another container aspect that is extra complicated because I haven’t gone full containers.

  1. Use the host network
  2. Expose the host loopback inside the container
--network slirp4netns:allow_host_loopback=true