System Management

2023.09.03

systemd and containers and chaos

We were fated to pretend

Permissions

users

Operating system users and groups are a tool to minimize access. Follow the principle of least privilege to protect against attacks (including accidental friendly fire).

# useradd -m dojo
# passwd dojo

add a dojo user

$ cat /etc/passwd

list users on system

$ groups

list groups current user is in, or pass a username

# getent group $GROUP

list members of a group

$ cat /etc/group

list all groups

There is a concept of a "system" user which has no login shell; its only purpose is to run daemons. This actually fits most of my use cases, but I like being able to hop into a user's shell, so I rarely use it.
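Creating one looks something like this (myapp is a hypothetical daemon user):

# useradd -r -s /usr/bin/nologin myapp

add a system user with no login shell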

# userdel -r $USERNAME

remove a user; the -r option cleans up the home directory, and the matching group is removed too

# groupdel $GROUP

remove group

Environment

There are a lot of ways to set environment variables: shell configs, PAM, systemd. It also gets confusing knowing when vars are set and how to pass them from a system process to a user and vice versa.

Per-user is where it gets tricky.

I have used PAM before, but according to the docs it's on its way out.

Systemd is probably the safest bet, via ~/.config/environment.d/*.conf, but these are only loaded for services, not shell programs (unless they are somehow service based).
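A minimal sketch of one of these files (the variables are just examples):

~/.config/environment.d/50-custom.conf

EDITOR=vim
PATH=$HOME/bin:$PATH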

export $(/usr/lib/systemd/user-environment-generators/30-systemd-environment-d-generator)

Systemd has a little tool to export vars

systemd

I have tried a lot of tools to manage the ever growing chaos of my homeservers. Something like ansible still feels like too much overhead for me. I have settled on just manually creating system users (e.g. bitcoin for running a full node) and keeping the complexity of the app localized as much as possible (e.g. building an executable in the home directory). This utilizes the OS's built-in user/group permissions model and makes it pretty easy to follow the principle of least privilege (e.g. only services that need access to Tor have access).

I use systemd to automate starting/stopping/triggering services and managing the service dependencies.

Personal units

Sometimes I need to make a change to a unit, but don't want it to be blown away by pacman on the next update. Systemd has a tool to do just this. Use systemd-delta to keep track of the overrides on a system.

systemctl edit <unit>
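The edit sub-command drops the change into an override file which pacman will not touch. A sketch of the result (the nginx unit and the tweak are just examples):

/etc/systemd/system/nginx.service.d/override.conf

[Service]
Restart=on-failure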

timers

Systemd can trigger services like cron. The easiest way to do this is to create a .timer file for a service and then enable/start the timer. Do not enable the service itself; that confuses systemd.

[Unit]
Description=Run BOS node report

[Service]
User=lightning
Group=lightning
ExecStart=/home/lightning/balanceofsatoshis/daily-report.sh

report.service

[Unit]
Description=Run report daily

[Timer]
OnCalendar=daily

[Install]
WantedBy=timers.target

report.timer

sudo systemctl enable report.timer
sudo systemctl start report.timer

enable the timer which triggers the service

path triggers

Another helpful trigger is when a path changes. Think backing up a file every time it is modified. Like timers, only enable the .path and not the service itself.

[Unit]
Description=Backup LND channels on any changes

[Path]
PathModified=/data/lnd/backup/channel.backup

[Install]
WantedBy=multi-user.target

backup.path

sudo systemctl enable backup.path
sudo systemctl start backup.path

enable the path which triggers the service
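A .path unit activates the service with the same name by default, so backup.path needs a matching backup.service. A sketch, assuming the backup script from the secure backup section below is installed at /usr/local/bin/backup.sh and run as a hypothetical lnd user:

[Unit]
Description=Backup LND channels

[Service]
Type=oneshot
User=lnd
ExecStart=/usr/local/bin/backup.sh

backup.service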

sandboxing

Turns out the standard systemd examples are not very secure. Systemd provides a tool to see which services are in trouble, systemd-analyze security, and a deeper dive per service with systemd-analyze security <service>.

My standard set of hardening flags (which I’ll try to expand as I learn more about them):

# Hardening
PrivateTmp=true
PrivateDevices=true
ProtectSystem=strict
NoNewPrivileges=true

I think RuntimeDirectory could be used to auto-create and destroy a directory for a process, but it requires coordination with the process to read/write at that location.
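A sketch of how that might look (myapp and its flag are hypothetical):

[Service]
# systemd creates /run/myapp owned by the service's user and removes it when the service stops
RuntimeDirectory=myapp
ExecStart=/usr/bin/myapp --socket /run/myapp/app.sock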

systemd email

First requirement is an easy to call email script. I am jacking this straight from the Arch wiki with some slight mods for my email setup.

/usr/local/bin/systemd-email

#!/bin/sh
#
# Send alert to my email

/usr/bin/msmtp --read-recipients <<ERRMAIL
To: nick+gemini@yonson.dev
Subject: $1 failure
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=UTF-8

$(systemctl status --full "$1")
ERRMAIL

/etc/systemd/system/status-email@.service

[Unit]
Description=status email for %i

[Service]
Type=oneshot
ExecStart=/usr/local/bin/systemd-email %i
User=nobody
Group=systemd-journal

A template unit service with limited permissions to fire the email. Edit the service you want emails for and add OnFailure=status-email@%n.service to the [Unit] (not [Service]) section. %n passes the unit's name to the template.
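For example, a drop-in (via systemctl edit) for a hypothetical lnd.service:

[Unit]
OnFailure=status-email@%n.service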

If using Restart=, a service will only enter a failed state once it hits the limits set by StartLimitIntervalSec= and StartLimitBurst= in the [Unit] section.

The template service does not have to be enabled (e.g. status-email@lnd).

containers with systemd – impedance mismatch

Systemd’s model works great for executables which follow the standard fork/exec process.

A growing portion of the services I am running do not publish an executable, but rather a container. The docker client/server (docker daemon) model takes on some of the same responsibilities as systemd (e.g. restarting a service when it fails). There is not a clear line in the sand.

Since I haven't gone 100% containers for everything, I need to declare all the service dependencies in systemd (including containerized services). This works just OK. The main shortcoming is that by default docker client commands are fire-and-forget. This sucks because no info is passed back to systemd: it doesn't know if the service actually started up correctly and can't pass that along to dependencies.

Docker commands must always be run by root (or a user in the docker group, which is pretty much the same thing from a security perspective), so we can't utilize systemd's ability to execute services as a system user (e.g. bitcoin).

# example systemd service file wrapping docker
[Unit]
Description=Matrix go neb bot
Requires=docker.service
After=docker.service

# docker's `--attach` option forwards signals and stdout/stderr, helping pass some info back to systemd
[Service]
ExecStart=docker start -a 80a975d2f9baff82a27edc389bfe2f5a74e597560acc63fb3dfe4a3df07c8797
ExecStop=docker stop 80a975d2f9baff82a27edc389bfe2f5a74e597560acc63fb3dfe4a3df07c8797

[Install]
WantedBy=multi-user.target

Lastly, there is a lot of complexity around user and group permissions between the host and containers. This is most apparent when sharing data between a host and container through a bind mount. In a perfect world from a permissions standpoint, the host's users and groups would be mirrored into the container and respected there. However, most containers default to running as UID 0, a.k.a. root (note: there is a difference between the UID who starts the container and the UID which internally runs the container, but they can be the same).

Here is where the complexity jumps again: enter user namespaces. User namespaces are the future fix for this permissions complexity, but you have to either go all in on them or follow the old best practices; they don't really play nice together.

user and groups old best practices

I have never gone “all in” on containers, but I think this is the decision tree for best practices:

if (all in on containers) {
    use *volumes* which are fully managed by container framework and hide complexities
} else if (mixing host and containers) {
    use bind mounts and manually ensure UID/GID match between host and containers
} else {
    // no state shared between host and container
    don't even worry bout this stuff
}

Case 2 can be complicated by containers which specify a USER in their Dockerfile. On one hand, this is safer than running as root by default. On the other, this makes them less portable since all downstream systems will have to match this UID in order to work with bind mounts.

I am attempting to bridge these mismatches by switching from docker to podman.

File System

$ df -h

disk free info at a high level

$ du -sh

disk usage of the current directory, summarized and human readable

# du -ax / | sort -rn | head -20

track down heavy files and directories

secure backup

I have dealt with quite a few object storage providers professionally. For my day to day home server needs though, all of these were too enterprise. For my projects I use rsync.net.

After setting up an account, rsync.net gives you a host to run unix commands on.

A script encrypts a file with PGP and syncs it to my rsync.net host. There is no need for the PGP private key to be on the box; the public key can encrypt the file. If the file is ever needed, the private key will be required to decrypt it.

My standard backup script includes an email for failure notifications.

backup.sh

#!/bin/bash
#
# securely backup a file to rsync.net

RECIPIENT=nick@yonson.dev
FILE=/data/lnd/backup/channel.backup

# --yes overwrites an existing encrypted file
if gpg --yes --recipient "$RECIPIENT" --encrypt "$FILE" &&
   # namespacing backups by host
   rsync --relative "$FILE.gpg" change@change.rsync.net:"$HOSTNAME"/ ; then
    echo "Successfully backed up $FILE"
else
    echo "Error backing up $FILE"
    # the blank line after the Subject header separates headers from body
    printf "Subject: Error rsyncing\n\nUnable to backup %s on %s\n" "$FILE" "$HOSTNAME" | msmtp nick@yonson.dev
fi

Package Management

[options]
IgnorePkg = docker-compose

lock package in pacman.conf

$ pacman -Qqe

list explicitly installed packages

AUR

$ gpg --keyserver pgp.mit.edu --recv-keys 118759E83439A9B1

get a key from a keyserver

$ curl https://raw.githubusercontent.com/lightningnetwork/lnd/master/scripts/keys/bhandras.asc --output key.asc
$ gpg --import key.asc 

download and import public key file to keyring

PKGBUILD

aurutils

$ aur repo -l

list packages

patch

Containers

I am not completely sold on Red Hat's new container ecosystem of tools (podman, buildah, skopeo…not sure if I love or hate these names yet), but podman sold me by using the standard fork/exec model instead of client/server, allowing it to "click in" to systemd with default settings. podman also runs in rootless mode, allowing the OS user permission model to be used again (although I think there are still some complexities here).

At the time of me switching (2021-03-30), the Arch wiki listed three requirements for using podman.

  1. The kernel.unprivileged_userns_clone kernel param needs to be set to 1
    • all my boxes had this set to 1 already so this was easy
  2. cgroups v2 needs to be enabled
    • this can be checked by running ls /sys/fs/cgroup and seeing a lot of cgroup.* entries
    • systemd v248 defaults to v2, and, as some proof that the world revolves around me, v248 was released 36 minutes before I wrote this down, so we are probably good going forward
  3. Set subuid and subgid
    • these are for the user namespaces; podman warned me when I didn't set them (example below)
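The entries are user:start:count ranges, with the same format for both files. A sketch (the user and ranges are just examples):

/etc/subuid and /etc/subgid

njohnson:100000:65536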

Note that the values for each user must be unique and without any overlap. If there is an overlap, there is a potential for a user to use another’s namespace and they could corrupt it.

[njohnson@gemini ~]$ podman run docker.io/hello-world
ERRO[0000] cannot find UID/GID for user njohnson: open /etc/subuid: no such file or directory - check rootless mode in man pages.
WARN[0000] using rootless singl

It is now easier to use the crun OCI runtime instead of runc. I don't really know what any of these words mean. But I ran into an Error: container_linux.go:380: starting container process caused: error adding seccomp filter rule for syscall bdflush: permission denied: OCI permission denied and this was the fix. Apparently the "lighter" crun is the future. Set it in /etc/containers/containers.conf with runtime = "crun".
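The setting lives in the [engine] section (a minimal sketch):

/etc/containers/containers.conf

[engine]
runtime = "crun"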

namespace permissions

With user namespaces enabled (and running rootless), the container's default root UID is mapped to the host UID which started the process. Nifty tables here show the differences.

I believe this is the best of both worlds, but it does require that images do not specify a USER (the old best practice to avoid running as root). If USER is used, it will map to one of the sub UIDs of the host's user namespace instead of to the host user (which is root in the container).

A bit more is needed to get group permissions to work correctly. Here is Red Hat's group permissions deep dive.

$ podman top --latest huser user

podman also gives us a really cool sub-command called top which lets us map the user on the container host (huser) to the user in the running container.

$ sysctl kernel.unprivileged_userns_clone

check that the kernel parameter is set

POSIX vs User namespaces

There are interesting security tradeoffs when it comes to user namespaces; more on that here and here.

Positive: adds a second layer of defense; if an attacker gains root in a container, they are still unprivileged (mapped to non-root) on the host.

Negative: capabilities that a user does not have on the host can suddenly be obtained, in a limited fashion, in the container.

--userns=keep-id

The container's root user does have a few extra privileges, but nothing that could affect the host. There is a setting to run as a user that matches the host, --userns=keep-id, which would give up these extra privileges. This might just be cosmetic to not see uid 0 in the container…
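Usage is just the flag on run (the image is an example); id inside the container should then show the host user's UID:

$ podman run --userns=keep-id docker.io/library/alpine id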

systemd integration

Running podman commands rootless requires the systemd login variables to be set correctly. In the past I have had success using a login shell with su, like su - lightning (notice the dash), but even that doesn't seem to hook into the logind/PAM stuff, and env vars like $XDG_RUNTIME_DIR are not set. The replacement for su is machinectl shell --uid=lightning.

For interactive processes (like a shell), you must use -i -t together in order to allocate a tty for the container process. -i -t is often written -it. Specifying -t is forbidden when the client is receiving its standard input from a pipe, as in:

# attaching a volume and running a command expecting output
podman run -it -v $HOME/.bos:/home/node/.bos docker.io/alexbosworth/balanceofsatoshis --version

If a user logs out, all processes of that user are killed. This might not be ideal if you have long running processes (like a web server) that you want to keep running. Systemd's logind has a "linger" setting to allow this, but to be honest, I am not quite sure of all the side effects yet.

loginctl enable-linger lightning

build

podman build -t clients:latest -f Containerfile .

build with tag, file, and context

processes

registries

Configure /etc/containers/registries.conf to look at docker.io automatically.

unqualified-search-registries = ["docker.io"]

publish

ghcr.io/nyonson/raiju:latest

ghcr.io is GitHub's container registry

containerfile (dockerfile)

The VOLUME instruction creates a mount point with the specified name and marks it as holding externally mounted volumes from native host or other containers.

work dir

The WORKDIR instruction sets the working directory for any RUN, CMD, ENTRYPOINT, COPY and ADD instructions that follow it in the Dockerfile.

COPY and ADD

COPY [--chown=<user>:<group>] <src>... <dest>
COPY [--chown=<user>:<group>] ["<src>",... "<dest>"]
ADD [--chown=<user>:<group>] <src>... <dest>
ADD [--chown=<user>:<group>] ["<src>",... "<dest>"]

Multiple resources may be specified but the paths of files and directories will be interpreted as relative to the source of the context of the build.

copy from other images

Optionally COPY accepts a flag --from=<name> that can be used to set the source location to a previous build stage (created with FROM .. AS <name>) that will be used instead of a build context sent by the user. In case a build stage with a specified name can't be found, an image with the same name is attempted to be used instead.

You can have as many stages (e.g. FROM … AS …) as you want. The last one is the one defining the image which will be the template for the docker container.

When using multi-stage builds, you are not limited to copying from stages you created earlier in your Dockerfile. You can use the COPY --from instruction to copy from a separate image, either using the local image name, a tag available locally or on a Docker registry, or a tag ID. The Docker client pulls the image if necessary and copies the artifact from there.

copy or add?

Entrypoint and CMD

EXPOSE

The EXPOSE instruction informs Docker that the container listens on the specified network ports at runtime. EXPOSE does not make the ports of the container accessible to the host.

The EXPOSE instruction does not actually publish the port. It functions as a type of documentation between the person who builds the image and the person who runs the container, about which ports are intended to be published.

RUN

dockerignore

golang container builds
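The multi-stage COPY --from pattern above is the usual approach: build the binary with the full Go toolchain, then copy just the static binary into a minimal final image. A sketch of a Containerfile (the version, paths, and names are assumptions):

# build stage with the full Go toolchain
FROM docker.io/library/golang:1.21 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /out/app .

# final stage holds only the static binary
FROM scratch
COPY --from=build /out/app /app
ENTRYPOINT ["/app"]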

create, start, and run

podman create --name container-1 ubuntu

create and name a container; this is also when you can supply the CMD

networking

Another container aspect that is extra complicated because I haven’t gone full containers.

  1. Use the host network: --network=host
  2. Expose the host as 10.0.2.2 in the container

Using Compose as an abstraction layer, containers can be created on the same network and talk to each other. DNS requires a plugin: install podman-dnsname from the Arch repo.

Ports can be EXPOSED or PUBLISHED. Expose means the port can be reached by other container services. Published means the port is mapped to a host port.
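Publishing happens at run time with the -p flag (the ports here are just examples):

$ podman run -p 8080:80 docker.io/library/nginx

map container port 80 to host port 8080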

compose

The yaml spec can be backed by podman instead of the docker daemon.

$ systemctl --user enable --now podman.socket

creates a socket for the user

$ ls -la /run/user/1000/podman/podman.sock
srw-rw---- 1 njohnson njohnson 0 Aug  9 05:34 /run/user/1000/podman/podman.sock=

double checking

$ curl --unix-socket /run/user/1000/podman/podman.sock http://localhost/images/json

curl to ping the unix socket

The default docker socket is /var/run/docker.sock and it usually is owned by root, but in the docker group. Fun fact, /var/run is symlinked to /run on Arch.

podman system service is how to run podman in a daemon API-esque mode.

The podman-docker package provides a very light docker shim over the podman executable. The PKGDEST led me to the Makefile of podman which contains the logic to override docker. /usr/lib/tmpfiles.d/podman-docker.conf contains the symlink definition. tmpfiles.d is a systemd service to create and maybe maintain volatile files.

The podman-docker shim is necessary for docker-compose exec.

down

Should use the official down command (over ctrl-c) to ensure everything is cleaned up, else containers could be reused.

> docker-compose down
Stopping publisher ... done
Stopping bitcoind  ... done
Stopping consumer  ... done
Removing publisher ... done
Removing bitcoind  ... done
Removing consumer  ... done
Removing network lightning_lnnet

Volumes

Define in image (Dockerfile) or outside (docker-compose)?

Networks

ENV and ARG

The ENV instruction sets an environment variable to a value, which is persisted in the final image.

Or use ARG, which is not persisted in the final image.

When building a Docker image from the command line, you can set ARG values using --build-arg.

If you try to set a variable which is not an ARG mentioned in the Dockerfile, Docker will complain.

When building an image, the only thing you can provide are ARG values, as described above. You can't provide values for ENV variables directly. However, ARG and ENV can work together: you can use ARG to set the default values of ENV vars.
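A sketch of the ARG-feeding-ENV pattern (APP_VERSION is hypothetical):

# Containerfile: the build arg becomes the default value of the env var baked into the image
ARG APP_VERSION=dev
ENV APP_VERSION=$APP_VERSION

$ podman build --build-arg APP_VERSION=1.2.3 .

override the default at build time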

dependencies

troubleshooting

one-off exec

Sometimes you want to debug something on a running container.

podman exec c10c378bb306 cat /etc/hosts

one-off

podman exec -it fdea48095fe1 /bin/bash

shell process

Copy file from container to host

podman cp <containerId>:/file/path/within/container /host/path/target

new kernel

If you are getting weird errors, it might be because a new kernel was installed.

ERRO[0000] 'overlay' is not supported over extfs at "/var/lib/containers/storage/overlay"

duplicate mount points

podman system prune

WARNING! This will remove:
	- all stopped containers
	- all networks not used by at least one container
	- all dangling images
	- all dangling build cache

podman volume prune

Nix