systemd and containers and chaos
We were fated to pretend
Permissions
users
Operating system users and groups are a tool to minimize access. Follow the principle of least privilege to protect against attacks (including accidental friendly fire).
# useradd -m dojo
# passwd dojo
add a dojo user
$ cat /etc/passwd
list users on system
$ groups
list groups current user is in, or pass a username
# getent group $GROUP
list members of a group
$ cat /etc/group
list all groups
There is a concept of a “system” user which has no login shell. Its only purpose is to run daemons. This actually fits most of my use cases, but I like being able to hop into a user’s shell, so I rarely use it.
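A minimal sketch of creating one (using the bitcoin daemon user mentioned later as the example):
# useradd --system -m -s /usr/bin/nologin bitcoin
add a system user with a home directory but no login shell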
# userdel -r $USERNAME
remove user, the -r option cleans up the home directory and removes the matching group too
# groupdel $GROUP
remove group
- An interesting blog post from Drew DeVault exposed me to how commands automatically run as a different user or group.
Environment
There are a lot of ways to set environment variables: shell configs, PAM, systemd. It gets confusing knowing when vars are set and how to pass them from system processes to user processes and vice versa.
- arch linux docs
- print environment vars with printenv
Per-user is where it gets tricky.
I have used PAM before, but according to the docs it’s on its way out.
Systemd is probably the safest bet, in ~/.config/environment.d/*.conf, but these are only loaded for services, not shell programs (unless they are somehow service based).
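A minimal sketch of what one of these files could look like (the file name and vars are made up):
EDITOR=vim
LESS=-R
~/.config/environment.d/50-defaults.conf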
- Note that some managers query the systemd user instance for the exported environment and inject this configuration into programs they start, using systemctl show-environment or the underlying D-Bus call.
export $(/usr/lib/systemd/user-environment-generators/30-systemd-environment-d-generator)
Systemd has a little tool to export vars
systemd
I have tried a lot of tools to manage the ever growing chaos of my homeservers. Something like ansible still feels like too much overhead for me. I have settled on just manually creating system users (e.g. bitcoin for running a full node) and keeping the complexity of the app localized as much as possible (e.g. building an executable in the home directory). This utilizes the OS’s builtin user/group permissions model and makes it pretty easy to follow the principle of least privilege (e.g. only services that need access to Tor have access).
I use systemd
to automate starting/stopping/triggering services and managing the service dependencies.
Personal units
- system level: drop them in /etc/systemd/system
- user level…haven’t needed to mess much with yet
Sometimes I need to make a change to a unit, but don’t want it to be blown away by pacman on the next update. Systemd has a tool to do just this. Use systemd-delta to keep track of changes on the system.
systemctl edit <unit>
- add power targets
timers
Systemd can trigger services like cron. The easiest way to do this is to create a .timer file for a service and then enable/start the timer. Do not enable the service itself, that confuses systemd.
[Unit]
Description=Run BOS node report
[Service]
User=lightning
Group=lightning
ExecStart=/home/lightning/balanceofsatoshis/daily-report.sh
report.service
[Unit]
Description=Run report daily
[Timer]
OnCalendar=daily
[Install]
WantedBy=timers.target
report.timer
sudo systemctl enable report.timer
sudo systemctl start report.timer
enable the timer which triggers the service
path triggers
Another helpful trigger is when a path changes. Think backing up a file every time it is modified. Like timers, only enable the .path and not the service itself.
[Unit]
Description=Backup LND channels on any changes
[Path]
PathModified=/data/lnd/backup/channel.backup
[Install]
WantedBy=multi-user.target
backup.path
sudo systemctl enable backup.path
sudo systemctl start backup.path
enable the path which triggers the service
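By default a .path unit activates the service with the same name, so a matching backup.service needs to exist. A minimal sketch (the user and script path are assumptions):
[Unit]
Description=Backup LND channels

[Service]
Type=oneshot
User=lightning
Group=lightning
ExecStart=/home/lightning/backup.sh
backup.service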
sandboxing
Turns out the standard systemd examples are not very secure. Systemd provides a tool to see which services are in trouble: systemd-analyze security, and then take a deeper dive per-service with systemd-analyze security <service>.
My standard set of hardening flags (which I’ll try to expand as I learn more about them):
# Hardening
PrivateTmp=true
PrivateDevices=true
ProtectSystem=strict
NoNewPrivileges=true
PrivateTmp // Processes running with this flag see a different and unique /tmp from the one users and other daemons see or can access. Mitigates other programs reading tmp data.
PrivateDevices // Sets up a new /dev/ mount for the executed processes, useful to securely turn off physical device access by the executed process.
ProtectSystem // Makes common system directories read only, don’t need programs messing with /boot.
NoNewPrivileges // Prevents the service and related child processes from escalating privileges, seems like a reasonable default.
I think RuntimeDirectory could be used to auto create and destroy a directory for a process, but it requires coordination with the process to write/read to that location.
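A sketch of how I think it would be used (the unit, directory, and flag names are hypothetical):
[Service]
# systemd creates /run/myapp owned by the service user and removes it when the service stops
RuntimeDirectory=myapp
User=myapp
ExecStart=/usr/local/bin/myapp --socket /run/myapp/myapp.sock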
systemd email
The first requirement is an easy-to-call email script. I am jacking this straight from the Arch wiki with some slight mods for my email setup.
/usr/local/bin/systemd-email
#!/bin/sh
#
# Send alert to my email
/usr/bin/msmtp --read-recipients <<ERRMAIL
To: nick+gemini@yonson.dev
Subject: $1 failure
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=UTF-8

$(systemctl status --full "$1")
ERRMAIL
/etc/systemd/system/status-email@.service
[Unit]
Description=status email for %i
[Service]
Type=oneshot
ExecStart=/usr/local/bin/systemd-email %i
User=nobody
Group=systemd-journal
A template unit service with limited permissions to fire the email. Edit the service you want emails for and add OnFailure=status-email@%n.service to the [Unit] section (not [Service]). %n passes the unit’s name to the template.
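For example, a drop-in for a hypothetical lnd.service (created with systemctl edit lnd.service) would just need:
[Unit]
OnFailure=status-email@%n.service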
OnFailure=
A space-separated list of one or more units that are activated when this unit enters the “failed” state. A service unit using Restart= enters the failed state only after the start limits are reached.
If using Restart=, a service will only enter a failed state if StartLimitIntervalSec= and StartLimitBurst= are set in the [Service] section.
- Units which are started more than burst times within an interval time span are not permitted to start any more.
You do not have to enable the template service (e.g. status-email@lnd).
containers with systemd – impedance mismatch
Systemd’s model works great for executables which follow the standard fork/exec process.
A growing portion of the services I am running do not publish an executable, but rather a container. The docker client/server (docker daemon) model takes on some of the same responsibilities as systemd (e.g. restarting a service when it fails). There is not a clear line in the sand.
Since I haven’t gone 100% containers for everything though, I need to declare all the service dependencies in systemd (including containerized services). This works just OK. The main shortcoming is that by default docker client commands are fire-and-forget. This sucks because no info is passed back to systemd, it doesn’t know if the service actually started up correctly and can’t pass that along to dependencies.
Docker commands must always be run by root (or a user in the docker group, which is pretty much the same thing from a security perspective) so we can’t utilize systemd’s ability to execute services as a system user (e.g. bitcoin).
# example systemd service file wrapping docker
[Unit]
Description=Matrix go neb bot
Requires=docker.service
After=docker.service
# docker's `--attach` option forwards signals and stdout/stderr helping pass some info back to systemd
[Service]
ExecStart=docker start -a 80a975d2f9baff82a27edc389bfe2f5a74e597560acc63fb3dfe4a3df07c8797
ExecStop=docker stop 80a975d2f9baff82a27edc389bfe2f5a74e597560acc63fb3dfe4a3df07c8797
[Install]
WantedBy=multi-user.target
Lastly, there is a lot of complexity around user and group permissions between the host and containers. This is most apparent when sharing data between a host and container through a bind mount. In a perfect world from a permissions standpoint, the host’s users and groups would be mirrored into the container and respected there. However, most containers default to running as UID 0 a.k.a. root (note: there is a difference between the uid who starts the container and the uid which internally runs the container, but they can be the same).
Here is where the complexity jumps again: enter user namespaces. User namespaces are the future fix to help the permissions complexity, but you have to either go all in on them or follow the old best practices, they don’t really play nice together.
user and groups old best practices
I have never gone “all in” on containers, but I think this is the decision tree for best practices:
if (all in on containers) {
use *volumes* which are fully managed by container framework and hide complexities
} else if (mixing host and containers) {
use bind mounts and manually ensure UID/GID match between host and containers
} else {
// no state shared between host and container
don't even worry bout this stuff
}
Case 2 can be complicated by containers which specify a USER in their Dockerfile. On one hand, this is safer than running as root by default. On the other, this makes them less portable since all downstream systems will have to match this UID in order to work with bind mounts.
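A sketch of what keeping UID/GID in sync might look like for case 2 (the UID, user name, image, and paths are made up):
# on the host, create a user with a known UID and a directory it owns for the bind mount
sudo useradd --uid 1100 --user-group appdata
sudo install -d -o appdata -g appdata /srv/appdata
# run the container as that same UID/GID so files written to the bind mount stay owned by the host user
docker run --rm --user 1100:1100 -v /srv/appdata:/data docker.io/library/alpine touch /data/hello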
I am attempting to bridge these mismatches by switching from docker to podman.
File System
$ df -h
disk free info at a high level
$ du -sh
disk usage, summarized and human readable of PWD
# du -ax / | sort -rn | head -20
track down heavy files and directories
secure backup
I have dealt with quite a few object storage providers professionally. For my day to day home server needs though, all of these were too enterprise. For my projects I use rsync.net.
After setting up an account, rsync.net gives you a host to run unix commands on.
A script to encrypt a file with pgp and sync it to my rsync.net host. No need for the pgp private keys to be on the box, the public key can be used to encrypt the file. The private key will only be needed if the file ever has to be restored.
My standard backup script includes an email for failure notifications.
backup.sh
#!/bin/bash
#
# securely backup a file to rsync.net
RECIPIENT=nick@yonson.dev
FILE=/data/lnd/backup/channel.backup
# --yes overwrites existing
if gpg --yes --recipient $RECIPIENT --encrypt "$FILE" &&
# namespacing backups by host
rsync --relative $FILE.gpg change@change.rsync.net:$HOSTNAME/ ; then
echo "Successfully backed up $FILE"
else
echo "Error backing up $FILE"
echo "Subject: Error rsyncing\nUnable to backup $FILE on $HOSTNAME" | msmtp nick@yonson.dev
fi
Package Management
[options]
IgnorePkg = docker-compose
lock package in pacman.conf
$ pacman -Qqe
list explicitly installed
AUR
$ gpg --keyserver pgp.mit.edu --recv-keys 118759E83439A9B1
get key from a keyserver
$ curl https://raw.githubusercontent.com/lightningnetwork/lnd/master/scripts/keys/bhandras.asc --output key.asc
$ gpg --import key.asc
download and import public key file to keyring
PKGBUILD
- depends – an array of packages that must be installed for the software to build and run. Dependencies defined inside the package() function are only required to run the software.
- makedepends – only to build
- good example: dendrite
aurutils
- scripts to manage a local ABS (Arch Build System) repository which pacman can use
- man pages contain examples for common tasks like removing stuff (aur-remove)
$ aur repo -l
list packages
- have scripts run as the aur user, but it can run certain pacman commands as root without sudo via /etc/sudoers.d/10_aur
patch
- ABS docs on patches
- Want to apply a patch to an AUR package in a maintainable way
- Can edit repository source under ~/.cache/aurutils/sync/
  - update source hashes with updpkgsums
  - build with aur build -f -d custom
- change default merge strategy
  - need to see if this breaks aur fetch or at least what is the default
  - git config pull.rebase true – switch to rebase so the temp fix commit stays on top and no merge commits (probably a little dangerous)
  - hard to merge PKGBUILD hashes?
Containers
I am not completely sold on Red Hat’s new container ecosystem of tools (podman, buildah, skopeo…not sure if I love or hate these names yet), but podman has me sold on the fact that it uses the standard fork/exec model instead of client/server, allowing it to “click in” to systemd with default settings. podman also runs in rootless mode allowing the OS user permission model to be used again (although I think there are still some complexities here).
At the time of me switching (2021-03-30), the arch wiki on podman listed three requirements for using podman.
- kernel.unprivileged_userns_clone kernel param needs to be set to 1
  - all my boxes had this set to 1 already so this was easy
- cgroups v2 needs to be enabled
  - this can be checked by running ls /sys/fs/cgroup and seeing a lot of cgroup.* entries
  - systemd v248 defaults to v2, and as some proof that the world revolves around me, v248 was released 36 minutes before I wrote this down so we are probably good for the future
- Set subuid and subgid
  - these are for the user namespaces, I saw a warning from podman when I didn’t set it
Note that the values for each user must be unique and without any overlap. If there is an overlap, there is a potential for a user to use another’s namespace and they could corrupt it.
[njohnson@gemini ~]$ podman run docker.io/hello-world
ERRO[0000] cannot find UID/GID for user njohnson: open /etc/subuid: no such file or directory - check rootless mode in man pages.
WARN[0000] using rootless singl
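The fix was adding entries for the user. A sketch of what the entries could look like (the ranges are just examples, they must not overlap between users):
njohnson:100000:65536
lightning:165536:65536
/etc/subuid and /etc/subgid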
It is now easier to use the crun OCI runtime instead of runc. I don’t really know what any of these words mean. But I ran into an Error: container_linux.go:380: starting container process caused: error adding seccomp filter rule for syscall bdflush: permission denied: OCI permission denied and this was the fix. Apparently the "lighter" crun is the future. Set it in /etc/containers/containers.conf with runtime = "crun".
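My understanding is that the setting lives under the [engine] section:
[engine]
runtime = "crun"
/etc/containers/containers.conf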
namespace permissions
With user namespaces enabled, the container’s default root UID is mapped to the host OS user’s UID which started the process (when rootless). Nifty tables here show the differences.
I believe this is the best of both worlds, but it does require that images do not specify a USER (old best practice to not run as root). If USER is used, this will map to one of the sub UIDs of the host’s user namespace instead of the host user (which is root in the container).
A bit more is need to get group permissions to work correctly. Here is Red Hat’s group permissions deep dive.
$ podman top --latest huser user
podman also gives us a really cool sub-command called top which lets us map the user on the container host to the user in the running container.
$ sysctl kernel.unprivileged_userns_clone
check that the kernel param is set
POSIX vs User namespaces
There are interesting security tradeoffs when it comes to user namespaces, more on that here and here.
- POSIX == users, groups, root escalation, capabilities
Positive: adds a 2nd layer of defense; if an attacker gains root in a container, they are still unprivileged (mapped to non-root) on the host.
Negative: capabilities that a user does not have on the host can all of a sudden be gained in a limited fashion in the container.
--userns=keep-id
The container’s root user does have a bit of extra privileges, but nothing that could affect the host. There is a setting to run as a user that matches the host, --userns=keep-id, which would give up these extra privileges. This might just be cosmetic to not see uid 0 in the container…
- creates the same uid, so if the image already has the uid there could be some confusion
- doesn’t create a $HOME
systemd integration
- can create container files that podman converts to service units (see the sketch below)
- run daemon-reload
- start generated service
- docs
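A minimal sketch of one of these container files (the image and names are arbitrary), which I believe goes in ~/.config/containers/systemd/ for a user service:
[Unit]
Description=Example container unit

[Container]
Image=docker.io/traefik/whoami:latest
PublishPort=8080:80

[Install]
WantedBy=default.target
~/.config/containers/systemd/whoami.container
After a systemctl --user daemon-reload it shows up as whoami.service.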
Running podman commands rootless requires systemd login variables to be set correctly. In the past I have had success using a login shell with su like su - lightning (notice the dash), but even that doesn’t seem to hook in with the logind/pam stuff and env vars like $XDG_RUNTIME_DIR are not set. The replacement for su is machinectl shell --uid=lightning.
For interactive processes (like a shell), you must use -i -t together in order to allocate a tty for the container process. -i -t is often written -it as you’ll see in later examples. Specifying -t is forbidden when the client is receiving its standard input from a pipe, as in:
# attaching a volume and running a command expecting output
podman run -it -v $HOME/.bos:/home/node/.bos docker.io/alexbosworth/balanceofsatoshis --version
If a user logs out, all processes of that user are killed. This might not be ideal if you have long running processes (like a web server) that you want to keep running. Systemd’s logind has a “linger” setting to allow this, but to be honest, I am not quite sure of all the side effects yet.
loginctl enable-linger lightning
build
- image name often follows conventions based on the repository it’s stored in
  - for GCR this looks like: $REPO/$PROJECT/$NAME:$TAG
podman build -t clients:latest -f Containerfile .
build with tag, file, and context
processes
- best practices
- cloud run command
- use ENTRYPOINT to wrap the main process
registries
Configure /etc/containers/registries.conf to look at docker.io automatically.
unqualified-search-registries = ["docker.io"]
publish
ghcr.io/nyonson/raiju:latest
ghcr is github’s container registry
- auth with the container registry
pass raiju-access-token | docker login ghcr.io -u nyonson --password-stdin
- setup access control
containerfile (dockerfile)
The VOLUME instruction creates a mount point with the specified name and marks it as holding externally mounted volumes from native host or other containers.
work dir
The WORKDIR instruction sets the working directory for any RUN, CMD, ENTRYPOINT, COPY and ADD instructions that follow it in the Dockerfile.
COPY and ADD
COPY [--chown=<user>:<group>] <src>... <dest>
COPY [--chown=<user>:<group>] ["<src>",... "<dest>"]
ADD [--chown=<user>:<group>] <src>... <dest>
ADD [--chown=<user>:<group>] ["<src>",... "<dest>"]
Multiple resources may be specified but the paths of files and directories will be interpreted as relative to the source of the context of the build.
copy from other images
Optionally COPY accepts a flag --from=<name> that can be used to set the source location to a previous build stage (created with FROM .. AS <name>) that will be used instead of a build context sent by the user. In case a build stage with a specified name can’t be found an image with the same name is attempted to be used instead.
You can have as many stages (e.g. FROM … AS …) as you want. The last one is the one defining the image which will be the template for the docker container.
When using multi-stage builds, you are not limited to copying from stages you created earlier in your Dockerfile. You can use the COPY --from instruction to copy from a separate image, either using the local image name, a tag available locally or on a Docker registry, or a tag ID. The Docker client pulls the image if necessary and copies the artifact from there.
copy or add?
- According to the Dockerfile best practices guide, we should always prefer COPY over ADD unless we specifically need one of the two additional features of ADD (extract from URL or tarball)
Entrypoint and CMD
- CMD instruction allows you to set a default command, which will be executed only when you run container without specifying a command.
- ENTRYPOINT allows you to configure a container that will run as an executable.
- The difference is ENTRYPOINT command and parameters are not ignored when Docker container runs with command line parameters
- use of ENTRYPOINT sends a strong message that this container is only intended to run this one command
- Combining ENTRYPOINT and CMD allows you to specify the default executable for your image while also providing default arguments to that executable which may be overridden by the user (see the sketch after this list)
- exec form
ENTRYPOINT ["executable", "param1", "param2"]
- preferred, fewer surprises
- can be “extended” with CMD (with the CMD part overridable)
- shell form
ENTRYPOINT command param1 param2
- more surprises, like it always runs no matter the input
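A small sketch of combining the two (ping is just a stand-in executable):
ENTRYPOINT ["ping", "-c", "3"]
CMD ["localhost"]
Running the image with no arguments pings localhost; passing an argument overrides only the CMD part.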
EXPOSE
The EXPOSE instruction informs Docker that the container listens on the specified network ports at runtime. EXPOSE does not make the ports of the container accessible to the host.
The EXPOSE instruction does not actually publish the port. It functions as a type of documentation between the person who builds the image and the person who runs the container, about which ports are intended to be published.
RUN
- each RUN line adds a layer to the image
dockerignore
- file used to describe items which should not be included in ADD or COPY
golang container builds
- Want to avoid re-downloading all dependencies every build
- make use of layer-caching (see the sketch below)
- Multi-stage helps with final image size, but not performance. That seems buildkit related.
- docs
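A sketch of the layer-caching pattern for a Go build (base image, paths, and build flags are assumptions):
FROM docker.io/library/golang:1.21 AS build
WORKDIR /src
# copy module files first so the dependency download layer stays cached until they change
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /app .

# final stage only carries the binary, keeping the image small
FROM scratch
COPY --from=build /app /app
ENTRYPOINT ["/app"]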
create, start, and run
podman create --name container-1 ubuntu
create and name a container, also when you can supply CMD
- see container size with ps -a --size
- start a container (needs to be created first)
- run is the combo, creates a new container and then starts it
networking
Another container aspect that is extra complicated because I haven’t gone full containers.
- Use the host network --network=host
  - this does not follow the principle of least privilege, the container network namespace is shared with the host
- Expose host as 10.0.2.2 in the container
  - unclear how this magic works yet and/or if it’s better than option (1)
Using Compose as an abstraction layer, containers can be created on the same network and talk to each other. DNS requires a plugin. Install podman-dnsname from the arch repo.
Ports can be EXPOSED or PUBLISHED. Expose means the port can be reached by other container services. Published means the port is mapped to a host port.
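For example, publishing maps a container port to a host port at run time (image and ports are arbitrary):
podman run -d --name web -p 8080:80 docker.io/library/nginx
publish container port 80 on host port 8080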
compose
The yaml spec can be backed by podman instead of docker daemon.
- can setup a podman.socket and use unmodified docker-compose that talks to that socket but in this case you lose the process-model (ex. docker-compose build will send a possibly large context tarball to the daemon)
- which to use?
  - the podman-compose script parses compose YAML and runs podman commands, but it doesn’t handle everything the same as docker-compose (like it doesn’t replace containers by default on up)
  - podman-compose is also a one man project that appears to be losing ground
  - I like the idea of it instead of relying on a client-server model, but it’s not robust at the moment
$ systemctl --user enable --now podman.socket
creates a socket for the user
$ ls -la /run/user/1000/podman/podman.sock
srw-rw---- 1 njohnson njohnson 0 Aug 9 05:34 /run/user/1000/podman/podman.sock=
double checking
$ curl --unix-socket /run/user/1000/podman/podman.sock http://localhost/images/json
curl to ping the unix socket
The default docker socket is /var/run/docker.sock and it usually is owned by root, but in the docker group. Fun fact, /var/run is symlinked to /run on Arch.
podman system service is how to run podman in a daemon API-esque mode.
The podman-docker package provides a very light docker shim over the podman executable. The PKGDEST led me to the Makefile of podman which contains the logic to override docker. /usr/lib/tmpfiles.d/podman-docker.conf contains the symlink definition. tmpfiles.d is a systemd service to create and maybe maintain volatile files.
podman-docker is necessary for docker-compose exec
down
Should use the official down command (over ctrl-c) to ensure everything is cleaned up, else containers could be reused.
> docker-compose down
Stopping publisher ... done
Stopping bitcoind ... done
Stopping consumer ... done
Removing publisher ... done
Removing bitcoind ... done
Removing consumer ... done
Removing network lightning_lnnet
Volumes
Define in image (Dockerfile) or outside (docker-compose)?
- Lots of painpoints defining in image
- (when defined in image) unless you explicitly tell docker to remove volumes when you remove the container, these volumes remain, unlikely to ever be used again
- outside of an image has a lot of benefits: Not only can you define your volume, but you can give it a name, select a volume driver, or map a directory from the host
- The simplest way of making Docker data persistent is bind mounts, which literally bind a location on the host disk to a location on the container’s disk. These are simple to create and use, but are a little janky as you need to set up the directories and manage them yourself. Volumes are like virtual hard drives managed by Docker. Docker handles storing them on disk (usually in /var/lib/docker/volumes/), and gives them an easily memorable single name rather than a directory path. It’s easy to create and remove them using the Docker CLI.
- Volumes are helpful for saving data across restarts of your Docker containers.
- Bind mounts will mount a file or directory on to your container from your host machine, which you can then reference via its absolute path
Networks
expose vs ports
- ports: Either specify both ports (HOST:CONTAINER), or just the container port (a random host port will be chosen)
- expose: Expose ports without publishing them to the host machine - they’ll only be accessible to linked services. Only the internal port can be specified.
- In recent versions of Dockerfile, EXPOSE doesn’t have any operational impact anymore, it is just informative
- check if ports are bound on the local system with ss -tulpn
ENV and ARG
The ENV instruction sets the environment variable <key> to the value <value>
Or using ARG, which is not persisted in the final image
- environment args can be passed from docker-compose to Dockerfile, but might be best to think of them as runtime vs. buildtime and not mix them
- Setting ARG and ENV values leaves traces in the Docker image. Don’t use them for secrets
When building a Docker image from the command line, you can set ARG values using --build-arg
When you try to set a variable which is not ARG mentioned in the Dockerfile, Docker will complain.
When building an image, the only thing you can provide are ARG values, as described above. You can’t provide values for ENV variables directly. However, both ARG and ENV can work together. You can use ARG to set the default values of ENV vars.
- Use ARG for the simplicity unless you need to change the variable at runtime
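A small sketch of the ARG-feeds-ENV pattern (the variable name is made up):
# build-time value, overridable with --build-arg APP_VERSION=1.2.3
ARG APP_VERSION=dev
# persist it into the runtime environment of the final image
ENV APP_VERSION=${APP_VERSION}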
dependencies
- startup order
- tldr; make your tools more resilient to outages
troubleshooting
one-off exec
Sometimes you want to debug something on a running container.
podman exec c10c378bb306 cat /etc/hosts
one-off
podman exec -it fdea48095fe1 /bin/bash
shell process
Copy file from container to host
podman cp <containerId>:/file/path/within/container /host/path/target
new kernel
If getting weird errors, it might be because a new kernel was installed.
ERRO[0000] 'overlay' is not supported over extfs at "/var/lib/containers/storage/overlay"
duplicate mount points
- volumes used by other containers on the system could be causing issues
- can see all errors in systemd logs instead of last reported on CLI
podman system prune
WARNING! This will remove:
- all stopped containers
- all networks not used by at least one container
- all dangling images
- all dangling build cache
podman volume prune
- dockerfile vs. compose issue makes it seem that isn’t the case today