Shared Resources

// #Rust #Craft

We ran into a classic design trade-off while thinking through a small tweak for how kyoto handles new peer gossip. The use case is pretty simple and contained, which makes it easier to analyze the design patterns at play.

The Bitcoin peer-to-peer protocol has addr and addrv2 messages for sending peer connection information to other peers. These can be requested with a getaddr message, but can also just be randomly gossiped between peers. As of kyoto v0.11.0, when a peer sends over an addr message, it bubbles up through the network module to the top-level Node. Node processes this by asking the network module (through the PeerMap) to store the new peer info for later. Ideally, though, Node just focuses on light client things; it doesn’t really care about how the network state is managed. We don’t want it to be interrupted every time a peer gossips some new peer info.
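
To make the shape of the problem concrete, here is a rough sketch of that flow. The names (GossipedPeer, PeerEvent, handle_event) are stand-ins, not kyoto's actual types: the point is that the gossiped addrs travel all the way up to the node just to be handed straight back down.

```rust
use std::net::IpAddr;

// Connection info carried in an addr/addrv2 message.
struct GossipedPeer {
    addr: IpAddr,
    port: u16,
}

// Events the network module bubbles up to the node.
enum PeerEvent {
    NewPeers(Vec<GossipedPeer>),
    // ... other protocol events
}

struct PeerMap {
    known_peers: Vec<GossipedPeer>,
}

impl PeerMap {
    fn store_peers(&mut self, peers: Vec<GossipedPeer>) {
        // Stand-in for persisting to the peer store.
        self.known_peers.extend(peers);
    }
}

struct Node {
    peer_map: PeerMap,
}

impl Node {
    fn handle_event(&mut self, event: PeerEvent) {
        match event {
            // The node is interrupted just to hand the data straight back down.
            PeerEvent::NewPeers(peers) => self.peer_map.store_peers(peers),
        }
    }
}
```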

The addr messages can be handled internally by the network module. We simply want to store the info in the peer store so we can connect to the new peers later if required. The only thing left to decide is who should handle this responsibility. We narrowed it down to two options.

  1. Push a shared persistence handle down into the Peers so they can short-circuit the addr processing and just add it to the database.
  2. A background task one layer up from the Peers in the PeerMap (the peer manager) which owns a channel receiver to pull off messages and push them into the database.

There is a ton of context for each of these, so let’s start with #1. Each peer instance is running in its own tokio task, so it is possible that each peer is running on its own thread; it is up to the runtime how it divvies up the tasks. This implicit implementation detail needs to be kept top of mind when sharing resources between tasks. Even if the persistence handle is implemented in a thread-safe way, the instance needs to be shared between peers with an Arc. As we saw with the channels, tasks add complexity to tracking ownership, and in this case it is necessary to punt that tracking to the runtime since many dynamic owners will be holding the handle. And here, the persistence handle is not thread-safe, so it needs a Mutex wrapper as well, since each task could be running on a different thread. The std library mutex implementation could be used, but it blocks the underlying thread while a task waits for the lock; it is not “async” aware. Tokio has its own mutex implementation which is unsurprisingly async-aware, but comes with a heavier runtime cost. Given that this lock is being held to perform I/O (which could take a long time), it is a safer bet to use the tokio version.
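
A minimal sketch of option #1, assuming a hypothetical PeerStore and Peer rather than kyoto's real types: every peer task holds a clone of an Arc<tokio::sync::Mutex<_>> around the persistence handle and writes gossiped addrs directly, short-circuiting the trip up to the node.

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

struct GossipedPeer {
    addr: std::net::IpAddr,
    port: u16,
}

// Not thread-safe on its own, hence the Mutex wrapper.
struct PeerStore {
    peers: Vec<GossipedPeer>,
}

impl PeerStore {
    async fn insert(&mut self, peer: GossipedPeer) {
        // Pretend this does real (possibly slow) I/O.
        self.peers.push(peer);
    }
}

struct Peer {
    // Shared, async-aware handle cloned into every peer task.
    store: Arc<Mutex<PeerStore>>,
}

impl Peer {
    async fn on_addr(&self, gossiped: Vec<GossipedPeer>) {
        // The tokio mutex yields instead of blocking the thread while another
        // task (possibly on another thread) holds the lock during I/O.
        let mut store = self.store.lock().await;
        for peer in gossiped {
            store.insert(peer).await;
        }
    }
}
```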

The network module is not a behemoth of code, but it still feels a little sketchy to toss around a bunch of new references to a shared resource like this. When does it break down? And what does that look like? I think there are two things to keep an eye on when running with this pattern. The first is the cognitive load of ownership and responsibilities. If a bunch of different parts of an app have the same read/write interface to this store, in the long run they will all start relying on the full interface and extending it. It will become difficult to reason about small changes, which suddenly spread across the whole app due to this link. The second is hidden control flow. As the number of locks grows, the number of different patterns of lock/unlock grows as well. There is an implicit connection between all the lock points. For example, grabbing a lock over here in the network module now ties it to whatever is going on in the chain module. This connection is not obvious from the layout of the code.
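
A contrived sketch of that hidden coupling, again with made-up names: two otherwise unrelated modules lock the same shared store, so how long one holds the lock quietly affects the other, and nothing at either call site says so.

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

struct PeerStore;

async fn network_task(store: Arc<Mutex<PeerStore>>) {
    let _guard = store.lock().await;
    // Long-running I/O while holding the lock...
}

async fn chain_task(store: Arc<Mutex<PeerStore>>) {
    // Invisible at this call site: this waits on whatever network_task is doing.
    let _guard = store.lock().await;
}
```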

Both of these issues can be managed by keeping the usage relatively contained. But are there ways to avoid it completely? That was the direction I explored in #2, where the persistence handle is clearly owned by one thing, the PeerMap. Ownership is clear and control flow is clear: total control and visibility. A lot of the coordination complexity has been dumped on the channel abstractions which connect the PeerMap and the Peers. But it ain’t all good. This requires a background task to be fired up to process the peer events. It is generally a bad call to fire up a task, a side effect, in a builder. So the PeerMap interface needs to be extended with a lifecycle-esque function that the caller kicks off when it is ready for the task to fire. The original goal of this change was to hide the peer management from the calling Node, but now the Node is exposed to the lifecycle of the PeerMap. What if this task starts to grow in complexity? It might require another shutdown function so resources can be cleaned up.
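
Here is a minimal sketch of option #2, again with hypothetical types: peers only get a cheap mpsc sender, the background task owns the store outright, and the start method is the lifecycle wart described above.

```rust
use tokio::sync::mpsc;

struct GossipedPeer {
    addr: std::net::IpAddr,
    port: u16,
}

struct PeerStore {
    peers: Vec<GossipedPeer>,
}

struct PeerMap {
    // Cloned and handed to each Peer; sending is all a peer can do with the store.
    tx: mpsc::Sender<Vec<GossipedPeer>>,
    // Taken by `start`; present only before the background task is spawned.
    worker: Option<(mpsc::Receiver<Vec<GossipedPeer>>, PeerStore)>,
}

impl PeerMap {
    fn new(store: PeerStore) -> Self {
        let (tx, rx) = mpsc::channel(64);
        Self { tx, worker: Some((rx, store)) }
    }

    fn gossip_sender(&self) -> mpsc::Sender<Vec<GossipedPeer>> {
        self.tx.clone()
    }

    // The lifecycle wart: a builder should not spawn a task as a side effect,
    // so the caller has to kick this off once the runtime is up.
    fn start(&mut self) -> Option<tokio::task::JoinHandle<()>> {
        let (mut rx, mut store) = self.worker.take()?;
        Some(tokio::spawn(async move {
            while let Some(gossiped) = rx.recv().await {
                // The task is the sole owner of the store; no Arc, no Mutex.
                store.peers.extend(gossiped);
            }
        }))
    }
}
```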

I realized I was drifting towards the actor pattern, something I hadn’t really thought about since my Scala days in microservice land. The tradeoff of ownership for lifecycle management is classic. And we are already using both patterns in kyoto; for example, the Node itself is an actor which owns its data and communicates with messages. There are so many subtle tradeoffs between the two that I doubt there is a golden rule for when to switch from one to the other, but I will be curious to see if we ever reach it here.
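
For reference, the actor shape in miniature. This is a generic sketch rather than kyoto's actual Node: the actor task exclusively owns its state, and the rest of the world only ever holds a sender into its mailbox.

```rust
use tokio::sync::{mpsc, oneshot};

enum Message {
    AddPeer(String),
    PeerCount(oneshot::Sender<usize>),
}

struct Actor {
    peers: Vec<String>,
    rx: mpsc::Receiver<Message>,
}

impl Actor {
    async fn run(mut self) {
        while let Some(msg) = self.rx.recv().await {
            match msg {
                Message::AddPeer(p) => self.peers.push(p),
                Message::PeerCount(reply) => {
                    let _ = reply.send(self.peers.len());
                }
            }
        }
    }
}

// Caller side: no locks, no shared state, just a handle to the mailbox.
fn spawn_actor() -> mpsc::Sender<Message> {
    let (tx, rx) = mpsc::channel(32);
    tokio::spawn(Actor { peers: Vec::new(), rx }.run());
    tx
}
```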