// #Craft
The bip324 library is written in sans-I/O style, which means it doesn’t tie itself to a specific I/O interface: it doesn’t block and it doesn’t force async functions. It just handles the shared protocol logic, which blocking and async I/O drivers can then use under the hood.
The sans-I/O interface is one of buffer slices. The I/O driver pulls bytes from somewhere (e.g. a network socket), passes them to the sans-I/O library, gets back a new buffer of bytes, and pushes those bytes somewhere else. The largest burden of the sans-I/O interface is that there is now a function call where there usually isn’t. If one just assumes an I/O implementation (which is most of us, most of the time), this all usually happens in one function.
// Calculate required buffer size for encryption.
pub const fn encryption_buffer_len(plaintext_len: usize) -> usize

// Encrypt into caller-provided buffer.
pub fn encrypt(
    &mut self,
    plaintext: &[u8],
    ciphertext_buffer: &mut [u8],
    packet_type: PacketType,
    aad: Option<&[u8]>,
) -> Result<(), Error>
The OutboundCipher interface which encrypts the plaintext. The interface is all buffer slices.
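From the driver’s side, the call pattern is to size the ciphertext with encryption_buffer_len and hand both slices to encrypt. A minimal sketch of that pattern follows; the helper name encrypt_into is made up, and the PacketType::Genuine variant and the lack of AAD are assumptions, not necessarily the library’s defaults.

// Hypothetical driver-side helper: the caller owns the scratch buffer,
// the sans-I/O cipher only fills the front of it. The scratch buffer
// must be at least `needed` bytes or the slice below panics.
fn encrypt_into<'a>(
    cipher: &mut OutboundCipher,
    plaintext: &[u8],
    scratch: &'a mut [u8],
) -> Result<&'a [u8], Error> {
    // Ask the sans-I/O layer how many bytes the ciphertext requires.
    let needed = OutboundCipher::encryption_buffer_len(plaintext.len());
    // Encrypt a genuine (non-decoy) packet with no additional data.
    cipher.encrypt(plaintext, &mut scratch[..needed], PacketType::Genuine, None)?;
    // The driver then pushes this slice somewhere else, e.g. a network socket.
    Ok(&scratch[..needed])
}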
Now technically, sans-I/O does not require the interface implementation to have zero memory allocations. But zero allocations means the library is no_std (no standard library) compatible, which allows it to be used in embedded environments that do not have access to the standard library. For most of the bip324 library, it is just not that big a leap to go from the sans-I/O requirements to the no_std ones. And it is easy to add an allocation wrapper function around the no_std version which maintains the sans-I/O requirement.
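A minimal sketch of such a wrapper, assuming the OutboundCipher interface above; the name encrypt_to_vec and the alloc feature gate are illustrative, not the library’s actual API.

// Hypothetical allocating convenience wrapper around the no_std interface.
#[cfg(feature = "alloc")]
pub fn encrypt_to_vec(
    cipher: &mut OutboundCipher,
    plaintext: &[u8],
    packet_type: PacketType,
    aad: Option<&[u8]>,
) -> Result<Vec<u8>, Error> {
    // Size the buffer with the zero-allocation helper...
    let mut ciphertext = vec![0u8; OutboundCipher::encryption_buffer_len(plaintext.len())];
    // ...and delegate to the caller-provided-buffer interface.
    cipher.encrypt(plaintext, &mut ciphertext, packet_type, aad)?;
    Ok(ciphertext)
}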
But there is one tricky spot. Before the ciphers are completely fired up and a caller is just using the encrypt/decrypt operations, BIP-324 requires a handshake to be performed between the peers in order to establish the channel. The handshake, as described back in the Typestate Pattern log, is a non-trivial sequence of steps. Two aspects of the handshake, garbage bytes and decoy packets, are used to hide the shape of the traffic. Quite a bit of garbage is allowed to be sent, up to 4095 bytes. And technically, any number of decoy packets can be sent!
With the sans-I/O and no_std requirements in mind, the last step of the handshake, which carries these two large, unwieldy memory requirements, gets gnarly.
/// Success variants for receive_version.
pub enum HandshakeAuthentication {
    /// Successfully completed.
    Complete {
        cipher: CipherSession,
        bytes_consumed: usize,
    },
    /// Need more data - returns handshake for caller to retry with more ciphertext.
    NeedMoreData(Handshake<SentVersion>),
}
impl Handshake<SentVersion> {
    /// Authenticate remote peer's garbage, decoy packets, and version packet.
    ///
    /// This method is unique in the handshake process as it requires a **mutable** input buffer
    /// to perform in-place decryption operations. The buffer contains everything after the 64
    /// byte public key received from the remote peer: optional garbage bytes, garbage terminator,
    /// and encrypted packets (decoys and final version packet).
    ///
    /// The input buffer is mutable because the caller generally doesn't care
    /// about the decoy and version packets, including allocating memory for them.
    ///
    /// # Parameters
    ///
    /// * `input_buffer` - **Mutable** buffer containing garbage + terminator + encrypted packets.
    ///   The buffer will be modified during in-place decryption operations.
    ///
    /// # Returns
    ///
    /// * `Complete { cipher, bytes_consumed }` - Handshake succeeded, secure session established.
    /// * `NeedMoreData(handshake)` - Insufficient data, retry by extending the buffer.
    pub fn receive_version(
        mut self,
        input_buffer: &mut [u8],
    ) -> Result<HandshakeAuthentication, Error>
}
One version of the receive_version step which operates on a mutable buffer.
To complete the handshake, the local peer needs to receive all the garbage bytes by reading up to 4095 bytes in search of the previously agreed-upon garbage terminator. It then needs to keep reading any number of decoy packets, of any size, until it finds the version packet, which is used to negotiate any future upgrades of the channel. It also needs to authenticate those previously read garbage bytes with the first packet it reads, whether that packet is a decoy or the version packet.
Now here’s the thing. If we only supported the sans-I/O requirements and not the no_std ones, we could simply split this into two steps. Ask the caller to give a non-mutable buffer that they think contains the garbage and the terminator. If the function finds the terminator, it makes a copy of the garbage bytes since they are required in the next step (they are authenticated with the first packet). This is an allocation! In the next step, the caller is really just decrypting packets, although that higher level interface can’t be exposed directly since some future version negotiation might need to happen here. But it does make things easier, like returning the exact number of bytes required for the next decryption: either the rest of the length header, or the ciphertext once the length has been decrypted.
Are there ways to avoid the garbage allocation in order to keep it no_std? We could just put a 4095 byte array on the stack, but burning that much stack is not very friendly to the constrained environments no_std targets, which defeats the whole point.
Another way is what we have above: the receive_version function is designed to be “re-called”, with the caller extending the input_buffer if asked. The garbage is (re)found on every call. This keeps the interface as simple as possible for the caller, but it is still tricky.
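From the caller’s perspective, the re-call loop looks something like this. A sketch of a hypothetical blocking driver: the function name and chunk size are made up, and real socket error handling is elided.

// Hypothetical blocking driver for the single-function approach: read more
// ciphertext, extend the buffer, and retry until authentication completes.
// (A real driver would also handle EOF and socket errors.)
fn drive_receive_version(
    mut handshake: Handshake<SentVersion>,
    socket: &mut impl std::io::Read,
) -> Result<CipherSession, Error> {
    let mut input_buffer = Vec::new();
    let mut chunk = [0u8; 1024];
    loop {
        // Pull more bytes off the wire and extend the input buffer.
        let n = socket.read(&mut chunk).expect("socket read");
        input_buffer.extend_from_slice(&chunk[..n]);
        match handshake.receive_version(&mut input_buffer)? {
            // Handshake done, anything past bytes_consumed is already
            // post-handshake traffic and belongs to the application.
            HandshakeAuthentication::Complete { cipher, bytes_consumed } => {
                let _leftover = &input_buffer[bytes_consumed..];
                return Ok(cipher);
            }
            // Not enough data yet, the handshake is handed back to retry.
            HandshakeAuthentication::NeedMoreData(h) => handshake = h,
        }
    }
}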
Perhaps the operation should be split into two steps. The first attempts to find the garbage and the second focuses on the decoys and version. This increases the size of the API, but it potentially simplifies both steps.
The single-function-extend-input and dual-function approaches share a pain point: a buffer is over-read in search of all the bytes. That is why some sort of bytes_consumed must be returned to the caller, so that they can “reset” the input buffer. For the single function, this happens at the very end. For the dual function, this happens only after receiving the garbage, because after that the channel is encrypted and has length headers.
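With a growable buffer on the driver side, that reset is just dropping the consumed prefix. A tiny sketch, assuming a Vec-backed input buffer:

// Drop the prefix a handshake step already processed, keep the unread tail.
// bytes_consumed is whatever the step reported back.
input_buffer.drain(..bytes_consumed);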
A dual-function approach could hang on to a garbage reference between steps so the caller doesn’t have to manage that directly.
impl Handshake<SentVersion> {
    // Returns immutable ref to garbage, but a lifetime is tying things together.
    pub fn receive_garbage<'a>(self, input_buffer: &'a [u8])
        -> Result<(Handshake<ReceivedGarbage<'a>>, &'a [u8]), Error>
}

impl<'a> Handshake<ReceivedGarbage<'a>> {
    // Needs mutable ref for in-place decryption.
    pub fn receive_version(self, input_buffer: &mut [u8])
        -> Result<Handshake<Completed>, Error>
}
A new receive_garbage step which returns any non-consumed buffer?
I see a lifetime issue here though. receive_garbage wants to hang onto the found garbage so it can be authenticated in the next step. It is also returning the un-consumed part of the buffer with the same lifetime. That’s the buffer which has to be passed to the next step; it contains at least the first part of the decoys and version packet.
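To make the conflict concrete, the caller code would look roughly like this (hypothetical, and it does not compile): the handshake still holds an immutable borrow of input_buffer while the next step asks for a mutable one.

// Sketch of the caller code the borrow checker rejects.
let (handshake, remaining) = handshake.receive_garbage(&input_buffer)?;
let offset = input_buffer.len() - remaining.len();
// ERROR: cannot borrow `input_buffer` as mutable because it is also borrowed
// as immutable -- the handshake still holds the garbage slice.
let handshake = handshake.receive_version(&mut input_buffer[offset..])?;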
The receive_version step could switch back to the more conservative “decrypt into a new buffer” function instead of decrypting in place. This would keep all references to the buffer immutable, but it places a kind of unnecessary burden on the caller: allocating memory for packets which they don’t care about. That appears to be the tradeoff though: burden the caller with memory allocation vs. ask them to pass back the found garbage slice.
impl Handshake<SentVersion> {
    pub fn receive_garbage<'a>(self, input_buffer: &'a [u8])
        -> Result<(Handshake<ReceivedGarbage<'a>>, &'a [u8]), Error>
}

impl<'a> Handshake<ReceivedGarbage<'a>> {
    pub fn receive_version(self, input_buffer: &[u8], output_buffer: &mut [u8])
        -> Result<Handshake<Completed>, Error>
}
An output buffer might be required to satisfy lifetime safety, but maybe better to just ask the caller to copy the un-consumed bytes to a new buffer?
I might be overthinking it with the second element of the tuple in the garbage return: (Handshake<ReceivedGarbage<'a>>, &'a [u8]). Theoretically the Handshake<ReceivedGarbage<'a>> type could just expose the length of the captured garbage and the caller could perform their own buffer management. The second element is just a convenience, since buffer management is unavoidable unless the caller guesses the exact number of garbage bytes. That does happen to be easier today, since not a lot of implementations send garbage, but it is probably not something to bank on. So accepting that the caller needs to manage a buffer, what is the most helpful thing to return right away? It might just be a bytes_consumed usize. If they want a slice, they can very easily re-slice their buffer with that value. The usize is simpler than jumping straight to another reference to manage.
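A sketch of that leaner return, hypothetical and only following the reasoning above, not the library’s current API:

impl Handshake<SentVersion> {
    // Hold the found garbage internally and report only how many bytes of
    // the input were consumed (garbage plus terminator).
    pub fn receive_garbage<'a>(self, input_buffer: &'a [u8])
        -> Result<(Handshake<ReceivedGarbage<'a>>, usize), Error>
}

And the caller re-slices with the returned count:

let (handshake, bytes_consumed) = handshake.receive_garbage(&input_buffer)?;
// The unconsumed tail holds the start of the decoy and version packets.
let remaining = &input_buffer[bytes_consumed..];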