
Push Decode

// #Rust

Hacking on the bip324 library has exposed me to the little details of moving bytes around. Given the whole network-of-peer-nodes thing, it is not surprising that there is a lot of this going on in the primitives of the rust-bitcoin libraries. And man, it has generated quite a bit of discussion. I want to lay out some terms, at least how I think of them, and then get into the subtleties of the bitcoin library.

Let’s talk encoding and serialization. I am not convinced the definitions of these are standardized across domains; they often get used interchangeably from what I can tell. So the following might just be my interpretation. But in any case, it is helpful for me to think it out.

  • Serialization // Convert a type into a format that can be stored and reconstructed later; the focus is on semantic preservation. Preserve data and relationships, with robust interoperability in mind.
  • Encoding // Transform data from one code to another with a focus on physical representation in a medium. With computers, that medium is usually binary. But this could be any set of symbols with meaning.

In a program, we are generally working with abstract types. At some point we usually want to send a type over space and time, be it saving it to disk or transmitting it on the wire. Now we could take a snapshot of how the type is represented in-memory on the local machine and just wing it. But the chances of that working on the other end, possibly on a different machine in a few years, are very low.

So first we describe to our counterparty (ourself in the future or a buddy) the structure of our type. What are its components and how do they relate? This is a serialization of the type. Next, we need to define how those components are physically represented, most likely in binary. This is the encoding.

Abstract Type → [Serialize] → Structure → [Encode] → Medium

My mental model.

Sometimes the serialize and encode steps are really obvious.

User Object → [JSON Serialization] → JSON Structure
                                   [UTF-8 Encoding] → Bytes
                                   [UTF-16 Encoding] → Different Bytes

Sometimes serialization and encoding steps are clearly separated.
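
To make that split concrete, here is a quick sketch using serde_json (assuming the serde and serde_json crates are around); the names are made up for illustration.

use serde::Serialize;

#[derive(Serialize)]
struct User {
    name: String,
    id: u32,
}

fn main() {
    let user = User { name: "satoshi".to_string(), id: 21 };

    // Serialization: the type becomes a JSON structure, held here as text.
    let json: String = serde_json::to_string(&user).expect("serializable");

    // Encoding: the same structure becomes concrete symbols in a medium.
    let utf8: &[u8] = json.as_bytes();                    // UTF-8 bytes
    let utf16: Vec<u16> = json.encode_utf16().collect();  // UTF-16 code units

    println!("{} UTF-8 bytes, {} UTF-16 code units", utf8.len(), utf16.len());
}

One serialization, two encodings.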

Other times though, as with a bitcoin p2p message, the binary encoding influences the structure to the point that the steps happen at the same time. A transaction has no middle representation; it goes straight to binary.

The separation of serialization and encoding enables flexibility at the cost of performance. With bitcoin’s focus on extreme space efficiency, critical exactness, and bandwidth sensitivity, it makes sense to ditch flexibility and just lock it all in.

With this in mind, the rust-bitcoin library has an Encodable trait which is implemented for each primitive type (e.g. transaction).

pub trait Encodable {
    /// Encodes an object with a well-defined format.
    ///
    /// # Returns
    ///
    /// The number of bytes written on success. The only errors returned are errors propagated from
    /// the writer.
    fn consensus_encode<W: Write + ?Sized>(&self, writer: &mut W) -> Result<usize, io::Error>;
}

The Encodable trait has just one function on it.

The consensus_encode name is a little ominous. The distinction from a plain encode is to reinforce that this is the super optimized encoding used to store and transmit bitcoin data. It is attempting to match the exact rules of Bitcoin Core. And these rules shouldn’t be messed with in any way, since the bytes produced are used throughout the bitcoin system in hashes and such. Nothing is stopping you from encoding a type into hex symbols, just be sure to only use that to display data or something.

Sticking with just the write half for now, a little further up in the encode module there is a helper serialize function defined.

pub fn serialize<T: Encodable + ?Sized>(data: &T) -> Vec<u8> {
    let mut encoder = Vec::new();
    let len = data.consensus_encode(&mut encoder).expect("in-memory writers don't error");
    debug_assert_eq!(len, encoder.len());
    encoder
}

The serialize wrapper function.

This is an interesting design decision I’d like to understand better. Coming from recently working on a sans-io based library, this is kinda the opposite of how I would have done it. Sans-io would push you to define a byte vector encoding for a type, and a helper function would perform the I/O to write those bytes out. Here, we have the serialize wrapper performing the “I/O” into a vector. The vector type implements bitcoin::io::Write (which is a re-export of std::io::Write in standard environments) and grows as needed.

What is nice about this order is it skips an intermediate buffer when writing bytes to a stream. A stream is a higher level abstraction, what the standard read and write traits define. You write bytes to a sink and read from a source, and the bytes just stream back and forth. No assumed structure. Under the hood, the stream could be a network socket, or a file, or even an in-memory buffer. In any case, a type’s fields are directly written to the stream instead of first being bundled together into a new vector of bytes and then shipped off to the stream.
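
Here is a toy example of that shape. The Ping type is made up, not something from rust-bitcoin, but it shows the field being written straight to whatever writer is handed in.

use bitcoin::consensus::Encodable;
use bitcoin::io::{self, Write};

struct Ping {
    nonce: u64,
}

impl Encodable for Ping {
    fn consensus_encode<W: Write + ?Sized>(&self, writer: &mut W) -> Result<usize, io::Error> {
        // The writer could be a socket, a file, a Vec<u8>, or a hash engine.
        // No intermediate buffer is built here.
        self.nonce.consensus_encode(writer)
    }
}

A toy Encodable implementation writing directly to the stream.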

I find “serialize” to be kind of a funny name for the serialize function given my understanding above. The types are being encoded and then written into a vector; this doesn’t seem like less work. But given that the serialize function is used for doing more work on the bytes, like creating a hash or signature, it kinda makes sense from that perspective that it’s a halfway point. Maybe I would have called the function some sort of encode_consensus_to_vec to make it clear it is still just consensus encoding? Little bit of a mouthful though.
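
As a quick usage sketch, here is roughly how the buffered bytes come in handy when they are all needed at once, say to hash them. The function name is mine, and the glob import is just to pull the hash traits into scope.

use bitcoin::consensus::encode::serialize;
use bitcoin::hashes::*;
use bitcoin::Transaction;

fn hash_of_serialization(tx: &Transaction) -> sha256d::Hash {
    // Every consensus byte lands in one Vec before any hashing happens.
    let bytes: Vec<u8> = serialize(tx);
    sha256d::Hash::hash(&bytes)
}

The buffered serialization feeding a hash.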

In any case, the encode-straight-into-a-stream approach makes sense for bitcoin’s v1 p2p protocol. If you have a really large Block message, it can be written out in chunks; the writer does not need to be aware of the whole message. Although when sending a p2p message, you first have to write out a header which contains the length of the message and a checksum. The library handles this with a two pass approach. First, write the message to a double SHA256 engine which returns the checksum and length. Now you can write the header. Then a second pass writes out the rest of the message. Neither pass requires a full new allocation of the message though, which is cool.
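
Here is a rough sketch of that two pass idea, not the library’s actual implementation. The magic and command parameters are simplified stand-ins for the real header fields, and it assumes the sha256d engine accepts consensus_encode writes through the library’s Write trait.

use bitcoin::consensus::Encodable;
use bitcoin::hashes::*;
use bitcoin::io::{self, Write};

fn write_v1_message<T: Encodable, W: Write>(
    msg: &T,
    magic: [u8; 4],
    command: [u8; 12],
    stream: &mut W,
) -> Result<(), io::Error> {
    // Pass one: stream the payload into a double SHA256 engine to learn its
    // length and checksum, no payload-sized buffer needed.
    let mut engine = sha256d::Hash::engine();
    let len = msg.consensus_encode(&mut engine)?;
    let checksum = sha256d::Hash::from_engine(engine).to_byte_array();

    // Header: magic, command, payload length, first four checksum bytes.
    stream.write_all(&magic)?;
    stream.write_all(&command)?;
    stream.write_all(&(len as u32).to_le_bytes())?;
    stream.write_all(&checksum[..4])?;

    // Pass two: encode the payload again, this time straight into the stream.
    msg.consensus_encode(stream)?;
    Ok(())
}

Two passes over the message, no payload buffer.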

This is harder when using the BIP-324 v2 p2p protocol. The v2 protocol contains an encrypted length prefix and then the encrypted packet which also holds an authentication tag from the AEAD. It is another encoding layer, not just a header. A separate cipher instance is used for the length prefix which buys us some flexibility. But you would still need to somehow know the length of the packet up front, which maybe requires adding a new function on Encodable. The second challenge is that the underlying chacha20poly1305 implementation would need a streaming interface where it could encrypt chunks and keep the poly1305 state updated as well. It would need to expose some sort of finalize function which then spits out the tag at the end. It is much simpler for the AEAD to expose just one encrypt function which returns a tuple of the encrypted data and the tag. Possible, but way more complex than the elegant v1 version.
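
Something like the following hypothetical trait is what I am picturing; to be clear, this is not an interface the chacha20poly1305 crate exposes today.

/// A hypothetical streaming AEAD interface, purely illustrative.
pub trait StreamEncrypt {
    /// Encrypt a chunk in place and absorb it into the running Poly1305 state.
    fn encrypt_chunk(&mut self, chunk: &mut [u8]);

    /// Finish the message and spit out the 16-byte authentication tag.
    fn finalize(self) -> [u8; 16];
}

What a streaming encrypt half might look like.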

Another pain point of the current Encodable is that it is tied to the standard library’s I/O traits. This seems really reasonable; however, the I/O traits are not in core, so they are not friendly in no-std environments. This is why rust-bitcoin has its own io module. But tying the trait to I/O also means standard I/O must be the I/O driver, which is not great for async runtimes which have their own I/O traits to drive.

So, there are at least three reasons why the bip324 library always uses the serialize function and buffers before writing.

  1. No length function on Encodable, or more generally, no way to find the length pre-encoding.
  2. The chacha20poly1305 AEAD library does not support the complex streaming encrypt interface.
  3. Async runtimes have their own I/O traits to drive.

Would there be a better way to define Encodable to help at least iron out #1 and #3? There has been work to replace Encodable with a “push based decode”. I believe this is another name for sans-io. A caller is responsible for coordinating the bytes between the encoding and the I/O, instead of having the processes tied together like the current Encodable trait. I am not sure I quite grasp the difference between “push” and “pull” in this context. I guess on the read half, the caller pulls from the I/O source and pushes into the decoder. And on the write half, the caller pulls from the encoder and pushes to the I/O sink? In any case, Kix is working on the push_decode library which I think helps mitigate, in some spots, the extra buffer that sans-io introduces. So long story short, maybe it helps with #3? I am not sure how much it buys for a BIP-324 streaming encoder.
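
For my own understanding, here is a sketch of the read half of that pattern. The PushDecode trait below is a stand-in of my own making, not the actual push_decode API.

use std::io::Read;

/// A hypothetical push-based decoder.
pub trait PushDecode {
    type Value;
    /// Push a chunk of bytes into the decoder's state machine.
    fn push_bytes(&mut self, bytes: &[u8]);
    /// Hand back the decoded value once enough bytes have been pushed.
    fn finish(self) -> Option<Self::Value>;
}

/// The caller owns the I/O: pull a chunk from the source, push it into the decoder.
fn drive<D: PushDecode, R: Read>(mut decoder: D, source: &mut R) -> std::io::Result<Option<D::Value>> {
    let mut buf = [0u8; 1024];
    loop {
        let n = source.read(&mut buf)?;
        if n == 0 {
            // End of stream, ask the decoder for whatever it has produced.
            return Ok(decoder.finish());
        }
        // Only this small fixed-size buffer sits between the I/O and the decoder.
        decoder.push_bytes(&buf[..n]);
    }
}

The caller pulls from the source and pushes into the decoder.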

On the decode half, a BIP-324 decoder cannot stream back decrypted bytes, because it needs to first read the whole packet to get the AEAD tag and authenticate the data. Maybe the chacha20poly1305 AEAD library could also be extended with a streaming decrypt interface, but that sounds even hairier when things go sideways. “I sent you back some bytes, but turns out, the tag was off so discard those last few!”.
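
Purely to illustrate that constraint, here is the shape of the read side; PacketCipher and its methods are hypothetical stand-ins, not the bip324 crate’s API.

use std::io::{self, Read};

/// A hypothetical packet cipher, just enough to show the buffering.
struct PacketCipher;

impl PacketCipher {
    fn decrypt_length(&mut self, _prefix: [u8; 3]) -> usize {
        unimplemented!("decrypt the 3-byte length prefix with the length cipher")
    }
    fn decrypt_packet(&mut self, _packet: &mut [u8]) -> Result<(), ()> {
        unimplemented!("decrypt in place and verify the 16-byte tag")
    }
}

fn read_v2_packet<R: Read>(reader: &mut R, cipher: &mut PacketCipher) -> io::Result<Vec<u8>> {
    // The encrypted 3-byte length prefix comes off the wire first.
    let mut prefix = [0u8; 3];
    reader.read_exact(&mut prefix)?;
    let len = cipher.decrypt_length(prefix);

    // The whole ciphertext plus the tag has to be buffered before any plaintext
    // can be handed back, since the tag authenticates the entire packet.
    let mut packet = vec![0u8; len + 16];
    reader.read_exact(&mut packet)?;
    cipher
        .decrypt_packet(&mut packet)
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "bad tag"))?;

    // Drop the tag, keep the plaintext.
    packet.truncate(len);
    Ok(packet)
}

No plaintext comes back until the whole packet is in and the tag checks out.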

Given that the BIP-324 encoder/decoder sits closer in line with the I/O sink/source, and both halves kinda require buffering, will push_decode help that much? If the lower level Encodable/Decodable traits are migrated to sans-io then they can no longer push/pull directly from the stream. The caller takes on this responsibility. And a naive implementation could use a type-sized buffer before shipping the buffer off to the I/O. This is not the most efficient use of memory; a smaller buffer is probably possible. But in a networking context, that maximum sized buffer is required anyway! So sans-io doesn’t add a performance cost, it’s already there.

But bitcoin isn’t only used in a network context. Many devices, like hardware wallets, are designed to actually never touch the network. These are contexts without the BIP-324 encoding/decoding, and they could be memory-usage sensitive given the hardware constraints. This might be the spot where a naive sans-io-to-I/O buffer would hurt. So it is probably where push_decode would shine, allowing the type encoders/decoders to be efficiently linked up (streamed) to the I/O drivers (std, async) while avoiding the buffer overhead.