2024.11 Vol.1

// Metadata Leakage #P2p

Two parties are communicating over an untrusted channel, but they want all their data to remain hidden from others. This is the story of the internet. Luckily, encryption came along and made it possible to efficiently hide a lot of the data. But not all of it.

I think we can group all the data into three categories.

The first is the data content itself, whatever messages the two parties are exchanging. Once encrypted this is called the ciphertext.

The second is side-channel data. This is where the real world gets a little less contained than “two parties and a channel”. Side-channel data can be anything that sheds even just a little bit of light on the encrypted data. A third party could analyze traffic patterns between the two parties, or even electromagnetic emissions of a party’s CPU. The idea is to combine these side-channel data vectors to triangulate some information about the parties, maybe even enough to crack the data content.

The third category is metadata about the parties themselves. The big one on the internet is a party’s IP address. There are some scenarios where two parties would like to exchange messages, but don’t even want the counter-party to know their IP address. This is a tricky requirement since in order to communicate the counter-party needs to know where to send data…generally done with an IP address! This is kinda like asking someone to mail something to you, but refusing to give them an address.

The first two categories are very interesting, but largely mitigated with modern cryptographic schemes. The one I am more curious about at the moment is the metadata, since it seems almost impossible to keep it hidden entirely. There are some patterns today, but they leave a little to be desired.

The simplest solution is to add a proxy server in-between the two parties. This server simply reads data on one end and writes it on the other, and vice versa, but it hides the two parties from each other. Simple, but obvious downsides. It adds another hop to the network, lowering performance, and the proxy becomes a point of failure. And while privacy is maintained between the parties, the proxy now knows all the metadata. If the data is not encrypted, it even knows what the parties are talking about.
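To make the "reads on one end, writes on the other" idea concrete, here is a minimal sketch of such a forwarding proxy using Python's asyncio streams. The function names (`pipe`, `handle_client`, `start_proxy`) are my own for illustration. Note what the sketch makes obvious: the proxy opens its own connection upstream, so the upstream party only sees the proxy's address, while the proxy itself sees both sides.

```python
import asyncio


async def pipe(reader, writer):
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while data := await reader.read(4096):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()


async def handle_client(client_reader, client_writer, upstream_host, upstream_port):
    # The proxy dials the upstream itself, so the upstream never learns
    # the client's address. The proxy, however, knows both.
    up_reader, up_writer = await asyncio.open_connection(upstream_host, upstream_port)
    await asyncio.gather(
        pipe(client_reader, up_writer),  # client -> upstream
        pipe(up_reader, client_writer),  # upstream -> client
    )


async def start_proxy(listen_host, listen_port, upstream_host, upstream_port):
    """Start listening and return the asyncio server object."""

    async def on_client(reader, writer):
        await handle_client(reader, writer, upstream_host, upstream_port)

    return await asyncio.start_server(on_client, listen_host, listen_port)
```

Something like `server = await start_proxy("0.0.0.0", 9000, "example.org", 443)` followed by `await server.serve_forever()` would run it. The bytes pass through untouched, so if they are not already encrypted the proxy can read them.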

A more complex strategy is anonymity networks. These are networks of servers which two parties can proxy requests through. The networks can use a technique called onion routing where each server in a route across the network only knows about the hop before and after it. So compared to a single proxy server, the privacy is increased since the knowledge of the request is diffused across the network of machines. But the performance issues are further exacerbated since there are now multiple hops across the network. If a third party somehow had some “Eye of Sauron” power and was able to view the entire network, the network would then be equivalent privacy-wise to the simple proxy. Is the complexity of a network over just a proxy worth it? The number of nodes in a network probably determines its value, since the more nodes there are, the harder it is to be the eye.
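The "each server only knows the hop before and after it" property comes from wrapping the message in one encryption layer per hop. Below is a toy sketch of that layering. Everything here is an assumption for illustration: the cipher is a deliberately simple SHA-256 keystream XOR (not secure, real networks use authenticated encryption and negotiated per-hop keys), and the names `wrap` and `peel` are mine.

```python
import hashlib
import json


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """TOY cipher for illustration only: XOR with a SHA-256 counter keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


def wrap(route, payload: bytes) -> bytes:
    """Build the onion. `route` is a list of (hop_name, hop_key) in travel order."""
    packet = payload
    next_hop = "destination"
    # Work backwards so the first hop's layer ends up on the outside.
    for name, key in reversed(route):
        layer = json.dumps({"next": next_hop, "body": packet.hex()}).encode()
        packet = keystream_xor(key, layer)
        next_hop = name
    return packet


def peel(key: bytes, packet: bytes):
    """What a single hop does: remove its own layer, learn only the next hop."""
    layer = json.loads(keystream_xor(key, packet))
    return layer["next"], bytes.fromhex(layer["body"])
```

A hop that calls `peel` with its key sees where to forward the packet, but the body it forwards is still wrapped in the remaining hops' layers. It learns the previous hop only from the connection the packet arrived on.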

If a proxy server’s capabilities are extended to hold state, rather than simply forwarding requests statelessly, a request/response between parties can be broken up into two steps. The first request stores a message on the proxy, the second pulls it down. This pattern is called Store and Forward, kinda like a mailbox. If only two parties are using a given proxy, this doesn’t change anything. However, if the proxy is being used by many parties, the anonymity set grows. It becomes harder for an eavesdropping third party to work out which two requests belong to the same pair of parties.

But how do the two parties link the requests to each other? The sender might write a message with ID 123 and the receiver knows to pull down the message with ID 123. The big new requirement here is that the parties need to agree on an ID over an “out of band” channel (e.g. a QR code in real life). An anonymity network doesn’t require this out of band channel. An ID also makes it easy for the proxy runner to re-link the requests, and maybe even for an eavesdropper if the request ID metadata is leaking.

Assuming all the messages are really small, I wonder if a pull could just return all the messages. The calling party would only care about one message, a needle in the haystack, but this would de-link the parties. There are a lot of performance issues with that take, but perhaps some privacy-preserving filtering patterns like the ones used in BIP157/BIP158 could help mitigate things.
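Here is a rough sketch of both pull styles against an in-memory mailbox. It is only an illustration under my own assumptions: `Mailbox`, `seal`, and `find_mine` are hypothetical names, message bodies are assumed to be encrypted already, and the two parties are assumed to share a secret from the out of band exchange. In the haystack variant the receiver downloads everything and uses an HMAC tag to spot its own needle locally, so the pull carries no ID for the proxy to link.

```python
import hashlib
import hmac


class Mailbox:
    """In-memory stand-in for the stateful proxy (hypothetical)."""

    def __init__(self):
        self._slots = {}

    def put(self, slot_id: str, blob: bytes) -> None:
        self._slots[slot_id] = blob

    def get(self, slot_id: str):
        # ID-based pull: cheap, but the proxy can link this pull to the put.
        return self._slots.get(slot_id)

    def get_all(self):
        # Haystack pull: every caller receives every message, so the request
        # itself says nothing about which message the caller wants.
        return list(self._slots.values())


def _tag(shared_secret: bytes, body: bytes) -> bytes:
    return hmac.new(shared_secret, body, hashlib.sha256).digest()


def seal(shared_secret: bytes, body: bytes) -> bytes:
    """Prefix a (pre-encrypted) body with a 32-byte tag only the pair can verify."""
    return _tag(shared_secret, body) + body


def find_mine(shared_secret: bytes, blobs):
    """Scan the haystack locally and keep the bodies whose tag verifies."""
    mine = []
    for sealed in blobs:
        tag, body = sealed[:32], sealed[32:]
        if hmac.compare_digest(tag, _tag(shared_secret, body)):
            mine.append(body)
    return mine
```

The cost is plain to see: `get_all` returns the whole mailbox on every pull, which is where something like a compact filter would have to earn its keep.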

If a use case keeps the data messages small and can pass some information out of band, then a store and forward pattern is simpler and more private than an anonymity network. Ideally though, there are multiple proxy servers the parties can use so they don’t introduce a single point of failure to their system. So that out of band communication includes an ID for the channel, plus a handful of locations (servers) to meet at.
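As a sketch of what that out of band payload might look like, the snippet below packs a channel ID and a list of servers into a compact string that could be put in a QR code, and tries the servers in order when pulling. The `Rendezvous` structure, its field names, and the `pull` callback are all hypothetical; I am only assuming JSON plus URL-safe base64 as the encoding.

```python
import base64
import json
import secrets
from dataclasses import asdict, dataclass


@dataclass
class Rendezvous:
    channel_id: str  # the ID both parties agree on out of band
    servers: list    # a handful of meeting points, in preference order


def new_rendezvous(servers) -> Rendezvous:
    # A random 128-bit ID, so it cannot be guessed by other users of the server.
    return Rendezvous(channel_id=secrets.token_hex(16), servers=list(servers))


def encode(r: Rendezvous) -> str:
    """Serialize to a compact string suitable for a QR code."""
    raw = json.dumps(asdict(r), separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode(text: str) -> Rendezvous:
    return Rendezvous(**json.loads(base64.urlsafe_b64decode(text)))


def pull_with_failover(r: Rendezvous, pull):
    """Try each server in turn.

    `pull(server, channel_id)` is supplied by the caller (hypothetical); it
    should return the message bytes or raise OSError if the server is down.
    """
    for server in r.servers:
        try:
            return pull(server, r.channel_id)
        except OSError:
            continue  # this meeting point is unavailable, try the next one
    return None
```

With more than one server in the list, a single proxy going offline no longer breaks the channel, although the sender then has to decide whether to write to one server or to all of them.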

Could NOSTR relays be these meeting points? A NOSTR relay could choose to be built for specific use cases, like a temporary holder of relatively short lived (thinking like an hour) messages. A nice thing about NOSTR is the anonymity set could be really large if you introduce other use cases on the relay. NOSTR is also being built with ecash/lightning in mind, so one might also be able to layer in some financial incentive to run a meeting-point-relay. NOSTR clients are dead simple, just JSON messages sent over a WebSocket, so it should not be a large lift for implementing parties.
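To give a feel for how small the client side could be, here is a sketch that builds a short-lived event and the two JSON messages a client would send to a relay. It follows my reading of NIP-01 (event ID as the SHA-256 of a serialized array, `EVENT` and `REQ` messages) and NIP-40 (the `expiration` tag). Treat the rest as assumptions: the kind number is a made-up placeholder, Python's `json.dumps` only approximates the NIP-01 escaping rules, and the Schnorr signature (`sig`) a relay would require is left out because it needs a secp256k1 library.

```python
import hashlib
import json
import time

PLACEHOLDER_KIND = 4242  # made-up kind number for this sketch, not a registered kind


def event_id(pubkey: str, created_at: int, kind: int, tags: list, content: str) -> str:
    """SHA-256 over the NIP-01 serialization [0, pubkey, created_at, kind, tags, content]."""
    serialized = json.dumps(
        [0, pubkey, created_at, kind, tags, content],
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def short_lived_event(pubkey: str, content: str, ttl_seconds: int = 3600, now=None) -> dict:
    """Build an unsigned event that asks relays to drop it after ttl_seconds."""
    created_at = int(time.time()) if now is None else now
    tags = [["expiration", str(created_at + ttl_seconds)]]  # NIP-40 tag
    return {
        "id": event_id(pubkey, created_at, PLACEHOLDER_KIND, tags, content),
        "pubkey": pubkey,
        "created_at": created_at,
        "kind": PLACEHOLDER_KIND,
        "tags": tags,
        "content": content,
        # "sig" omitted: a real event must carry a Schnorr signature over the id.
    }


def publish_message(event: dict) -> str:
    """The text frame a client sends to store an event on a relay."""
    return json.dumps(["EVENT", event])


def subscribe_message(subscription_id: str, author_pubkey: str) -> str:
    """The text frame a client sends to pull matching events from a relay."""
    filters = {"authors": [author_pubkey], "kinds": [PLACEHOLDER_KIND]}
    return json.dumps(["REQ", subscription_id, filters])
```

The sender would push `publish_message(...)` to each agreed relay and the receiver would send `subscribe_message(...)`. The content is assumed to be encrypted before it goes into the event, and the pubkey would presumably be a throwaway one agreed out of band so it does not identify either party.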