They would take their software out and race it in the black desert of the electronic night

Samourai Dojo

There is no greater solitude than that of a samurai, unless it is that of the tiger in the jungle...perhaps...

A guide for running Samourai Dojo with an external full node.

Why an external full node?

I am running a full node on my bare-metal box. I understand it and am not quite ready to migrate it to Dojo's docker-compose managed image. I like the idea of modularity, and I have other services which depend on the full node. Dojo can expose its managed full node, but that is a project for another day.

Configurations

Dojo has some advanced docs on running an external full node, but a few missing tips tripped me up.

The complexities boil down to full node configuration (e.g. /etc/bitcoin/bitcoin.conf) to make sure the Dojo docker images can talk to the node. The Dojo docker images are on their own "bridged" docker network, so it's not as simple as putting 127.0.0.1 everywhere.

Here is my real-life working configuration, with the password changed:

rpcport=8332
rpcuser=bitcoin
rpcpassword=topsecretpassword
rpcallowip=192.168.1.0/24
rpcallowip=172.28.1.2/16
rpcallowip=172.28.1.7/16
rpcbind=0.0.0.0
rpcthreads=4
rpctimeout=300
txindex=1
server=1
dbcache=300
zmqpubhashblock=tcp://0.0.0.0:9502
zmqpubrawtx=tcp://0.0.0.0:9501
  1. rpcallowip
    • There are three entries: 192.168.1.0/24 for LAN access, 172.28.1.2/16 for Dojo node (nodejs) access, and 172.28.1.7/16 for Dojo explorer access
  2. rpcbind
    • The node needs to listen on all interfaces (0.0.0.0), not just the local loopback 127.0.0.1
  3. zmqpubhashblock and zmqpubrawtx
    • Similar to rpcbind, the zmq settings also need to listen on all interfaces with 0.0.0.0
    • You know there is a problem with Dojo receiving the zmq updates when the nodejs logs keep mentioning 0 transactions processed (zmq is where that info comes from)
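
A quick sanity check that the RPC settings took effect is to query the node from another machine on the LAN. This is only a sketch: 192.168.1.50 is a made-up placeholder for the node's LAN address, and it assumes bitcoin-cli is installed on the machine you are testing from.

bitcoin-cli -rpcconnect=192.168.1.50 -rpcport=8332 -rpcuser=bitcoin -rpcpassword=topsecretpassword getblockchaininfo

If that returns chain info but Dojo still can't connect, the problem is more likely on the docker network side than in bitcoin.conf.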

It should be noted that these network settings are more open than the default 127.0.0.1-only setup, so take extra precautions in your router and firewall settings to make sure access is not granted to anything which shouldn't have it (e.g. the internet).
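
As an example, if ufw happens to be your firewall, rules along these lines keep the RPC and zmq ports reachable only from the LAN and the docker subnet (a sketch; adjust the subnets and ports to match your own setup):

sudo ufw allow from 192.168.1.0/24 to any port 8332 proto tcp
sudo ufw allow from 172.28.0.0/16 to any port 8332 proto tcp
sudo ufw allow from 172.28.0.0/16 to any port 9501:9502 proto tcp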

System manager

While not a great fit with docker-compose, I am using systemd to control starting and stopping Dojo.

[Unit]
Description=Samourai Dojo
Requires=docker.service
# Using external full node
After=bitcoind.service docker.service

[Service]
# the dojo script passes -d to docker-compose, so the start command exits once the containers are up (hence oneshot + RemainAfterExit)
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/lib/dojo/docker/my-dojo/dojo.sh start
ExecStop=/usr/lib/dojo/docker/my-dojo/dojo.sh stop
User=dojo
Group=dojo

[Install]
WantedBy=multi-user.target
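
With that unit saved somewhere like /etc/systemd/system/dojo.service (the path and unit name are my own choice, not anything Dojo prescribes), the usual systemd workflow applies:

sudo systemctl daemon-reload
sudo systemctl enable dojo.service
sudo systemctl start dojo.service
journalctl -u dojo.service -f   # follow the unit's logs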

One quirk that I am not sure is a big issue: the docker-compose processes appear as whatever user has UID 1001 when viewed from the host (e.g. when running ps aux).

Fitbit

I fight for the user.

I recently had to update the ol' resume since I am finally leaving Fitbit. I thought it turned out to be a pretty good story, so I'll capture it here too since the resume will change with time.

There is just one section under my experience header on the resume: Fitbit. I joined fresh out of school back in 2011 and had a crazy experience which ended with Google acquiring Fitbit in 2021. I never wanted to work for a company the size of Google and decided it was time to go, but let's go back to the beginning.

On my first day, my first task was to get a keyboard from the Radio Shack down the street (you can tell that time has passed). After that, my manager drew up a picture of the current backend architecture at Fitbit. It comprised a handful of Tomcat webapp servers and some MySQL databases. Maybe 10 boxes total on the whiteboard.

For the first few years at Fitbit I was on the only backend engineering team. We wore a lot of hats as we dealt with the incredible scaling challenges of a successful product. Boxes were added to the whiteboard, new technologies were introduced, corners were cut, lessons were learned, technologies were removed. Repositories were created, repositories were consolidated (monorepo!), Kafka and microservices were introduced (sad to report that as of 2021 the original "monolith" Java app was still running despite many heroic efforts), and NoSQL (Cassandra...:shakes-fist:) found its way onto the whiteboard. I think I personally only took down production once, when I removed an index on the friend invitation table (whoops).

During the middle years of my time at Fitbit, I held IC roles on the initial SRE and Developer Productivity teams which were created out of necessity. SRE was not really my thing, but I did develop a soft spot for dev/prod. I took that with me when I joined the Data Platform team which I stayed on until the Google acquisition. It was on Data Platform that I found my passion for simplifying tools and interfaces.

Fitbit was in the unique position where it had spent almost a decade collecting data and doing little more than presenting it back to the user (e.g. you walked x steps today!). All the data storage and encodings were optimized for these user point queries.

Another challenge was that the Big Data tooling at the time was (and for the most part, still is) geared towards serving ads, a domain where it is OK to get things wrong once or twice as long as it works in general. Fitbit researchers intended to analyze the Fitbit data and serve health recommendations, a domain where you cannot be wrong.

Lastly, initial algorithm creation attempts at Fitbit were painfully slow. Researchers developed algorithms in Python on their laptops. They would hand these off to a backend feature team who translated them into Java and deployed them to production. If there was a bug (there is always a bug), the process started over. This would take months.

The Data Platform team's mission was to "unlock" this existing data and enable our researchers to quickly develop quality models and algorithms. Over four years, three data warehouses, two data lakes, and many smaller iterations, we delivered.

I believe the team's greatest accomplishment was a platform which enabled our researchers to create an FDA-approved atrial fibrillation detection algorithm (and the potential to create more). Data Platform designed and delivered tools which had our researchers (who desperately wanted to sling Python notebooks on their laptops) writing high-quality Scala code, which they deployed to production with a CI/CD process. The iteration time for algorithm development was measured in minutes instead of months.

How did we do that? Not sure exactly, but I do remember one whiteboarding session early on where we drew up everything required to develop, deploy, and maintain an algorithm. There were a lot of boxes. We set out to find how we could hide them from the researchers. The answer was not immediately obvious, and this explanation makes it sound easier than it was, but by shifting things around we designed a "harness" for our researchers' code to live in. This wasn't revolutionary computer science, but the simplicity sure was satisfying.

My thoughts on my years at Fitbit can be summed up in a small story from 2020. When Covid hit, Pria and I decided that was the sign for us to move back to San Diego. We bought a house one freeway exit away from where my Grandma lived, which was nice because her health had been deteriorating and this made it easy to see her more often. One night after she had dinner with us, I was driving her home and telling her about how our afib algorithm development was coming along. When I dropped her off she told me in her thick Dutch accent "I should wear the Fitbit". Less than a month later she passed away from a stroke. This isn't a story about how I wished we had shipped the algorithm sooner, just that I know for sure that the work will help someone out there in a big way.

Coordinated Omission

Are you sure that we are awake? It seems to me that yet we sleep, we dream.

Performance testing is a roller coaster of emotion. The highs of a service looking invincible after it survives being pounded by the load generator. The lows of the service instantly dying in production.

One of the performance test gotchas that I have fallen for (a lot) is coordinated omission.

Performance Principles

Latency is the time it took an operation to happen. Each operation has its own latency. When testing the performance of a system, we care about all latencies, not some "common case" subset.

We often justify looking at the 95th percentile (or lower) because that is the "common case". However, when a user interacts with a system, a session usually involves multiple requests. For example, web pages will hit a server hundreds of times. What are the chances all of those requests fall at or below the 95th percentile? If one of the requests falls outside the 95th, how much does it dominate the session? Chances are, a lot. The higher percentiles determine the quality of a session and are what performance tests should focus on.
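
To put a rough number on that "chances" question (my own back-of-the-envelope, assuming independent requests), the odds of a session never touching the tail shrink fast:

// Toy back-of-the-envelope: the probability that every request in a
// 100-request session lands at or below the 95th percentile.
public class TailOdds {
    public static void main(String[] args) {
        int requestsPerSession = 100;
        double pAtOrBelowP95 = 0.95;
        double pCleanSession = Math.pow(pAtOrBelowP95, requestsPerSession);
        // Prints roughly 0.0059, i.e. less than a 1% chance the session
        // avoids the top 5% of latencies entirely.
        System.out.printf("P(clean session) = %.4f%n", pCleanSession);
    }
}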

What are You Measuring?

Coordinated omission involves measuring service time when you meant to measure request time.

Service time is the amount of time spent doing work for a request. Request time is the amount of time a user waited for a request to be complete. It is service time + any time spent queued up.

The following ascii art represents a load generator making requests to a system over time. Each request is represented by an x. The _ characters represent when the system goes down, for whatever reason, and the load generator is waiting for the last request it sent to return:

xxxxxxx_______xxxxx

When the load testing is complete, the data shows one really slow request and the rest really fast. The 95th percentile looks spectacular. This data is a good representation of service time, but a poor representation of request time. When this service hits production and goes down again, a lot more requests are going to be affected by the outage than what was captured by the load generator. The load generator was "coordinating" with the system to hide the system's faults. And it hides the cases we care about the most: the high percentiles.
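
One way to avoid the trap is to measure each request from when it was supposed to be sent, not from when the generator finally got around to sending it. A minimal single-threaded sketch (my own toy example, not any particular load testing tool):

import java.util.ArrayList;
import java.util.List;

// Constant-rate load generator that avoids coordinated omission by measuring
// from the *intended* send time. makeRequest() is a toy stand-in for the
// system under test.
public class LoadGen {
    public static void main(String[] args) throws Exception {
        long intervalNanos = 10_000_000L;   // target: one request every 10 ms
        long start = System.nanoTime();
        List<Long> latencies = new ArrayList<>();

        for (int i = 0; i < 1_000; i++) {
            long intendedSend = start + i * intervalNanos;
            makeRequest(i);                 // may stall far past the schedule
            long done = System.nanoTime();
            // Request time: measured from when the request *should* have gone
            // out. Measuring from the actual send time would hide everything
            // queued up behind a stall, i.e. coordinated omission.
            latencies.add(done - intendedSend);
        }

        latencies.sort(null);
        long p99 = latencies.get((int) (latencies.size() * 0.99));
        System.out.println("p99: " + (p99 / 1_000_000) + " ms");
    }

    static void makeRequest(int i) throws InterruptedException {
        // Toy stand-in for real work: mostly fast, with one long stall to
        // mimic the outage in the ascii art above.
        Thread.sleep(i == 500 ? 2_000 : 5);
    }
}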

A sign that your performance data might contain coordinated omission is a percentile graph shaped like a hockey stick.

Appendix

The red pill.

Dependencies

Operator! Get me the president of the world!

Dependencies are simultaneously the greatest and scariest part of software development.

They allow developers to reuse code, a pillar of software engineering. This can be code developed by another company, like the omnipresent Google Guava library. Or it can be an internal library that developers publish for other teams.

But dependencies tend to have dependencies. And managing this graph of dependencies is impossible.

The Diamond Problem

An example of this impossibility is the "Diamond Dependency Problem". Library A depends on libraries B and C. B and C both depend on library D, but different versions of D. What version of library D should library A use?

A +---> B ---> Dv1
  |
  +---> C ---> Dv2

Programmers have implemented different strategies to deal with the "Diamond Dependency" problem and none have solved it. Build tools like Gradle are able to recognize the problem. But after recognizing it, Gradle picks one of the versions (by default the newest) and tosses it on the classpath. There is no guarantee that version will work, and failures often surface at runtime.
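
If you would rather find out at build time than at runtime, Gradle can also be told to fail on version conflicts instead of silently picking a winner. A sketch using the Groovy DSL (the coordinate in the comment is made up):

// build.gradle -- fail the build when two versions of a dependency collide,
// instead of silently resolving to the newest one.
configurations.all {
    resolutionStrategy {
        failOnVersionConflict()
        // or make the choice explicit:
        // force 'com.example:libD:2.0.0'
    }
}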

NPM takes a different approach. Each dependency has a copy of all its dependencies and these copies are not shared. Library B has its own library D and library C has its own version of D. This is kicking the can down the road. Now developers discover breaking changes at runtime in horrible fashion. For example, library A asks for a model object from library B and passes it to library C. What if library D defines this model? There are now two versions of library D's model floating around which can lead to awful serialization problems.

Avoid Exposure

One extreme strategy to avoid dependency hell would be to treat each version of a dependency as a new dependency. The trade-off is massive overhead (updating all package imports) every time a dependency is upgraded. This is on the far end of the spectrum and probably not worth it.

The goal is to limit exposure to the set of problems caused by multiple versions of a dependency existing in a system.

If you control all applications which make up a system, a version of a dependency can be forced across the system. This is somewhere in the middle of the spectrum which has the wild west on one end, and each version as a new dependency on the other.

Another mitigation strategy is to limit the interface through which applications communicate. Instead of relying on a library directly, an application can make an RPC through a narrowly scoped interface such as a Thrift model. The Thrift IDL is designed specifically to keep a schema simple even as things evolve. This helps avoid library-defined model conflicts at runtime, but errors will still always be possible. The best hope is to limit them while taking advantage of dependencies.
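
For a feel of what that narrow interface can look like, here is a tiny hypothetical Thrift IDL (all names made up). The numbered, optional fields are what let the schema evolve without breaking either side of the RPC:

// user_summary.thrift -- hypothetical, deliberately small surface area.
// New optional fields can be added later without breaking existing callers.
struct UserSummary {
  1: optional i64 userId
  2: optional string displayName
}

service UserSummaryService {
  UserSummary getSummary(1: i64 userId)
}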

Leaky Concurrency

And I would play with fire to break the ice.

It was time to start breaking up the Monolith into microservices. The Monolith was a massive application which contained all our project's code and it was almost impossible to maintain. As we considered our options for our next generation tech stack, I heard the word "asynchronous" a lot.

We chose a microservice framework which exposed an asynchronous API. This was a radical change from our old tech, where each request had its own thread. Following the hip new trend, I thought our database driver should also use an asynchronous API.

There were performance and scaling benefits with the new tech, but there was also a large increase in bugs. What happened?

I needed to dive deep under the abstraction layers to understand what was going on. What is "asynchronous"? Is it worth it?

There Is No Thread

At the lowest level of computers everything is asynchronous. This was somewhat of a surprise after years of writing synchronous code. It probably has something to do with the world being asynchronous though.

The path from high level language code to bits on the wire is complex. But just knowing the gist is helpful.

Some synchronous code makes a blocking I/O call. The language runtime library translates that into some kernel commands. The kernel instructs a hardware device, through a device driver, to perform the actual I/O. At this point the kernel moves on and the device is busy sending signals.

Asynchronously.

When the device finishes its I/O it interrupts the kernel. The kernel makes a note to pass that message back up to user land. The language runtime is waiting for that signal so the synchronous code can continue its "thread" of execution.

So down at the lower levels there are no threads. Threads are a higher level abstraction that developers have been working on for the past fifty years.

Structured Programming

Back in the '70s the creation of this abstraction was a big deal. A debate raged between the computer science heavyweights on whether we should restrict our code to make it easier to reason about. This is when developers started to frown on GOTO jumps despite their power.

GOTO leads to spaghetti code which is hard to reason about. So developers created and adopted some structures to keep things simple. These exist in all major languages today. Things like control flow (if/then/else), code blocks, and subroutines (functions and call stacks).

This also solidified the causality of code. If you see the following: g();f(); you would assume that the function g is run before f. Programmers take this concept for granted these days.

Programmers have spent a lot of time and effort building up this "thread" concept. But these new fancy asynchronous APIs with their callbacks look an awful lot like a GOTO.

Performance and Scalability

Asynchronous implementations get sold on their performance and scalability. How much performance and scalability though, depends on the use case.

Let's take the case of a monolith application broken up into microservices. A service gains performance if it can query other services in parallel. And a service is more scalable if it can service hundreds of I/O requests on one thread since it takes less memory.

For the database driver, developers did not often query the database in parallel. But a service did perform hundreds of parallel I/O requests to the database. In our old tech stack, each request had its own thread. These threads would block on database I/O, wasting the memory resources they consumed.

The nature of this application meant it spent most of its time waiting on I/O. It would also spend some CPU marshalling data around, but was by no means CPU bound. This is a case where it makes sense to have one thread manage all this waiting and light CPU work.

So we use our limited resources more efficiently. But at what cost?

What Have We Lost

Our code is now full of callbacks. Callbacks shatter structured programming. Exception handling no longer works. Try-with-resources no longer works. And what's worse, these fail silently. The compiler isn't going to tell you that the code you wrote won't actually catch any exceptions.
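
A toy illustration of the silent part (my own example, not our actual driver code): the try/catch compiles fine, but an exception thrown inside the async callback never reaches it.

import java.util.concurrent.CompletableFuture;

public class SilentCatch {
    public static void main(String[] args) throws Exception {
        try {
            CompletableFuture
                .supplyAsync(SilentCatch::loadUser)       // runs on another thread
                .thenAccept(user -> System.out.println("got " + user));
        } catch (RuntimeException e) {
            // Looks like error handling, but this block can never run for
            // failures inside the callbacks above.
            System.out.println("caught: " + e.getMessage());
        }
        Thread.sleep(500);   // keep the JVM alive long enough to (not) see the error
    }

    static String loadUser() {
        throw new RuntimeException("database hiccup");
    }
}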

We have also lost the free back pressure. If ten threads are running synchronous code and the database hiccups, all ten threads will pause. They will no longer accept new work and this back pressure propagates upstream. Asynchronous code keeps accepting new work even though none is getting done.

Arguably the worst loss, however, is causality. While some asynchronous frameworks guarantee all code is run on one thread, removing a large set of concurrency bugs, it is not obvious in what order that code will be run. There are many different possible logical threads of execution. g();f(); no longer means what a developer thinks it does.

A Leaky Abstraction

Developers struggled with the loss of causality when migrating to the async API of the database driver. When was code being executed? And from where?

The callbacks were exposed as Futures (fancy callbacks) which readily accepted more GOTOs to be tacked on (in the form of functions). What wasn't obvious was that these GOTOs would be run by the underlying event loop implementation. Slowing down the event loop thread caused hard-to-debug problems.
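
A sketch of the trap using a plain CompletableFuture as a stand-in for the driver's future (names made up): work chained with thenApply typically runs on whichever thread completes the future, which in a real driver is its event loop, while thenApplyAsync moves the work onto a pool you control.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WhereDoesItRun {
    public static void main(String[] args) throws Exception {
        ExecutorService fakeEventLoop = Executors.newSingleThreadExecutor();
        ExecutorService appPool = Executors.newFixedThreadPool(4);

        CompletableFuture<String> driverFuture = new CompletableFuture<>();

        driverFuture
            .thenApply(row -> {
                // Runs on whichever thread completes the future -- here, the
                // "event loop" thread below. Keep this cheap.
                System.out.println("decode on " + Thread.currentThread().getName());
                return row.trim();
            })
            .thenApplyAsync(row -> {
                // Explicitly shifted onto the application pool.
                System.out.println("transform on " + Thread.currentThread().getName());
                return row.toUpperCase();
            }, appPool)
            .thenAccept(result -> System.out.println("done: " + result));

        // Simulate the driver's event loop completing the future.
        fakeEventLoop.execute(() -> driverFuture.complete("  some row  "));

        Thread.sleep(500);
        fakeEventLoop.shutdown();
        appPool.shutdown();
    }
}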

This burden of making sure code was run in the correct spot was new. And as code gets more complex, keeping track of these logical threads of execution becomes more difficult.

The thread has become a leaky abstraction.

An Old Hope

So I don't like asynchronous interfaces, but it's undeniable that there are cases where operating system threads are not the best concurrency model.

Maybe coroutines are the best of both worlds.

Each of these "threads of execution" has its own stack frame. They have all the same characteristics as normal threads, but have the potential to be run concurrently.

This isn't free though. The Rust language actually used to have coroutines as first-class citizens, but deprecated them. This is because not only must the compiler be able to turn functions into state machines, but a runtime is also needed to schedule these coroutines. Rust, being a low-level systems language, didn't want the burden of this runtime scheduler.

A language like Go doesn't mind it though. Maybe the future is here.

Monorepo

They would take their software out and race it in the black desert of the electronic night.

I was once part of a developer holy war.

The team could not decide how to organize our code. Should it live in one repository, a monorepo, or should there be a repository per project?

The war was ignited by a challenge we were facing: scaling developer productivity as we grew.

Everyone agreed that code organization could help combat this productivity loss, but which strategy should we take?

Much to the dismay of some, we ended up with the monorepo.

And it was the right call.

Productivity Breakdown

We developers face scaling challenges all the time. Some are easy to predict. Some application might work fine for ten users, but we wouldn't expect it to hold up to millions of users without some changes. Some challenges are not as obvious.

At one point in time, all the developers at Fitbit were in a single room. I took for granted a lot of properties that come with a team that size. We merged code straight to master and resolved conflicts in person. Even if other developers were not working on code related to a change, they had a good gut instinct about the effects it would have. This instinct allowed us to detect breaking changes before they got to production.

However, as the team grew, errors began to happen at what felt like an exponential rate in development and production.

It's tough to say at what team size the project began to degrade: 10 devs, 30 devs, or 100 devs. But changes that used to be easy began to require hours to coordinate and were error prone. The size of our team was taking a toll on productivity.

And that is when the monorepo versus multiple repo debate took off.

The Influence of Code Organization

Code organization has the potential to influence how easy or difficult it is for a developer to discover code, build code, and test code.

Discover: Where does this code live? Where is this code used?

Build: How do I build this project? How do I manage its dependencies?

Test: How do I test this code? How do I test the code that depends on this code?

Developer productivity would remain high, in the face of a growing team, if these tasks remained easy. So which code organization strategy influences these the most?

Discover

Finding usages of code is marginally easier in a monorepo, since all code can be grep'd at once. But simple tools applied to the multirepo approach produce the same effect.

Relative to the other tasks, it's a wash.

Build and Test

Building and testing code is where a monorepo shines, because a monorepo can enable faster failures and avoid technical debt.

To enable faster failures, we must leverage a monorepo's one inherent advantage: atomic commits. A developer can make a change affecting more than one project in one all-or-nothing (a.k.a. atomic) step. The multiple repository process to push out a change often follows the open source pattern. A developer patches a project and uploads it to a central location. At a later time, a dependent project pulls down the new version. There is a layer of indirection which forces the process to have multiple steps.

So to perform a library update with multiple repository code organization, waves of builds and tests have to be run. First, a developer patches the original library and publishes it to a central location. The downstream projects which depend on the library need to test the new version. A tool, or an unfortunate developer, needs to update the downstream projects and test them all. If there is a break, the original library patch needs to be rolled back.

And what about downstream projects of the downstream projects? The waves of building and testing continue. Each wave adds complexity and brittleness, especially in the face of rollbacks.

Using atomic commits in a monorepo, we avoid the waves of builds and tests. Instead of publishing a new version of the library and then coordinating testing of affected projects, we do it in one step. The library and all affected projects are tested on the revision containing the change. This allows dependent projects to fail fast on changes.

Avoiding Debt

If a developer is used to the open source, multiple repository model, this monorepo approach sounds like a lot of work. To update a library I have to update all dependent projects at the same time? Why me? The answer is you, because the developer best equipped to deal with a breaking change is the one making the change.

An Unnecessary Interface

At some level of scale it makes sense to break a monolith application into microservices. Microservices accept that the complexity of the system increases (more than one live version of code, service discovery, load balancing) compared to a monolith. But in this case, the complexity can be worth it.

Is there added complexity for multirepos? The trials of building and testing code exist, but there is also a social element. Conway's Law states that the structure of projects that people build reflects the social structure of the people that build them. In software engineering, this often manifests itself as code interfaces between projects. And these interfaces are often where bugs occur.

Multiple repository code organization encourages another interface within a system, whereas a monorepo discourages it. One less interface to cause problems.

Embrace the Monorepo

A monorepo has its faults and doesn't solve everything, but it has the higher potential to maintain developer productivity as a team grows.

P.S. Deploying

As soon as a project requires more than one machine to run on, it will have to deal with artifact versioning in production.

It sounds weird to have a monorepo with microservices, but deployment and code organization are orthogonal strategies.
