Little bit of schema
Hack the planet
I spent a little over three years working on the Data Platform team at Fitbit. From a 10,000 foot perspective, our mission was to pivot Fitbit’s product focus from hardware to data. We believed the next great Fitbit product wasn’t a new tracker, but an algorithm built on top of the tracker data.
I metaphorically died on a hill that was part of this larger mission.
Setting the stage for the hill
A lot of tech companies made the leap from startup to established corporation around the same years as Fitbit (~2010s). I have compared notes with folks who worked at these sibling companies and Fitbit’s late adoption of data was not unique. Companies were “gut feeling” first, data second, and you can’t blame them given their success. At some point the power of data became undeniable and companies were convinced to invest more in data infrastructure and become “data first”. This led to monkey patching legacy systems to produce data and, generally, just a bad time.
This was the path Fitbit took. As leadership became convinced of the value of data, directives were given to small data teams to start producing, storing, and analyzing data. Any data.
Schemaless
It is pretty hard to analyze data if it isn’t being stored, much less produced. So step one for becoming data driven was to produce data.
In a small company, I am picturing the early days of Fitbit, this wasn’t too hard of a task. I (being one of a handful of backend engineers) sat next to the mobile engineer. If we wanted to capture a data event, maybe a user pressing the log food button, we could discuss the model for the event and have it in the code by the end of the day. The amount of data we were dealing with was small so we could just store it in the same database we were already using. The technical and social cost of this change was small.
At a large company, I am picturing the mid/later days of Fitbit, this was a hard task. From a technical perspective, we were at the point where storing data appropriately for analysis required separate technology from the existing legacy databases. From a social perspective, creating a new data stream now involved many teams and these teams had very different directives and priorities.
Mobile engineers were now split into a handful of feature teams and the same with backend engineers. As the number of these teams grew over the years, it also became necessary to create “lateral” platform teams which produced and maintained framework code shared by the feature teams. So a data stream had to go through at least 2 feature teams (mobile + data) and at least 2 platform teams (mobile + data) to end up in a new database maintained by the Data Platform team. At Fitbit, we were also uniquely dealing with firmware teams creating events from the trackers themselves. Getting all these teams on the same page about anything was hard, and also kinda new. Might sound weird, but until then these teams had functioned pretty well together by just integrating through long established APIs.
When the small data teams were asked to start analyzing non-existent data, they were desperate for these teams to start producing it. So desperate in fact, that they would accept any data. So they set up some event logging infrastructure with an API that literally accepted any JSON data. It would save it to storage and return an A-OK 200 no matter what. This endpoint was accessible to anyone on the internet, so the friction to create new events was basically 0.
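To make the shape of that API concrete, here is a minimal sketch in Python. It is not Fitbit’s code: the `ingest` function and the in-memory `STORAGE` list are hypothetical stand-ins for the real endpoint and storage layer. The point is only that nothing about the event is checked before it is saved.

```python
import json

# Hypothetical in-memory stand-in for the real event storage.
STORAGE = []


def ingest(body):
    """Accept any JSON document, save it, and answer 200.

    No event name, field, or type is checked: whatever parses as JSON
    is stored exactly as the producer sent it.
    """
    event = json.loads(body)
    STORAGE.append(event)
    return 200


# Two "events" that have nothing in common are both happily accepted.
ingest('{"evnt": "log_food", "cals": "95"}')
ingest('[1, 2, 3]')
```

Note that a misspelled key (`evnt`) and a number sent as a string (`"95"`) go through without complaint, which is exactly the kind of thing consumers only discover much later.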
The data teams were convinced to go this route for two reasons. For one, pressure to start producing. And the larger challenge: they were unable to convince 5+ teams to plumb new code through their systems when the benefits for those teams were hazy. But there were some downsides to cutting this corner. Massive, massive downsides.
Schematizing
Flash forward a few years at Fitbit and this schema-less API is now the backbone of capturing event data at Fitbit. Some might view this as a success, a sign that we were now a data driven company. However, producing data is really at most half the story. Maybe less if the data you are producing is shit.
The API was accepting terabytes of data a day. The cost to store and analyze this data was extremely high. Because event types often disappeared at random or meant something different than anticipated, teams had a hard time trusting the data, so the value produced was relatively low. Another unanticipated challenge was the roll out of GDPR and much higher data privacy standards. It is hard to delete a user’s data if you have no idea where that needle is in the terabytes-a-day haystack (or if it even exists).
The system clearly was broken, but providing just enough perceived value that it could not be flushed away (or preferably, launched into the sun). Over a 2 year period, each team that was mandated to maintain this system fell apart due to attrition. There were likely many causes for the attrition during this time at Fitbit, but the broken schema-less data capture system was definitely playing a part.
At this point it landed in my team’s (Data Platform) lap. And I was tasked specifically with trying to mitigate this tire fire (yea, I had feelings).
Dying on the hill
To me, the problem was clear. The schema-less events, while easy to produce, placed an extremely large burden on all other aspects of the system. This includes “data debt”, a form of tech debt, but way worse. Tech debt can be flushed away on a new deployment, but data debt remains until the data is migrated (worst case, it is never migrated and many versions of the same debt pile up).
If the data producer defined an event’s schema upfront (work for the producer) and maintained it going forward (more work) there are many benefits downstream:
- data discovery: schema informs data consumers (and future data producer maintainers) what an event means
- data compliance: schema is used to track sensitive data (useful with extra sensitive health data)
- data optimizations: schema is used by all storage systems to be more performant and efficient (read: faster and cheaper)
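As a rough illustration of the first two benefits, here is a small Python sketch of what an upfront schema could look like. Everything in it is an assumption made for the example: the `food_logged` event, its fields, and the `doc`/`sensitive` attributes are hypothetical, not the format Fitbit used.

```python
# Hypothetical schema for a "food logged" event. Each field carries a type,
# a description (discovery) and a sensitivity flag (compliance).
FOOD_LOGGED_SCHEMA = {
    "user_id": {"type": str, "sensitive": True, "doc": "Account that logged the food"},
    "food_name": {"type": str, "sensitive": False, "doc": "Food name as entered"},
    "calories": {"type": int, "sensitive": True, "doc": "Calories for the serving"},
}


def validate(event, schema):
    """Return a list of problems; an empty list means the event matches."""
    problems = []
    for name, spec in schema.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        elif not isinstance(event[name], spec["type"]):
            problems.append(f"wrong type for {name}")
    for name in event:
        if name not in schema:
            problems.append(f"unknown field: {name}")
    return problems


def sensitive_fields(schema):
    """List the fields a deletion or privacy request has to care about."""
    return [name for name, spec in schema.items() if spec["sensitive"]]
```

With something like this in place, a consumer can read the `doc` strings instead of guessing what a field means, and a compliance job can ask `sensitive_fields(FOOD_LOGGED_SCHEMA)` instead of searching the haystack. The storage optimizations are not shown here; they come from the storage systems knowing the field types ahead of time.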
While these are great upsides, the challenge is that they are not felt to the same degree by all the teams involved. This is especially true when data systems are monkey patched onto legacy systems, where there is a tendency to place the burden on a specific team instead of distributing it across the system.
So the goal was to introduce schema to events in order to clean up the data tech-debt. From a technical perspective this isn’t hard at all.
My first pass, however, was not successful. I defined a new schema “container” which could be embedded in the existing schema-less infrastructure. Downstream systems could recognize this schema and treat it differently than the schema-less events. This captured all the listed upsides, but required the upstream teams to augment their existing data producing clients. It also required teams to “translate” existing events into the new schema box. Here is where I really tripped up. In an effort to try and appease all stakeholders, this schema box became more and more complicated which made it less and less valuable (essentially tracking back to schema-less). I realized later that the watered-down schema container made the story more confusing and less compelling for stakeholders.
From a social perspective, each team needed to be convinced why this extra overhead would help them. Through trial and error, I learned that a different story had to be told to each type of team, highlighting the benefits and costs that they understood and would feel most directly. I had a moment of clarity when a data analyst asked me months into the initiative “Why are we even switching from JSON?”. Up to then, I had been telling too general of a story to all stakeholders. This analyst didn’t feel the compliance pains or the costs of storage, they only felt the discovery issues. At this point I realized the first pass was a failed initiative, having been watered down and poorly sold to stakeholders.
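For readers who want a picture of the container idea, here is a minimal Python sketch. The reserved `_schema` key, the `wrap` helper, and the `route` function are all hypothetical names invented for this example; the actual container format is not reproduced here.

```python
# Hypothetical reserved key marking an event that carries a schema container.
CONTAINER_KEY = "_schema"


def wrap(schema_name, version, payload):
    """Producer side: put a payload inside a schema container."""
    return {
        CONTAINER_KEY: {"name": schema_name, "version": version},
        "payload": payload,
    }


def route(event):
    """Downstream side: decide which path an incoming event takes."""
    marker = event.get(CONTAINER_KEY) if isinstance(event, dict) else None
    if isinstance(marker, dict) and "name" in marker and "version" in marker:
        return "schematized"
    return "schemaless"
```

The wrapped event is still plain JSON, so it rides through the existing schema-less API untouched. The cost sits with producers, who have to call something like `wrap` from every client and translate their existing events into it.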
My second pass was much more successful. Data Platform created a downstream system that, if told about any event and its schema, could take it from there. Events that flowed through this system captured all the upsides. This system also allowed producers to migrate data when ready and to define their specific use cases (it didn’t try to fit everything in a schema container). The interface to the system was kept extremely simple, allowing a different, but clear and concise, story for each stakeholder which highlighted how the new system helped them.
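A very small Python sketch of that “tell us about the event and its schema” interface might look like the following. The `EventRegistry` class, its `register` and `publish` methods, and the field-name-to-type schema format are assumptions for illustration only, not the real system’s API.

```python
class EventRegistry:
    """Hypothetical registry: producers declare an event name and its schema."""

    def __init__(self):
        self._schemas = {}  # event name -> {field name: expected type}
        self.accepted = []  # stand-in for downstream storage

    def register(self, event_name, schema):
        """Tell the system about an event and its schema."""
        self._schemas[event_name] = dict(schema)

    def publish(self, event_name, event):
        """Store the event only if it matches its registered schema."""
        schema = self._schemas.get(event_name)
        if schema is None:
            raise KeyError(f"no schema registered for {event_name!r}")
        if set(event) != set(schema):
            raise ValueError(f"fields do not match schema for {event_name!r}")
        for field, expected in schema.items():
            if not isinstance(event[field], expected):
                raise ValueError(f"{field!r} should be {expected.__name__}")
        self.accepted.append((event_name, event))


registry = EventRegistry()
registry.register("food_logged", {"user_id": str, "calories": int})
registry.publish("food_logged", {"user_id": "u1", "calories": 95})
```

The design choice worth noticing is that each producer registers its own event with its own schema whenever it is ready, instead of squeezing every use case into one shared container.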
By the time I left Fitbit, the schema-less API was still in production, but the schematized system was gaining adoption fast. Schema-less days were numbered.
#data