2023.04 Vol.3

Bit of cumulative metrics

Cumulative Metrics

Took a little deep dive in observability land as we upped our monitoring game. We made the switch to OpenTelemetry since we use GCP as our cloud provider and that is actually the recommended client (was a little surprised there was no “in house” equivalent). The complexity of OpenTelemetry (I guess we call it OTel, which is fun) can be overwhelming at first. But having maintained observability systems that grew from “write to this other database” to statsd to Prometheus, I appreciate all the issues OTel is solving for us.

But it is still confusing as hell.

Client --> Collector --> Backend

The measurement path through the monitoring system

In the simplest system, a client lives in an application and shoots metrics over to a backend. The backend handles merging the streams coming in from multiple clients, long-term storage, and visualizing the metrics. The client focuses on making it easy for devs to create valuable metrics, keeping performance overhead low for the app, and avoiding dropped data.

The requirements for storing metrics are pretty different from the ones for generating them, and the two are usually at odds with each other. For example, if the client has to be aware of how metrics are serialized in the backend, it gets harder for developers to create valuable metrics, since now they also have to consider how the metric is being stored. To avoid this, it’s justified to bust up the system, create some abstractions, and allow components to specialize.

OTel settled on three main components: the client, the collector (sometimes called exporter in specific instances), and the backend. The client and backend are completely separated, allowing them each to focus on their requirements. OTel does a good enough job of this that the backend is interchangeable. A stack could write to Prometheus one day and switch to GCP Monitoring the next. The cost of this abstraction is the new component: the collector. The collector handles buffering multiple streams of data and converting the measurements from OTel models to the backend equivalents. This is what allows the client to focus on just getting measurements to the collector and the backend to focus on storing them.
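To make the split concrete, here is a toy sketch of the three roles in plain Python. None of this is the OTel API: `Client`, `Collector`, `Backend`, and `InMemoryBackend` are names I made up for illustration, and the "conversion" step is just summing values per metric name. The point is only that the client never sees the backend, so the backend can be swapped without touching application code.

```python
from typing import Protocol


class Backend(Protocol):
    """Anything that can store a named value (hypothetical interface)."""

    def write(self, name: str, value: float) -> None: ...


class InMemoryBackend:
    """Stand-in backend that keeps rows in a list."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, float]] = []

    def write(self, name: str, value: float) -> None:
        self.rows.append((name, value))


class Collector:
    """Buffers measurements from many clients, then hands them to one backend."""

    def __init__(self, backend: Backend) -> None:
        self.backend = backend
        self.buffer: list[tuple[str, float]] = []

    def receive(self, name: str, value: float) -> None:
        self.buffer.append((name, value))

    def flush(self) -> None:
        # Merge the buffered streams per metric name before writing.
        merged: dict[str, float] = {}
        for name, value in self.buffer:
            merged[name] = merged.get(name, 0) + value
        for name, value in merged.items():
            self.backend.write(name, value)
        self.buffer.clear()


class Client:
    """Lives in the app; only knows how to reach the collector."""

    def __init__(self, collector: Collector) -> None:
        self.collector = collector

    def add(self, name: str, value: float = 1) -> None:
        self.collector.receive(name, value)


backend = InMemoryBackend()
collector = Collector(backend)
app_one, app_two = Client(collector), Client(collector)

app_one.add("errors")
app_two.add("errors", 2)
collector.flush()
```

Swapping `InMemoryBackend` for some other class with a `write` method is the only change needed to point the stack at a different store.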

While this is a great abstraction, it does lead to some confusing scenarios.

OTel (and monitoring systems in general it seems these days) supports three types of metrics: gauge, delta, and cumulative.

  • gauge – “At x time the value was y”, “At 2023-04-24 the temperature was 73 degrees”
  • delta – “Between x and y the value changed z”, “Between 2023-04-25 and 2023-04-26 it rained 2 inches”
  • cumulative – “Since x the value is y”, “Since 2023-01-01 it has rained 42 inches”
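One way to keep the three kinds straight is to look at which timestamps each one carries. The sketch below models them as small Python dataclasses; the class and field names are my own and are not taken from OTel or any backend.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GaugePoint:
    """'At x time the value was y': a single timestamp."""

    at: date
    value: float


@dataclass(frozen=True)
class DeltaPoint:
    """'Between x and y the value changed z': a window with two ends."""

    start: date
    end: date
    change: float


@dataclass(frozen=True)
class CumulativePoint:
    """'Since x the value is y': a fixed start and a running total."""

    since: date
    value: float


# The three examples from the list above.
temperature = GaugePoint(at=date(2023, 4, 24), value=73.0)
rain_delta = DeltaPoint(start=date(2023, 4, 25), end=date(2023, 4, 26), change=2.0)
rain_total = CumulativePoint(since=date(2023, 1, 1), value=42.0)
```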

It is pretty straightforward to think of monitoring scenarios for these three types. What might not be as obvious is that you can convert metrics from one kind to another. And what really wasn’t obvious to me is that it usually makes sense to convert a metric’s kind between the three main system components (client, collector, and backend). Don’t assume that just because you define what seems to be a clear-cut delta-like metric in the client that it’s going to be stored as a delta in the backend, but also don’t worry! Cause you can always convert it back for visualizing or alerts.
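The delta and cumulative conversion is just a running sum in one direction and a difference of neighbors in the other. A minimal sketch, assuming evenly spaced measurements and a counter that never resets (real systems have to handle resets); the function names and the sample numbers are made up for illustration.

```python
from itertools import accumulate


def deltas_to_cumulative(deltas):
    """Running total: each output is the sum of every delta so far."""
    return list(accumulate(deltas))


def cumulative_to_deltas(totals, start=0):
    """Difference between neighbors; the first one is measured from `start`."""
    deltas = []
    previous = start
    for total in totals:
        deltas.append(total - previous)
        previous = total
    return deltas


errors_per_minute = [3, 0, 5, 2]  # made-up delta measurements
running_total = deltas_to_cumulative(errors_per_minute)
round_trip = cumulative_to_deltas(running_total)
```

Going there and back returns the original list, which is why it is safe for a component to store whichever kind suits it.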

I think the delta-to-cumulative scenarios are the most interesting (plus, they’re the ones that tripped me up). Let’s say you create a “counter” metric in the client in order to count all the errors of a certain function. The counter has a simple one-function interface, Add(x), which is used to add a measurement to the counter, for example, add one more error. It makes sense that the client takes a delta strategy, sending delta measurements to the collector. The collector, though, converts it into a cumulative. Why? The collector is optimizing for data reliability. Let’s say you have 100 clients dumping delta metrics into the collector and the collector goes down. If the collector is just forwarding along a combined delta stream to the backend, then there will be a big window of lost data until the collector comes back online. But if the collector is using a cumulative metric, it’s possible for the backend to approximate the data during the outage window (“we don’t know exactly when these 30 errors happened, but it was within this window”). The backend can then “unravel” the cumulative metric into a delta metric that is easier to understand in visualizations and alerts (there might be a bit of accuracy lost in this translation, but usually we would trade that for no massive data loss). Don’t fear the cumulative.
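Here is a toy simulation of that outage. It is a simplification, not how any real collector works: I model the outage as "two exports never reach the backend" and assume the running total itself survives (for example, because it is kept upstream of whatever failed). All names and numbers are invented, chosen so that 30 errors land in the outage window.

```python
from itertools import accumulate

true_deltas = [4, 6, 10, 20, 0]                # errors per interval
delivered = [True, True, False, False, True]   # intervals 2 and 3 are lost

# Delta stream: a lost export is simply gone.
errors_seen_as_deltas = sum(
    delta for delta, ok in zip(true_deltas, delivered) if ok
)

# Cumulative stream: each export carries the running total so far.
totals = list(accumulate(true_deltas))
received = [
    (interval, total)
    for interval, (total, ok) in enumerate(zip(totals, delivered))
    if ok
]


def unravel(points):
    """Turn (interval, total) points back into (first, last, change) windows."""
    windows = []
    previous_interval, previous_total = -1, 0
    for interval, total in points:
        windows.append((previous_interval + 1, interval, total - previous_total))
        previous_interval, previous_total = interval, total
    return windows


windows = unravel(received)
errors_seen_as_cumulative = received[-1][1]
```

The delta stream ends up 30 errors short. The cumulative stream reports all 40, and the last unraveled window says "30 errors somewhere between intervals 2 and 4", which is exactly the accuracy-for-completeness trade described above.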