2021.01 Vol.3

Bit of coordinated omission

Are you sure that we are awake? It seems to me that yet we sleep, we dream.

Performance testing is a roller coaster of emotion. The highs of a service looking invincible after it survives the load generator. The lows of that service instantly dying in production.

One of the performance test gotchas that I have fallen for (a lot) is coordinated omission.

Performance Principles

Latency is the time an operation takes to complete. Each operation has its own latency. When testing the performance of a system, we care about all latencies, not some “common case” subset.

We often justify looking at the 95th percentile (or lower) because that is the “common case”. However, when a user interacts with a system, a session usually involves multiple requests. For example, a web page can hit a server hundreds of times. What are the chances that every one of those requests falls within the 95th percentile? If even one request falls outside it, how much does it dominate the session? Chances are, a lot. The higher percentiles determine the quality of a session and are what performance tests should focus on.
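As a back-of-the-envelope check on that question, here is a minimal sketch. It assumes each request's latency is independent of the others (real sessions are rarely that tidy), and the function name chance_all_within is made up for illustration:

```python
def chance_all_within(percentile: int, requests: int) -> float:
    """Probability that every request in a session lands at or below
    the given percentile, assuming the requests are independent."""
    return (percentile / 100) ** requests


if __name__ == "__main__":
    # How likely is a session to never see a request worse than p95?
    for n in (1, 10, 100, 200):
        print(f"{n:>4} requests: {chance_all_within(95, n):.2%}")
```

Under that assumption, a session of 100 requests has less than a 1% chance of staying entirely within the 95th percentile, so nearly every session sees at least one of the slow ones.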

What are You Measuring?

Coordinated omission involves measuring service time when you meant to measure request time.

Service time is the amount of time spent doing work for a request. Request time is the amount of time a user waited for a request to complete. It is service time plus any time spent queued up.

The following ASCII art represents a load generator making requests to a system over time. Each request is represented by an x. Each _ represents time when the system is down, for whatever reason, and the load generator is waiting for the last request it sent to return:

xxxxxxx_______xxxxx

When the load testing is complete, the data shows one really slow request and the rest really fast. The 95th percentile looks spectacular. This data is a good representation of service time, but a poor representation of request time. When this service hits production and goes down again, a lot more requests are going to be affected by the outage than what was captured by the load generator. The load generator was “coordinating” with the system to hide the system’s faults. And it hides the cases we care about the most: the high percentiles.
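The effect can be reproduced with a toy simulation. This is a sketch under made-up assumptions, not a model of any real load generator: one request is intended every 10 ms, the system normally answers in 1 ms, and it freezes for one second in the middle of the run. The generator only sends a request once the previous one has returned. For each request we record both what the generator measured (from the moment it actually sent) and what a user arriving on the intended schedule would have waited.

```python
INTERVAL_MS = 10        # intended gap between requests (assumed)
SERVICE_MS = 1          # normal time to serve one request (assumed)
STALL_START_MS = 1000   # the system freezes here (assumed)
STALL_END_MS = 2000     # and recovers here (assumed)
DURATION_MS = 3000      # length of the test (assumed)


def finish_time(start_ms):
    """Completion time of a request that reaches the system at start_ms."""
    if STALL_START_MS <= start_ms < STALL_END_MS:
        start_ms = STALL_END_MS  # no work happens until the stall ends
    return start_ms + SERVICE_MS


def run():
    """Return (measured, waited): latencies as seen by the blocking
    generator, and latencies relative to the intended send schedule."""
    measured, waited = [], []
    clock = 0
    for intended in range(0, DURATION_MS, INTERVAL_MS):
        sent = max(intended, clock)       # blocked until the last reply
        done = finish_time(sent)
        measured.append(done - sent)      # timer starts at the actual send
        waited.append(done - intended)    # timer starts when it was due
        clock = done
    return measured, waited


def percentile(values, pct):
    """Nearest-rank percentile, pct given as an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling
    return ordered[rank - 1]


if __name__ == "__main__":
    measured, waited = run()
    for pct in (50, 95, 99, 100):
        print(f"p{pct}: measured {percentile(measured, pct)} ms, "
              f"on schedule {percentile(waited, pct)} ms")
```

In this toy run the generator's own numbers contain exactly one slow request, so its 95th percentile stays at 1 ms. Measured against the intended schedule, every request that was due during the stall, or while the generator was catching up afterwards, shows up as slow, and the 95th percentile is hundreds of milliseconds.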

A sign that your performance data might contain coordinated omission is a percentile graph shaped like a hockey stick: flat across most percentiles, then a sharp jump at the very top.

Appendix

The red pill.