2022.10 Vol.1

Bit of analytics aggregation

Where to Aggregate Analytics?

Gaining analytic insights from data often comes down to counting events in a time series. For example, “how many times did event A happen this month after we enabled feature X?”. There are an infinite number of questions to ask, but when the CPUs start churning and the bytes start moving, it’s almost always just counting things.

Big Data (read: buzzword for offline aggregation) analytic tools like Data Studio often make it obvious that they are trying to make these questions easier to model. Easier than writing raw SQL against a data warehouse. Oftentimes, though, this goal is hindered by a second motive: making it cheaper to ask these questions.

The number one way to make analytic queries cheaper is to scan less data. Since analytic events are almost always a time series, it’s pretty straightforward to follow this principle. If we want to know how often event A happened this month, we shouldn’t bother reading any data from before then; doing so wastes CPU cycles (a.k.a. money). In a SQL query, this is as simple as a WHERE clause on a datetime column. Things get more opaque at the abstract tooling level. There, a user just cares about finding their insight, not so much about how the tool is going to do it as cheaply as possible. So the tool asks questions that let it run the query more efficiently, but in an attempt to stay out of the user’s way, it often ends up being more confusing than just coming straight out and saying, “Hey, we don’t want to query all this crap, so what data should we cut out?”.
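To make the WHERE-clause point concrete, here’s a minimal sketch using Python’s built-in sqlite3 and a hypothetical `events` table (the table name and columns are made up for illustration; a real warehouse would be BigQuery or similar, but the SQL shape is the same):

```python
import sqlite3

# In-memory stand-in for a warehouse table of time-series events.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (name TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("A", "2022-09-28"),  # last month: should not be scanned
        ("A", "2022-10-03"),
        ("A", "2022-10-14"),
        ("B", "2022-10-05"),  # different event
    ],
)

# The datetime predicate is what lets the engine skip data outside
# the window instead of reading (and billing for) everything.
count = conn.execute(
    """
    SELECT COUNT(*) FROM events
    WHERE name = 'A' AND created_at >= '2022-10-01'
    """
).fetchone()[0]
print(count)  # 2
```

In SQLite the filter just skips rows, but in a warehouse with date-partitioned tables the same predicate prunes whole partitions, which is where the real savings come from.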

Lots of times there are multiple ways to do something in these complex analytic tools, and it’s not clear which way should be preferred. Whenever I run into this, I just go with the route that I think will help the tool cut out more data.

In Data Studio, a data source can be a BigQuery table OR a custom query. Which should I use? For context, Data Studio allows users to visualize data and exposes knobs to update the graphs in real time. What did this event look like last month? What about last year? If these graphs are backed by a static custom-query data source that scans the same data no matter what, then CPU cycles are probably being wasted. If the graphs are instead backed by a table and Data Studio is able to issue optimized queries that cut out data, cycles are saved. So unless a custom query can be modified to scan less data, a table data source should be preferred over a custom query.
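The difference can be sketched in a few lines, again with sqlite3 and a hypothetical `events` table (I’m only illustrating the query shapes here, not Data Studio’s actual internals): a static custom query is a fixed string, while a table-backed source lets the tool push the dashboard’s date controls down into the query it issues.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (name TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("A", "2021-10-01"), ("A", "2022-10-03"), ("A", "2022-10-14")],
)

# A static custom query scans the same rows no matter what the
# dashboard's date controls say.
static_query = "SELECT name, created_at FROM events"
rows_scanned = len(conn.execute(static_query).fetchall())  # all 3 rows, always

# Backed by the table directly, the tool can inject the selected
# date range and scan only what the graph actually needs.
def tool_issued_query(start, end):
    return conn.execute(
        "SELECT COUNT(*) FROM events WHERE created_at BETWEEN ? AND ?",
        (start, end),
    ).fetchone()[0]

print(rows_scanned)                                   # 3
print(tool_issued_query("2022-10-01", "2022-10-31"))  # 2
```

With the static query, narrowing the dashboard’s date range only filters what’s displayed after the full scan; with the table-backed source, the range shrinks the scan itself.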