Grouping data by day for reports
When you have raw logs (especially something custom), you sometimes need custom reporting. Isn’t it annoying that those pesky business people want to know what’s going on? It’s also common to need day-to-day deltas… which is easy enough to do in Excel, but whatever.
WITH grouped_reports AS (
  SELECT
    account_id,
    SUM(clicks) AS `clicks`,
    MIN(cost_per_click) AS `cpc`,
    SUM(revenue) AS `revenue`,
    date AS `date`
  FROM reports
  GROUP BY
    account_id,
    date
)
SELECT
  *,
  clicks - LAG(clicks) OVER (PARTITION BY account_id ORDER BY date ASC) AS `deltaClicks`,
  ROUND(revenue - LAG(revenue) OVER (PARTITION BY account_id ORDER BY date ASC), 2) AS `deltaRevenue`
FROM grouped_reports
ORDER BY account_id, date ASC
Do we really need a CTE (the WITH statement) to group this data? Nah, but it’s nice to remember that it exists.
This is a common use case for data: revenue data comes in bucketed by time in a fairly raw form, but we want to crunch it, bucket it by day or month, and do some basic math to compare month-to-month or day-to-day performance.
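To see the whole pattern end to end, here’s a self-contained sketch using SQLite through Python’s sqlite3 module. The table name and columns mirror the query above; the sample rows are made up for illustration:

```python
import sqlite3

# In-memory database with a toy version of the `reports` table.
# The sample rows are invented just to exercise the query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reports (
    account_id INTEGER,
    clicks INTEGER,
    cost_per_click REAL,
    revenue REAL,
    date TEXT
);
INSERT INTO reports VALUES
    (1, 10, 0.50, 5.00, '2023-01-01'),
    (1, 5,  0.40, 2.00, '2023-01-01'),
    (1, 20, 0.45, 9.00, '2023-01-02'),
    (2, 8,  0.30, 2.40, '2023-01-01'),
    (2, 12, 0.25, 3.00, '2023-01-02');
""")

# Same shape as the query in the post: bucket by account and day in
# the CTE, then use LAG to compute day-over-day deltas per account.
rows = conn.execute("""
WITH grouped_reports AS (
    SELECT
        account_id,
        SUM(clicks)         AS clicks,
        MIN(cost_per_click) AS cpc,
        SUM(revenue)        AS revenue,
        date                AS date
    FROM reports
    GROUP BY account_id, date
)
SELECT
    *,
    clicks - LAG(clicks) OVER (PARTITION BY account_id ORDER BY date) AS deltaClicks,
    ROUND(revenue - LAG(revenue) OVER (PARTITION BY account_id ORDER BY date), 2) AS deltaRevenue
FROM grouped_reports
ORDER BY account_id, date
""").fetchall()

for row in rows:
    print(row)
```

The first day for each account has NULL deltas (there is no previous row inside that account’s partition), which is exactly what you want instead of a delta computed against some other account’s last day.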
How LAG and LEAD work
Using LAG and LEAD lets us do “work” on the previous or next row in the result set. Often, the “hacky” way to do this is to just iterate every row and do any computations server-side. Okay, fine, so long as you understand how this breaks at scale.
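As a tiny illustration (again SQLite through Python, with a made-up one-column table), LAG reaches back one row and LEAD reaches forward one, within the window’s ORDER BY; rows at the edges get NULL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (day TEXT, val INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("mon", 10), ("tue", 30), ("wed", 25)])

# LAG(val) is the previous row's val, LEAD(val) is the next row's.
# The first row has no previous and the last has no next, so those
# slots come back as NULL (None in Python).
rows = conn.execute("""
    SELECT day,
           val,
           LAG(val)  OVER (ORDER BY day) AS prev_val,
           LEAD(val) OVER (ORDER BY day) AS next_val
    FROM t
    ORDER BY day
""").fetchall()

for row in rows:
    print(row)
```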
For example, if you’re using PHP, every moment the processor iterates a super-duper long for loop, it’s busy. Too busy to serve other requests. Your server can be as slick as butter, then suddenly fail because some silly code is hogging a process.
So yeah, it is worth doing this in SQL.
Even Better: Reduce it Proper
This is okay, but it’ll break at scale, too. Try doing this with a few hundred million rows or decades of data. To be clear, I’m being sarcastic; I don’t want you to actually try something like this.
The point here is that the more processing you do outside of your server-side code, the better. That’s why a more robust data pipeline wouldn’t just compute all this in SQL. There are use cases where you need to, but if you have a crap-ton of data, you probably need to reduce it.
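One cheap version of “reducing it” is to materialize the daily rollup once into its own summary table, so the window math (and every report after it) runs over a tiny table instead of re-scanning the raw logs each time. A minimal sketch, with made-up table names and sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reports (account_id INTEGER, clicks INTEGER, revenue REAL, date TEXT);
INSERT INTO reports VALUES
    (1, 10, 5.00, '2023-01-01'),
    (1, 5,  2.00, '2023-01-01'),
    (1, 20, 9.00, '2023-01-02');

-- Reduce once: downstream reports query the rollup, not the raw logs.
-- In a real pipeline this would be refreshed on a schedule.
CREATE TABLE daily_rollup AS
SELECT account_id, date, SUM(clicks) AS clicks, SUM(revenue) AS revenue
FROM reports
GROUP BY account_id, date;
""")

rollup = conn.execute(
    "SELECT * FROM daily_rollup ORDER BY account_id, date").fetchall()

for row in rollup:
    print(row)
```

The same LAG query from earlier then works unchanged against daily_rollup, and it only ever touches one row per account per day.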
Conclusion: Analytics Will Never Die
People will always throw data into bespoke schemas, then do work to pull it back out again. This task will never die.
There’s a huge range of complexity between crunching crap very crudely in server-side code, implementing more SQL tricks, and reducing data with a robust pipeline — and each of these techniques is useful.