
Calculating revenue differences between days in MySQL

By Eric Koyanagi

I often think that I'm no expert with SQL, and that's probably true...but then I see examples of horrible fuckery with an ORM and have to question every bit of my life. I don't want to paint with too broad a brush, but I fear that some devs have avoided learning SQL entirely to lean on ORMs that do that "weird shit" for them.

That's actually "okay" for many CRUD operations -- where ORMs save you immense amounts of time writing boring, boilerplate SQL. But then someone decides to build a report with an ORM and it becomes a massive ball of pointlessness. The thing is, you can do a lot with a basic understanding of SQL, and mixing in just a few "advanced" techniques can give you a ton of power.

Bucketing Data by Time and Purging Duplicate Rows

This is a very common use case, where you need to find revenue or performance data, group it by time, and generate "deltas" that show changes over time. I'm not going to review how a basic GROUP BY works, but sometimes it can be tricky when dealing with actual performance data.

For example, let's say your data has issues with rare duplicated rows and you can't easily fix it in the source data. What you want is to simply group all duplicate rows together. If for some reason there's a field that you can't group by but you only want to extract one value, you can likely use the MIN function.

SELECT
    account_id,
    clicks as `clicks`,
    cost_per_click as `cpc`,
    revenue as `revenue`,
    date as `date`,
    COUNT(*) as `cnt`
FROM reports
GROUP BY
    account_id,
    clicks,
    cost_per_click,
    revenue,
    date
ORDER BY cnt DESC, account_id, date ASC

This is hardly a novel trick, but it's worth mentioning. It's also worth pausing here if there are any questions about the integrity of the source data. Maybe the data is driven by a queue that only guarantees "at least once" delivery, so rare duplicates are actually expected.

You may want to handle this step separately and save the results back into a table. There are a billion other (better) ways to guard against duplicates getting into the database to begin with (uh, hello, unique indexes), but there are also a billion valid use cases for why you might need to deduplicate rows with MySQL. I'm not here to judge. Well, I am, but I don't feel like it right now.
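If you do go the route of persisting a cleaned copy, here's a minimal sketch of what that could look like. The `reports_dedup` table name is my own invention, not part of the original schema, and the column list just mirrors the query above:

-- Sketch: write the de-duplicated rows into a fresh table
-- (`reports_dedup` is an assumed name, not part of the original schema)
CREATE TABLE reports_dedup AS
SELECT account_id, clicks, cost_per_click, revenue, `date`
FROM reports
GROUP BY account_id, clicks, cost_per_click, revenue, `date`;

-- Or guard the table itself with a unique index, so future loads can use
-- INSERT IGNORE and duplicates get silently skipped on the way in
ALTER TABLE reports_dedup
  ADD UNIQUE KEY uniq_report (account_id, `date`, clicks, cost_per_click, revenue);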

Calculating Revenue Changes per Day in MySQL (Window Functions)

The above query helps us "condense" duplicate rows together, which you may not even need. With that out of the way, we need another level of grouping to get the data grouped by date and account:

WITH grouped_reports as (
  SELECT
    account_id,
    SUM(clicks) as `clicks`,
    MIN(cost_per_click) as `cpc`,
    SUM(revenue) as `revenue`,
    date as `date`
  FROM reports
  GROUP BY
    account_id,
    date
)
SELECT
    *,
    ROUND(clicks - LAG(clicks) OVER (PARTITION BY account_id ORDER BY date ASC), 2) as `deltaClicks`,
    ROUND(revenue - LAG(revenue) OVER (PARTITION BY account_id ORDER BY date ASC), 2) as `deltaRevenue`
FROM grouped_reports
ORDER BY account_id, date ASC

First, we use a CTE (MySQL 8+) to pull our grouped data. We happen to know that when grouping like this, values like "CPC" will be the same for every row in a group -- so to avoid adding it to the GROUP BY explicitly, we can use the MIN function to extract a single value (technically you could do the same with AVG). But really, we could also just add CPC to the grouping, and that would be safer.

We use a CTE here, but it might be better to materialize `grouped_reports` as a real table for persistence, so that we have pre-computed, grouped data that's even easier to report back to users.
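As a rough sketch of what that could look like (the `daily_account_reports` table name is an assumption on my part; the columns mirror the CTE above):

-- Sketch: persist the pre-grouped data instead of re-computing the CTE each time
CREATE TABLE daily_account_reports AS
SELECT
    account_id,
    SUM(clicks) as `clicks`,
    MIN(cost_per_click) as `cpc`,
    SUM(revenue) as `revenue`,
    `date`
FROM reports
GROUP BY
    account_id,
    `date`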

Either way, we see how it's possible to operate on other rows in the result set, doing simple math against the grouped "clicks" values to show how these values change over time. You can use the function LEAD instead of LAG if you want to "look" at the next row instead of the previous one.
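For example, here's a sketch with LEAD, run against the same `grouped_reports` CTE (or the persisted table, if you went that route). Only the function changes; the window stays the same:

-- Same shape as the LAG query, but looking one row ahead within each account
SELECT
    *,
    ROUND(LEAD(revenue) OVER (PARTITION BY account_id ORDER BY `date` ASC) - revenue, 2) as `nextDayDeltaRevenue`
FROM grouped_reports
ORDER BY account_id, `date` ASC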

Yes, you can iterate the results in code and do subtraction outside the data layer...but then are you sure that's a good use of your server? Especially if you're dealing with a concurrency model where doing that math might keep a process locked for some time...? (I'm looking at you, PHP)

How LAG and LEAD work

You'll notice the OVER (...) clause is mandatory in the LAG/LEAD syntax. This is where we define the window: PARTITION BY splits the rows into groups (here, one per account) and ORDER BY controls how rows are ordered inside each partition, so LAG knows which row counts as "previous". There's a lot more we can do with these clauses, but for now all we need to understand is that the function works off an "ordered window" and only ever sees rows inside the current partition. We could use this idea, for example, to pull in stats grouped to a previous month or year. Maybe the performance data needs that broader context, so you can see the account's daily deltas alongside how they compare to a daily average computed over the last month/year/etc.
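Here's a sketch of that idea (the trailing 30-row window is an arbitrary stand-in for "the last month"). A frame clause lets each day be compared against a rolling average for the same account:

-- Sketch: compare each day's revenue against the average of the previous
-- 30 rows (roughly the prior month) for the same account
SELECT
    account_id,
    `date`,
    revenue,
    ROUND(revenue - AVG(revenue) OVER (
        PARTITION BY account_id
        ORDER BY `date` ASC
        ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
    ), 2) as `deltaVsTrailingAvg`
FROM grouped_reports
ORDER BY account_id, `date` ASC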

Using partitions helps us divide data into specific "windows" that "slide" along the data as rows are processed -- letting a query consider things like the previous and next row, but also giving us the power to partition and group data in more clever ways. This lets us compare data without querying for it separately (with subqueries, separate queries, or CTEs) and without doing these operations programmatically in code.

This tool can help improve ETL workflows, too, by partitioning data at the SQL level and breaking it into more manageable chunks for further processing.
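For example (just a sketch -- the four-way split is arbitrary), NTILE can tag rows with a bucket number so a downstream job can pick up one chunk at a time:

-- Sketch: split the grouped rows into 4 roughly equal buckets for batch processing
SELECT
    account_id,
    `date`,
    revenue,
    NTILE(4) OVER (ORDER BY account_id, `date`) as `chunk`
FROM grouped_reports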

Conclusion

It's really common to want data like this -- where we want to group it by time and compare it to previous or next rows in the result set. Window functions let us do that even in humble MySQL. For many use cases (but definitely not all), doing this in SQL is a great idea.


