Playing With Window Functions In MySQL 8
Earlier this week, I went through the High Performance SQLite course by Aaron Francis. Among the many SQL topics that he covered, he included something that I've never tried before: Window Functions. These functions allow you to calculate data based on the current row. These functions behave somewhat like a subquery that is locked-down to the existing result-set. And, which can be further subdivided using a partition column.
To start exploring these Window Functions in MySQL, I'm going to use a common table expression (CTE) of email addresses. The derived table will be constructed using the super awesome VALUES
/ROWS
feature of MySQL:
WITH
email_list ( email ) AS (
VALUES
ROW ( 'ben@alpha.example.com' ),
ROW ( 'kit@alpha.example.com' ),
ROW ( 'joanna@beta.example.com' ),
ROW ( 'sam@charlie.example.com' ),
ROW ( 'molly@charlie.example.com' ),
ROW ( 'steve@charlie.example.com' ),
ROW ( 'bob@charlie.example.com' ),
ROW ( 'jan@beta.example.com' ),
ROW ( 'laura@alpha.example.com' ),
ROW ( 'hal@delta.example.com' )
)
Now that we have our email addresses in the CTE table, email_list
, let's add a generated id
column. For this, we'll use the ROW_NUMBER()
window function. ROW_NUMBER()
returns the number of the current row (1-based) within the current partition. However, since we want this number to be unique across the entire list of emails, we'll omit a partition column and only provide an ORDER BY
clause:
ROW_NUMBER() OVER ( ORDER BY email ASC )
The OVER
keyword indicates that the ROW_NUMBER()
is going to be calculated over a set of related rows (ie, that this is a window-based function). Then, the ORDER BY
defines how each set of related rows should be sorted before the ROW_NUMBER()
is applied.
For good measure, we're also going to calculate the email domain for use in a subsequent window function:
WITH
email_list ( email ) AS (
/* .. */
),
user AS (
SELECT
ROW_NUMBER() OVER ( ORDER BY email ASC ) AS id,
email,
SUBSTRING_INDEX( email, '@', -1 ) AS domain
FROM
email_list
)
SELECT
*
FROM
user
;
When we SELECT
from this intermediary table, we get the following output:
As you can see, the ROW_NUMBER()
window function gave us an incrementing ID across the entire set of email addresses.
Now, let's use another window function to calculate the ID of each user locally within the relevant email domain. For this, we'll use the DENSE_RANK()
window function:
DENSE_RANK() OVER ( PARTITION BY domain ORDER BY email ASC )
This is essentially like a ROW_NUMBER()
function that resets within each partition. The PARTITION BY domain
tells MySQL that within the result-set we're going to group rows by domain
first. Then, within each group, we're going to first sort the rows by email ASC
and then apply the DENSE_RANK()
window function.
Aside: In fact, using the same
PARTITION BY
andORDER BY
, theROW_NUMBER()
window function would give us the same result as theDENSE_RANK()
window function. But, it's more fun to try a different function.
WITH
email_list ( email ) AS (
/* .. */
),
user AS (
/* .. */
),
ranked AS (
SELECT
u.id,
u.email,
u.domain,
DENSE_RANK() OVER ( PARTITION BY domain ORDER BY email ASC ) AS domainID
FROM
user u
)
SELECT
r.domain,
r.domainID,
r.id,
r.email
FROM
ranked r
ORDER BY
r.domain ASC,
r.domainID
;
If we now SELECT
from this ranked
table, we get the following output:
As you can see, the DENSE_RANK()
window function gave us a domain-specific incrementing ID for each partitioned set of users.
Window functions seem very powerful. I doubt that I'll use these function in my core application CRUD (Create, Read, Update, Delete) queries. But, I suspect that—much like the common table expression—they will be very helpful in one-off analytics and report-style queries.
Want to use code from this post? Check out the license.
Reader Comments
Hmmm, really glad you're delving deeper into this topic. I've never really grepped windows functions very well and my understanding was that they're most useful dealing with rolling averages and ranking search results.
@Chris,
Yeah, I read a little about the rolling average stuff in the MySQL docs. I think a lot of this stuff, too, is something that I would normally do in my application code after pulling the records back into the ColdFusion app. After all, you're still pulling back all the records - it's not like a traditional app where you're rolling the rows up into the aggregates; with the window functions you're still bring back all the rows; only, you're also asking the DB to perform some additional calculations.
I hear a lot of people say that you should ask the DB to do the stuff that it's good at - like crunching numbers. But, I don't think it's that simple. You only have 1 DB (usually); but, you might have many application instances. By moving the calculations into the application server, you are distributed the load.
As always, there's no one way to do it, just a different set of trade-offs.
@Ben Nadel,
Agreed, there are no perfect solutions...only trade-offs! I have had one business case for a windows function that would have been a lot more effort in the app code. I wish I could recall the circumstance, but alas...I do not :(
And if/when I ever need to do rolling averages again, I think I'll try to lean on these windows functions more...if only to understand them better.
Great article! 🙌
Post A Comment — ❤️ I'd Love To Hear From You! ❤️
Post a Comment →