Using Multiple Common Table Expressions In One SQL Query In MySQL
A couple of years ago, I upgraded this blog from MySQL 5.7.10 to MySQL 8.0.28. This gave me access to newer SQL features like Common Table Expressions (CTE) and JSON column types. But, my blog requires little more that CRUD (Create, Read, Update, Delete) operations; so, there's not much of need for me to write advanced SQL statements.
At work, however, we recently upgraded to MySQL 8. And this finally gave me an opportunity to start playing around with more advanced concepts due to the large number of one-off reports that I have to run. A couple of weeks ago, I looked at using VALUES
and ROW
constructs to create a common table expression. And, as a follow-up to that, I just learned that you can have multiple CTEs within a single MySQL query. And, that these CTEs can reference each other. So freaking cool!
To demonstrate, I'm going to build-up a SQL query using common table expressions. First, we'll start with a list of fake email addresses. This is something that I often have to do, copy-pasting them from a CSV (Comma Separated Values) file. The following CTE just gets the email addresses into a consumable format:
WITH
emails ( `email` ) AS (
VALUES
ROW( 'laura@acme.example' ),
ROW( 'tim@acme.example' ),
ROW( 'anna@masters.example' ),
ROW( 'sonya@vids.example' ),
ROW( 'robert@acme.example' ),
ROW( 'alice@vids.example' ),
ROW( 'martha@chomp.example' ),
ROW( 'douglas@chomp.example' )
)
This creates a derived table, emails
, with a single column, email
. Common table expressions can be referenced by the main SQL statement; but, they can also be referenced by other CTEs in the same query. And that's exactly what we'll do next—create another CTE that builds upon the prior CTE and extracts the domain from each row:
WITH
-- ... truncated SQL query ...
deconstructed ( `email`, `domain` ) AS (
SELECT
e.email,
SUBSTRING_INDEX( e.email, '@', -1 ) AS domain
FROM
emails e -- CTE reference!
)
As you can see, this CTE is querying from the previous CTE, emails
. Then, it's using the SUBSTRING_INDEX()
function to parse the domain out of each email, creating yet another derived table / CTE.
Next, we'll use this new CTE to create another CTE which groups the emails by domain and records the COUNT()
statistics:
WITH
-- ... truncated SQL query ...
stats ( `domain`, `emailCount` ) AS (
SELECT
d.domain,
COUNT( * ) AS emailCount
FROM
deconstructed d -- CTE reference!
GROUP BY
d.domain
)
As you can see, this CTE is querying from the previous CTE, deconstructed
.
Next, we'll create another CTE which again references the previous CTE, deconstructed
. But, this time, instead of getting the count, we'll collect the email addresses—per domain—into a JSON aggregation.
Note: I could have combined the following CTE with the previous CTE and used both the
COUNT(*)
and theJSON_ARRAYAGG()
in the same query; but, splitting them up allowed me to explore the space a bit more.
WITH
-- ... truncated SQL query ...
aggregated ( `domain`, `collection` ) AS (
SELECT
d.domain,
JSON_ARRAYAGG( d.email ) AS collection
FROM
deconstructed d -- CTE reference!
GROUP BY
d.domain
)
So far, we've only created CTEs that reference other CTEs. But, these CTEs merely setup utility tables, they don't actually return data to the client. Now it's time to write the actual SQL that returns actual data. This SQL will combine multiple CTEs from above:
WITH
-- ... truncated SQL query ...
SELECT
s.domain,
s.emailCount,
a.collection
FROM
stats s -- CTE reference!
INNER JOIN
aggregated a -- CTE reference!
ON
a.domain = s.domain
ORDER BY
s.emailCount DESC,
s.domain ASC
;
As you can see, we're taking two of the CTEs, stats
and aggregated
, and we're JOIN
ing them together.
It's just CTEs all the way down! We've looked at the individual parts in isolation. Now, here's the entire SQL query in one query:
WITH
-- First, we'll start with a common table expression (CTE) for the email addresses.
emails ( `email` ) AS (
VALUES
ROW( 'laura@acme.example' ),
ROW( 'tim@acme.example' ),
ROW( 'anna@masters.example' ),
ROW( 'sonya@vids.example' ),
ROW( 'robert@acme.example' ),
ROW( 'alice@vids.example' ),
ROW( 'martha@chomp.example' ),
ROW( 'douglas@chomp.example' )
),
-- Second, we'll create a common table expression (CTE) that extracts the domain from
-- the email address.
deconstructed ( `email`, `domain` ) AS (
SELECT
e.email,
SUBSTRING_INDEX( e.email, '@', -1 ) AS domain
FROM
emails e -- CTE reference!
),
-- Third, we'll create a common table expression (CTE) that provides some stats for
-- how many emails belong to each domain.
stats ( `domain`, `emailCount` ) AS (
SELECT
d.domain,
COUNT( * ) AS emailCount
FROM
deconstructed d -- CTE reference!
GROUP BY
d.domain
),
-- Fourth, we'll create a common table expression (CTE) that groups the emails by
-- domain and provides the collection of emails as a JSON payload.
aggregated ( `domain`, `collection` ) AS (
SELECT
d.domain,
JSON_ARRAYAGG( d.email ) AS collection
FROM
deconstructed d -- CTE reference!
GROUP BY
d.domain
)
SELECT
s.domain,
s.emailCount,
a.collection
FROM
stats s -- CTE reference!
INNER JOIN
aggregated a -- CTE reference!
ON
a.domain = s.domain
ORDER BY
s.emailCount DESC,
s.domain ASC
;
And, when we run this MySQL query, we get the following output:
When it comes to writing basic business applications, I don't need these sassy SQL mechanics. But, when it comes to reporting and other types of one-off data introspection tasks, features like VALUES
/ROW
, iteratively enhanced common table expressions, and JSON aggregations are just amazing! SQL is a truly wonderful language.
Want to use code from this post? Check out the license.
Reader Comments
One powerful use of CTEs in queries is to recursion. For example, if you have a table that has some kind of hierarchy, you can use CTEs to walk up or down the hierarchy from a defined starting point.
@Carl,
I've seen the recursive feature in the docs, but I've never actually tried it myself. The syntax looks a little strange; but, I imagine it makes some complex data relationships possible to query.
Historically, when I have a parent-child relationship to pull back, I'll just pull back all the "nodes" back, and then build the relationship in the ColdFusion code itself. I should take one of those scenarios and try to rework it use SQL recursion to see how it feels.
Just checking that this is a 'forced example on purpose', given the same result can be achieved with a simple GROUP BY on the calculated domain column;
@Ian,
I will grant that my example is more contrived than is necessary, in order to show that multiple CTEs can be used and chained together. That said, I do think that some degree of breaking things apart makes it more readable / understandable.
For example, in your query, it almost feels like a chicken-and-egg problem. Your
GROUP BY
depends on the calculation ofdomain
in theSELECT
. But then, yourCOUNT(*)
depends on the grouping of the domains... which depends on theSELECT
... which depends on theGROUP BY
, and so on 😄I'm actually a little surprised that is valid (which it clearly does given your output).
So, long story short, yes, it's more contrived than it needed to be; but, not as contrived as I think you think it is. But, it's certainly a matter of subjectivity.
@Ben Nadel,
The reason it 'works' in MySQL (this is a MySQL specific article) is that MySQL doesn't insist you include all the non-aggregate columns in the GROUP BY. My query wouldn't work in e.g. Postgres, as Postgres would insist I also included e.email in my GROUP BY, which would stop it grouping at the domain level.
We're already in the 'you're not porting that puppy to A.N. Other RDMS without some headscratching' territory, so that little detail aside...
My select is just a bog-standard 'column plus aggregates'. OK, that column is 'calculated', but it's a simple row level calc.
So, I'm not suprised that you're suprised it works, but I am suprised, given it does, that you think it's subjective how close/far apart these two queries are in 'understandability'. I think anyone (with any SQL knowledge!) would immediately see what my query is doing (even if they too were suprised that it works). For all its explicatory pluses, yours is not a 'glance and I've got it' query!
Hey ho, on we go...
@Ian,
Totally, at a glance, it's not too hard to understand what it is intending to do. I think one can easily pattern-match on the
GROUP BY
+COUNT
pattern and not actually think too hard about the underlying mechanics.That said, I've often written SQL that I think should work, only be yelled at by the query analyzer 🙃
@Ben Nadel,
Mine is a query I'd hope would work, as for me it's the most direct/understable conversion from English to SQL, given the output requirements. And I'd point out the (easily inferred) intention matches the result!
In other universe, version 8.5 of MySQL removes the requirement to include all non-aggregate columns in a GROUP BY and I'm writing an article about how you can now get rid of all those obfuscatory CTEs😝
@Ian,
Oh very cool - please cross-post it here when you have something to read. 🙌
Another thought I just had, and this is a general one about CTEs, not necessarily related to this specific query. I would guess that a CTE makes it easier to debug a query because it gives you an intermediary result that you can output (ie,
SELECT
on). I draw this parallel to regular programming where having intermediary variables makes break-points and debugging a bit easier to manage.@Ben Nadel,
sorry Ben - my article comment was tongue in cheek.
I was imagining a universe where MySQL's ability to not have to include all non-aggregate columns in the GROUP BY didn't exist currently, and was then brought into an upcoming version. Thereby allowing calculated columns to work in GROUP BY clauses, and thus negating the need for CTEs in your original query, and allowing my query to work.
So, I'd be treating this as a great new feature, in a way you're doing with CTE's here. I was, in short, 'doing an irony'
@Ian,
Ha ha, I can dig it - I've got Monday brain :) That said, I do absolutely love that MySQL doesn't require all columns to be in the
GROUP BY
. I came originally from a Microsoft SQL Server as my first DB tech, which needed all the columns. The relaxed nature of MySQL, in that regard, is so much nicer.Post A Comment — ❤️ I'd Love To Hear From You! ❤️
Post a Comment →