Using LATERAL Derived Tables To Gather Row-Specific Aggregations In MySQL 8.0.14
After my database was accidentally upgraded to MySQL 8.0.28, I started to look at the features released in version 8. One fun feature that caught my eye was this idea of a LATERAL
derived table. This is a derived table in which the resultant rows are calculated based on other tables in the query. Not only that, the "lateral rows" are based on other rows in recordset. This feature, added in MySQL 8.0.14, will make it easier to gather relevant aggregate data for each row in the results.
In the past, if I needed to pull back a number of related aggregations for a given row, I would JOIN
to a derived "stats" table that contained the aggregations for the predicted records in the outer query. The downside to this approach is that - because the derived query can't reference the outer query - I need to duplicate all of the filtering inside of the derived query in order to limit the number of records that it generates.
To see what I mean, let's look at an example using data from this blog. In my database schema, I have two tables:
member
- Contains all of the author information for the comments.blog_comment
- Contains all of the comments left by the members.
This is a 1-to-many relationship in which every member can leave N-number of comments (related by blog_comment.member_id
). If I want to gather aggregate information about the comments for a given user, I would have to filter records using the member_id
in both the outer query and the derived query:
SET @userID = 1;
SELECT
m.id,
m.name,
-- Gather the aggregate data from the derived stats table.
stats.commentCount,
stats.minCreatedAt,
stats.maxCreatedAt
FROM
member m
INNER JOIN
-- Generate a DERIVED TABLE with multiple aggregations relating to the
-- commenting for the given user. This allows us to run the related query
-- "once for the member" instead of "once for each aggregate".
(
SELECT
c.member_id,
COUNT( * ) AS commentCount,
MIN( c.date_created ) AS minCreatedAt,
MAX( c.date_created ) AS maxCreatedAt
FROM
blog_comment c
-- NOTE: Since the derived query CANNOT reference the outer query - it
-- has to be a constant evaluation - I have to repeat the filtering in
-- the derived query in order to avoid a FULL TABLE SCAN. In this case,
-- it's only one column; but, the more complex the filtering gets, the
-- more conditions would need to be duplicated.
WHERE
c.member_id = @userID
GROUP BY
c.member_id
) AS stats
ON
(
m.id = @userID -- Filter OUTER QUERY to given member.
AND
stats.member_id = m.id
)
;
As you can see, I'm using the @userID
to limit rows in both the outer query as well as in the derived stats
query. Since the derived query cannot reference rows in the outer query - derived queries have to be "constant" in the query execution plan - I have to reproduce all of the desired filtering within the derived query in order to prevent a catastrophic full-table scan.
The new LATERAL
derived table removes this point-of-friction and allows our derived query to reference rows in the outer query! All we have to do is include the LATERAL
keyword prior to the SELECT
. Here's the same query, refactored to use a LATERAL
derived table:
SET @userID = 1;
SELECT
m.id,
m.name,
-- Gather the aggregate data from the derived stats table.
stats.commentCount,
stats.minCreatedAt,
stats.maxCreatedAt
FROM
member m
INNER JOIN
-- Generate a DERIVED TABLE with multiple aggregations relating to the
-- commenting for the given user. This allows us to run the related query
-- "once for the member" instead of "once for each aggregate".
LATERAL (
SELECT
COUNT( * ) AS commentCount,
MIN( c.date_created ) AS minCreatedAt,
MAX( c.date_created ) AS maxCreatedAt
FROM
blog_comment c
-- Since we are using a LATERAL derived table, it means that this inner
-- query is run once per outer-query row. Which means, we can now
-- reference the row in the outer-query in order to perform the
-- necessary filtering.
WHERE
c.member_id = m.id
) AS stats
ON
m.id = @userID -- Filter OUTER QUERY to given member.
;
As you can see, I included the LATERAL
keyword just prior to my derived table calculation. This changes the derived table from one that must be constant within the query execution plan to one that will execute once per row in the outer query. This means that our derived table query can reference the outer query which allows us to simplify:
Remove the duplicated filtering reference,
@userID
, in the derived table query - we can now limit the query based on the condition,c.member_id = m.id
.Remove the
INNER JOIN
condition,stats.member_id = m.id
. Since we know that ourLATERAL
derived table query is running once per row, there's no need to limit the relationship within theON
clause - the derived table SQL is already applying theJOIN
condition, essentially.Remove the
c.member_id
column from the derived table query. Since we no longer need to use it within theJOIN
condition'sON
clause, we no longer need to select the column.Since we are able to remove the non-aggregate column (
c.member_id
) from the derived table query, we are also able to remove theGROUP BY
and just run the aggregates (COUNT
,MIN
,MAX
) on the entirety of the derived recordset.
Even though this query is relatively low in complexity - we're filtering based on a single ID, using the LATERAL
derived table is already simplifying the structure of the SQL. The more complex the query gets - and the more filtering it uses - the more savings we'll see with this new type of per-row cross-product.
LATERAL
Derived Tables and Performance
The nice thing about a traditional derived table is that it has to be "constant" within the query execution plan. Which means, MySQL can run it once and then cache it for the rest of the query. A LATERAL
derived table, on the other hand, can't be cached since it is executed once per row of the outer query. This has some performance implications.
But, that doesn't mean that the performance implications are bad. Consider the use-case problem that the LATERAL
derived table is solving for. If you look back up at the first SQL statement that we were executing, we're basically doing the same thing: we're generating a derived table "for each row" in the outer query. Only, instead of doing this with a row-specific reference, we're duplicating the filtering conditions inside the derived table.
So, it's not like the LATERAL
derived table is doing any additional "work" - it's just simplifying the SQL statement that we have to write.
LATERAL
Derived Tables Can Return Multiple Rows
For me, the obvious use-case for this is returning multiple aggregates per row without having to rerun the related queries. But, there's nothing about the LATERAL
derived table specification that says the derived table can only return one row. If the inner query returns multiple rows, each derived row will be joined to the outer row the same way any JOIN
product works in MySQL.
The LATERAL
derived table looks like it will be a MySQL feature that I'll enjoy using. I'm already running SQL queries like this, with derived stats. But, throwing LATERAL
JOIN
s into the mix is just going to make the SQL easier to write and to reason about. I'm pretty excited about this one.
Want to use code from this post? Check out the license.
Reader Comments
Hi Ben. This looks interesting.
What happens if you take out the LATERAL keyword and just use the clause, inside parentheses, as a traditional SubQuery?
How will the results differ?
@Charles,
It will throw an error saying that the
m.id
column in theWHERE
clause:... is an unknown column. I believe the derived query is kind of factored-out and executed ahead of time. So, in a traditional derived table, there's no
member
table context.OK. I see. That's really cool.
So with LATERAL, you can access the outer table from within the INNER JOIN SubQuery?
That really opens up some nice possibilities! 👏🏻
Unfortunately, I am still on MySQL 5.6 😂
By the way, I am loving the edit feature on these comments! 👏🏻
You can also use
LATERAL
to move calculated column data into aLATERAL
statement (at least in Postgres, haven't tried in MySQL).instead of
which removes duplicate logic when using calculated data in multiple places
@Scott,
Much cleaner. I like it 👍🏻
@Scott,
Oh, super interesting!! I'll have to try that after work (currently only have MySQL 5.7 running). I don't think I saw anything in the documentation about using
LATERAL
without a derived table; but, that could be because I was in the derived table documentation :D I'll have to see.@Charles,
Thanks! The editing stuff has been really nice! I find so many little mistakes right after I post 🤣
Ok so I checked it out and Yes you can do this in MySQL. The premise of a
LATERAL JOIN
is that you are able to use row-level column data in a JOIN. Typically this would be to join another table but it can just as easily work as a temporary table. Here is a real-world example:which I took from this post: https://vladmihalcea.com/sql-lateral-join/
@Scott,
Ahh, I see - so you would use the
SELECT
to create a temp-table out of some calculated values. That makes sense! Thanks for the tip - I don't think this would have occurred to me.@All,
So, in the past, when I've performed a
LATERAL
join, I've always done it as anINNER JOIN
, as in:INNER JOIN LATERAL
According to the MySQL docs, the
LATERAL
join should supportLEFT OUTER
as well; but, when I went to use it, it would throw an error telling me that there was a syntax error in my SQL.What I finally realized is that the
INNER JOIN
vs.LEFT OUTER JOIN
doesn't seem to make any difference. Meaning, the left hand table is always returned whether or not there are any records in the derived lateral table. As such, I appear to be able to code myLATERAL
join usingINNER JOIN
even when I'm not expecting all of the cross-products to exist.This is a little confusing when looking at the SQL, kind of expecting the "inner" join to actually enforce a relationships. But, at least in my version of MySQL (8.0.32), this does not seem to work that way.
Post A Comment — ❤️ I'd Love To Hear From You! ❤️
Post a Comment →