Using LATERAL Derived Tables To Gather Row-Specific Aggregations In MySQL 8.0.14

By Ben Nadel

Published 2022-03-07 in SQL — Comments (9)

After my database was accidentally upgraded to MySQL 8.0.28, I started to look at the features released in version 8. One fun feature that caught my eye was this idea of a LATERAL derived table. This is a derived table in which the resultant rows are calculated based on other tables in the query. Not only that, the "lateral rows" are based on other rows in recordset. This feature, added in MySQL 8.0.14, will make it easier to gather relevant aggregate data for each row in the results.

In the past, if I needed to pull back a number of related aggregations for a given row, I would JOIN to a derived "stats" table that contained the aggregations for the predicted records in the outer query. The downside to this approach is that - because the derived query can't reference the outer query - I need to duplicate all of the filtering inside of the derived query in order to limit the number of records that it generates.

To see what I mean, let's look at an example using data from this blog. In my database schema, I have two tables:

member - Contains all of the author information for the comments.
blog_comment - Contains all of the comments left by the members.

This is a 1-to-many relationship in which every member can leave N-number of comments (related by blog_comment.member_id). If I want to gather aggregate information about the comments for a given user, I would have to filter records using the member_id in both the outer query and the derived query:

SET @userID = 1;

SELECT
	m.id,
	m.name,
	-- Gather the aggregate data from the derived stats table.
	stats.commentCount,
	stats.minCreatedAt,
	stats.maxCreatedAt
FROM
	member m
INNER JOIN
	-- Generate a DERIVED TABLE with multiple aggregations relating to the
	-- commenting for the given user. This allows us to run the related query
	-- "once for the member" instead of "once for each aggregate".
	(

		SELECT
			c.member_id,
			COUNT( * ) AS commentCount,
			MIN( c.date_created ) AS minCreatedAt,
			MAX( c.date_created ) AS maxCreatedAt
		FROM
			blog_comment c
		-- NOTE: Since the derived query CANNOT reference the outer query - it
		-- has to be a constant evaluation - I have to repeat the filtering in
		-- the derived query in order to avoid a FULL TABLE SCAN. In this case,
		-- it's only one column; but, the more complex the filtering gets, the
		-- more conditions would need to be duplicated.
		WHERE
			c.member_id = @userID
		GROUP BY
			c.member_id

	) AS stats
ON
	(
			m.id = @userID -- Filter OUTER QUERY to given member.
		AND
			stats.member_id = m.id
	)
;

As you can see, I'm using the @userID to limit rows in both the outer query as well as in the derived stats query. Since the derived query cannot reference rows in the outer query - derived queries have to be "constant" in the query execution plan - I have to reproduce all of the desired filtering within the derived query in order to prevent a catastrophic full-table scan.

The new LATERAL derived table removes this point-of-friction and allows our derived query to reference rows in the outer query! All we have to do is include the LATERAL keyword prior to the SELECT. Here's the same query, refactored to use a LATERAL derived table:

SET @userID = 1;

SELECT
	m.id,
	m.name,
	-- Gather the aggregate data from the derived stats table.
	stats.commentCount,
	stats.minCreatedAt,
	stats.maxCreatedAt
FROM
	member m
INNER JOIN
	-- Generate a DERIVED TABLE with multiple aggregations relating to the
	-- commenting for the given user. This allows us to run the related query
	-- "once for the member" instead of "once for each aggregate".
	LATERAL (

		SELECT
			COUNT( * ) AS commentCount,
			MIN( c.date_created ) AS minCreatedAt,
			MAX( c.date_created ) AS maxCreatedAt
		FROM
			blog_comment c
		-- Since we are using a LATERAL derived table, it means that this inner
		-- query is run once per outer-query row. Which means, we can now
		-- reference the row in the outer-query in order to perform the
		-- necessary filtering.
		WHERE
			c.member_id = m.id

	) AS stats
ON
	m.id = @userID -- Filter OUTER QUERY to given member.
;

As you can see, I included the LATERAL keyword just prior to my derived table calculation. This changes the derived table from one that must be constant within the query execution plan to one that will execute once per row in the outer query. This means that our derived table query can reference the outer query which allows us to simplify:

Remove the duplicated filtering reference, @userID, in the derived table query - we can now limit the query based on the condition, c.member_id = m.id.
Remove the INNER JOIN condition, stats.member_id = m.id. Since we know that our LATERAL derived table query is running once per row, there's no need to limit the relationship within the ON clause - the derived table SQL is already applying the JOIN condition, essentially.
Remove the c.member_id column from the derived table query. Since we no longer need to use it within the JOIN condition's ON clause, we no longer need to select the column.
Since we are able to remove the non-aggregate column (c.member_id) from the derived table query, we are also able to remove the GROUP BY and just run the aggregates (COUNT, MIN, MAX) on the entirety of the derived recordset.

Even though this query is relatively low in complexity - we're filtering based on a single ID, using the LATERAL derived table is already simplifying the structure of the SQL. The more complex the query gets - and the more filtering it uses - the more savings we'll see with this new type of per-row cross-product.

`LATERAL` Derived Tables and Performance

The nice thing about a traditional derived table is that it has to be "constant" within the query execution plan. Which means, MySQL can run it once and then cache it for the rest of the query. A LATERAL derived table, on the other hand, can't be cached since it is executed once per row of the outer query. This has some performance implications.

But, that doesn't mean that the performance implications are bad. Consider the use-case problem that the LATERAL derived table is solving for. If you look back up at the first SQL statement that we were executing, we're basically doing the same thing: we're generating a derived table "for each row" in the outer query. Only, instead of doing this with a row-specific reference, we're duplicating the filtering conditions inside the derived table.

So, it's not like the LATERAL derived table is doing any additional "work" - it's just simplifying the SQL statement that we have to write.

`LATERAL` Derived Tables Can Return Multiple Rows

For me, the obvious use-case for this is returning multiple aggregates per row without having to rerun the related queries. But, there's nothing about the LATERAL derived table specification that says the derived table can only return one row. If the inner query returns multiple rows, each derived row will be joined to the outer row the same way any JOIN product works in MySQL.

The LATERAL derived table looks like it will be a MySQL feature that I'll enjoy using. I'm already running SQL queries like this, with derived stats. But, throwing LATERAL JOINs into the mix is just going to make the SQL easier to write and to reason about. I'm pretty excited about this one.

Want to use code from this post? Check out the license.

Short link: https://bennadel.com/4222

Reader Comments

Charles Robertson Mar 7, 2022 at 5:15 PM

462 Comments

Hi Ben. This looks interesting.
What happens if you take out the LATERAL keyword and just use the clause, inside parentheses, as a traditional SubQuery?
How will the results differ?

Ben Nadel Mar 7, 2022 at 5:31 PM

15,996 Comments

@Charles,

It will throw an error saying that the m.id column in the WHERE clause:

WHERE
	c.member_id = m.id

... is an unknown column. I believe the derived query is kind of factored-out and executed ahead of time. So, in a traditional derived table, there's no member table context.

Charles Robertson Mar 7, 2022 at 5:39 PM

462 Comments

OK. I see. That's really cool.
So with LATERAL, you can access the outer table from within the INNER JOIN SubQuery?
That really opens up some nice possibilities! 👏🏻

Unfortunately, I am still on MySQL 5.6 😂

By the way, I am loving the edit feature on these comments! 👏🏻

Scott Steinbeck Mar 7, 2022 at 6:35 PM

10 Comments

You can also use LATERAL to move calculated column data into a LATERAL statement (at least in Postgres, haven't tried in MySQL).

SELECT a.total_bill
FROM billing_table bt
LATERAL (bt.previous_bill - bt.current_bill as total_bill) a
WHERE a.total_bill > 200

instead of

SELECT bt.previous_bill - bt.current_bill as total_bill
FROM billing_table bt
WHERE bt.previous_bill - bt.current_bill > 200

which removes duplicate logic when using calculated data in multiple places

Charles Robertson Mar 7, 2022 at 6:49 PM

462 Comments

@Scott,

Much cleaner. I like it 👍🏻

Ben Nadel Mar 7, 2022 at 6:51 PM

15,996 Comments

@Scott,

Oh, super interesting!! I'll have to try that after work (currently only have MySQL 5.7 running). I don't think I saw anything in the documentation about using LATERAL without a derived table; but, that could be because I was in the derived table documentation :D I'll have to see.

@Charles,

Thanks! The editing stuff has been really nice! I find so many little mistakes right after I post 🤣

Scott Steinbeck Mar 7, 2022 at 7:35 PM

10 Comments

Ok so I checked it out and Yes you can do this in MySQL. The premise of a LATERAL JOIN is that you are able to use row-level column data in a JOIN. Typically this would be to join another table but it can just as easily work as a temporary table. Here is a real-world example:

SELECT
  b.id as blog_id,
  age_in_years,
  date(
    created_on + (age_in_years + 1) * interval '1 year'
  ) AS next_anniversary,
  date(
    created_on + (age_in_years + 1) * interval '1 year'
  ) - date(now()) AS days_to_next_anniversary
FROM blog b
CROSS JOIN LATERAL (
  SELECT
    cast(
      extract(YEAR FROM age(now(), b.created_on)) AS int
    ) AS age_in_years
) AS t
ORDER BY blog_id

which I took from this post: https://vladmihalcea.com/sql-lateral-join/

Ben Nadel Mar 7, 2022 at 9:31 PM

15,996 Comments

@Scott,

Ahh, I see - so you would use the SELECT to create a temp-table out of some calculated values. That makes sense! Thanks for the tip - I don't think this would have occurred to me.

Ben Nadel Dec 31, 2022 at 9:24 PM

15,996 Comments

@All,

So, in the past, when I've performed a LATERAL join, I've always done it as an INNER JOIN, as in:

INNER JOIN LATERAL

According to the MySQL docs, the LATERAL join should support LEFT OUTER as well; but, when I went to use it, it would throw an error telling me that there was a syntax error in my SQL.

What I finally realized is that the INNER JOIN vs. LEFT OUTER JOIN doesn't seem to make any difference. Meaning, the left hand table is always returned whether or not there are any records in the derived lateral table. As such, I appear to be able to code my LATERAL join using INNER JOIN even when I'm not expecting all of the cross-products to exist.

This is a little confusing when looking at the SQL, kind of expecting the "inner" join to actually enforce a relationships. But, at least in my version of MySQL (8.0.32), this does not seem to work that way.

Oh my chickens, this post is old!

Hit me up on Twitter if you want to discuss it further.

LATERAL Derived Tables and Performance

LATERAL Derived Tables Can Return Multiple Rows

Reader Comments

`LATERAL` Derived Tables and Performance

`LATERAL` Derived Tables Can Return Multiple Rows