MySQL Query Optimization + Forgetting To Run EXPLAIN = Full Table Scan
Yesterday, while editing some ColdFusion code for this blog, I had a face-palm moment: I came across a query - the "recent comments" query on the homepage - that was almost certainly performing a full table scan of comments. At first glance, I thought I must have forgotten to check the performance characteristics of the SQL query with EXPLAIN. But, as I began to tweak the SQL query, I realized what happened: an earlier version of the query was using indexes; but then, I made a small change that completely altered MySQL's query execution. I love how clever the MySQL query optimizer is sometimes; so, I thought this might be worth a closer look.
If you go to the homepage of this blog, you'll see a "Recently Posted Comments" section that shows the 10 most recent comments along with the "member" that posted the comment and the "entry" on which the comment was posted. This data is gathered across three tables:
blog_comment
blog_entry
member
While there is nothing of particular surprise in the blog_comment and member tables, the blog_entry table has an isActive boolean. This boolean allows me to take blog posts offline without actually deleting them. But, when I first authored this query, I suspect that I had forgotten about the isActive check and left it out.
Let's take a quick look at that incomplete query first so that we can see exactly where I messed up. Here's a truncated version of the SQL query without the isActive check (the SELECT column list is truncated because it's not relevant to the conversation):
SELECT
	/* ... truncated ... */
FROM
	blog_comment c
INNER JOIN
	blog_entry b
ON
	b.id = c.blogEntryID
INNER JOIN
	member m
ON
	m.id = c.memberID
ORDER BY
	c.id DESC
LIMIT
	10
As you can see, it's a relatively straightforward SQL query. I'm defining the JOIN conditions using best practices: data that I have is on the right, data that I need is on the left. Then, I'm using ORDER BY id DESC in lieu of date-based sorting in order to get the most recent comments.
In fact, this SQL query is so simple that the MySQL query optimizer does something very clever! It knows that, naively, it would have to scan the whole table; but, it also knows that it doesn't have to do any filtering; so, it attempts to run the query "backwards," so to speak.
Let's look at the EXPLAIN output for this query:
mysql> EXPLAIN SELECT .... ;
+----+-------------+-------+------------+--------+----------------------+---------+---------+------------------------+------+----------+---------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+----------------------+---------+---------+------------------------+------+----------+---------------------+
| 1 | SIMPLE | c | NULL | index | IX_join,IX_by_member | PRIMARY | 4 | NULL | 10 | 100.00 | Backward index scan |
| 1 | SIMPLE | b | NULL | eq_ref | PRIMARY | PRIMARY | 4 | bennadel.c.blogEntryID | 1 | 100.00 | NULL |
| 1 | SIMPLE | m | NULL | eq_ref | PRIMARY,IX_info | PRIMARY | 4 | bennadel.c.memberID | 1 | 100.00 | NULL |
+----+-------------+-------+------------+--------+----------------------+---------+---------+------------------------+------+----------+---------------------+
3 rows in set, 1 warning (0.00 sec)
Notice that the blog_comment table (alias c) only has to scan 10 rows. In the Extra column, it states:
Backward index scan
Essentially, what (I believe) is happening here is that the query execution is just scanning the primary key index on the blog_comment table backwards, grabbing the "first" 10 rows, and then performing the JOIN conditions. Basically, it's merging the requirements of the ORDER BY into the ON condition of the first JOIN.
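Here's how I picture that optimization, sketched in JavaScript. To be clear, the rows and Maps here are made-up stand-ins for the tables, and MySQL's actual execution is more involved - this is only my mental model of it:

```javascript
// Hypothetical in-memory stand-ins for the three tables.
const blogComment = [
	{ id: 1, blogEntryID: 10, memberID: 1 },
	{ id: 2, blogEntryID: 10, memberID: 2 },
	{ id: 3, blogEntryID: 11, memberID: 1 },
	{ id: 4, blogEntryID: 11, memberID: 2 }
];
const blogEntry = new Map([
	[ 10, { id: 10, name: "Post A" } ],
	[ 11, { id: 11, name: "Post B" } ]
]);
const member = new Map([
	[ 1, { id: 1, name: "Sarah" } ],
	[ 2, { id: 2, name: "Tom" } ]
]);

// "Backward index scan": walk the primary key from the end, take the first
// N rows, and only THEN perform the JOINs (each one a primary-key lookup).
function recentComments( limit ) {

	return blogComment
		.slice( -limit ) // Scan the PK index backwards, stop after `limit` rows.
		.reverse()       // Emit in descending id order (ORDER BY c.id DESC).
		.map( ( c ) => ({
			comment: c,
			entry: blogEntry.get( c.blogEntryID ), // eq_ref on blog_entry PK.
			author: member.get( c.memberID )       // eq_ref on member PK.
		}));

}

console.log( recentComments( 2 ).map( ( r ) => r.comment.id ) ); // [ 4, 3 ]
```

Notice that the work is proportional to the LIMIT, not to the size of the comments table - which is exactly why the EXPLAIN above only touches 10 rows.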
To be clear, this is a query optimization that is happening behind the scenes - it is not what the query would be doing if it were naively executing the JOINs.
I think about SQL JOINs in the same way that I think about using the .map() and .filter() methods in ColdFusion and JavaScript. Meaning, the result-set of a cross-table query is calculated one JOIN at a time; and then, the materialized results of that JOIN are passed down to the next JOIN. I don't know how accurate this mental model is in relation to the nested-loop join algorithm that MySQL uses; but, it keeps things simple and helps me think about index structures and performance.
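That mental model translates to JavaScript quite literally. The tiny arrays here are made-up; the point is just the shape of the computation - one JOIN materialized at a time:

```javascript
// Made-up rows standing in for the tables (note: entry 8 doesn't exist).
const comments = [
	{ id: 1, blogEntryID: 7 },
	{ id: 2, blogEntryID: 8 }
];
const entries = [
	{ id: 7, isActive: 1 }
];

// Mental model: each JOIN materializes a new result-set, which is then passed
// down to the next JOIN - just like chaining .map() and .filter() calls.
const resultSet = comments
	// INNER JOIN blog_entry b ON (b.id = c.blogEntryID):
	.map( ( c ) => ({ c, b: entries.find( ( b ) => b.id === c.blogEntryID ) }) )
	// Rows with no matching entry fall out of an INNER JOIN:
	.filter( ( row ) => row.b !== undefined );

console.log( resultSet.length ); // 1 - only the comment on entry 7 survives.
```

In this model, every upstream row gets visited before any LIMIT is applied - which is why the "backward index scan" short-circuit is such a nice surprise.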
Accepting that I did run an EXPLAIN on my original query and saw that it was scanning 10 rows, given my mental model for how JOINs work, I likely glossed over the whole "backward index scan" optimization and just assumed that the query was using standard JOIN mechanics. Which means that, when I went to add the missing isActive=1 condition to my ON clause, I assumed that I was doing little more than updating the cross-product calculation that was already in place. As such, I likely assumed that I didn't have to re-run the EXPLAIN on the updated SQL statement.
However, by adding the isActive condition, I negated MySQL's ability to apply the "backward index scan" optimization, which fundamentally changed the query plan. Here's the updated SQL statement:
SELECT
	/* ... truncated ... */
FROM
	blog_comment c
INNER JOIN
	blog_entry b
ON
	(
		b.id = c.blogEntryID
	AND
		b.isActive = 1 -- Active posts only!
	)
INNER JOIN
	member m
ON
	m.id = c.memberID
ORDER BY
	c.id DESC
LIMIT
	10
;
All I did was add one condition to the first ON clause. However, if we now run an EXPLAIN on this, we get a radically different result from our earlier EXPLAIN:
mysql> EXPLAIN SELECT .... ;
+----+-------------+-------+------------+--------+----------------------+---------+---------+---------------------+------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+----------------------+---------+---------+---------------------+------+----------+----------------------------------------------+
| 1 | SIMPLE | b | NULL | ALL | PRIMARY | NULL | NULL | NULL | 2605 | 50.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | c | NULL | ref | IX_join,IX_by_member | IX_join | 4 | bennadel.b.id | 14 | 100.00 | NULL |
| 1 | SIMPLE | m | NULL | eq_ref | PRIMARY,IX_info | PRIMARY | 4 | bennadel.c.memberID | 1 | 100.00 | NULL |
+----+-------------+-------+------------+--------+----------------------+---------+---------+---------------------+------+----------+----------------------------------------------+
3 rows in set, 1 warning (0.00 sec)
This time, instead of scanning the blog_comment table first, the query performs a full table scan of the blog_entry table in order to limit the results to rows where the isActive=1 condition can be satisfied.
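Roughly, the new plan behaves like this sketch (tiny made-up arrays again; MySQL's real nested-loop execution is more sophisticated, but the cost shape is the same):

```javascript
// Made-up rows standing in for the tables.
const blogEntry = [
	{ id: 1, isActive: 1 },
	{ id: 2, isActive: 0 },
	{ id: 3, isActive: 1 }
];
const blogComment = [
	{ id: 1, blogEntryID: 1 },
	{ id: 2, blogEntryID: 2 },
	{ id: 3, blogEntryID: 3 }
];

// With isActive in play, the plan drives from blog_entry: touch EVERY entry
// row, keep the active ones, join their comments, and only then sort + limit.
const results = blogEntry
	.filter( ( b ) => ( b.isActive === 1 ) )           // Full table scan + filter.
	.flatMap( ( b ) => blogComment
		.filter( ( c ) => ( c.blogEntryID === b.id ) ) // ref lookup on IX_join.
		.map( ( c ) => ({ c, b }) )
	)
	.sort( ( x, y ) => ( y.c.id - x.c.id ) )           // Using temporary; filesort.
	.slice( 0, 10 );                                   // LIMIT applied dead last.

console.log( results.map( ( r ) => r.c.id ) ); // [ 3, 1 ] - comment 2 is gone.
```

The work is now proportional to the size of the blog_entry table (2,605 rows in the EXPLAIN above), and the LIMIT can't help until after the sort.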
The fundamental problem here is that my SQL query didn't have a good way to limit the cross-product between the tables as they were being calculated. To remedy this, I can break the query into two parts:
Find an appropriate comment ID that represents the earliest of the "recent comments" - a "cut-off" ID.
Use the cut-off ID to limit the cross-product of the subsequent JOIN.
The following SQL query isn't exactly equivalent to the one above; but, we'll get into that shortly:
/**
* Let's get the 11th-from-last comment ID that we can use to drastically limit
* the INNER JOIN condition in the subsequent query. Since this query is only
* inspecting the primary key of the table, it can get this ID instantly using
* the backward index scan approach.
*/
SET @cutoffID = (
	SELECT
		c.id
	FROM
		blog_comment c
	ORDER BY
		c.id DESC
	LIMIT
		1
	OFFSET
		10 -- We want the most recent 10 comments, skip over first 10 IDs.
);
SELECT
	/* ... truncated ... */
FROM
	blog_comment c
INNER JOIN
	blog_entry b
ON
	(
		c.id > @cutoffID -- !! USING THE CUT-OFF ID TO LIMIT CROSS-PRODUCT !!
	AND
		b.id = c.blogEntryID
	AND
		b.isActive = 1
	)
INNER JOIN
	member m
ON
	m.id = c.memberID
ORDER BY
	c.id DESC
;
Now, when we run the SQL query, we're including our @cutoffID variable in the INNER JOIN, which allows the query execution to greatly limit the results of the cross-product. In fact, since our cut-off ID already limits us to the 10 most recent IDs in the blog_comment table, we can remove the LIMIT 10 from the second portion of the query since it is now redundant (given that all of our JOIN relationships are one-to-one).
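The two-step cut-off mechanic, as a JavaScript sketch (hypothetical IDs; `cutoffID` plays the role of the @cutoffID session variable):

```javascript
// The primary key index, ascending - twelve made-up comment IDs.
const commentIDs = [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 ];

// STEP ONE: walk the index backwards and grab the 11th-from-last ID. Only
// the index itself is touched, so this is nearly instant (LIMIT 1 OFFSET 10).
const cutoffID = commentIDs[ commentIDs.length - 1 - 10 ];

// STEP TWO: the JOIN now only ever considers IDs ABOVE the cut-off, which
// caps the cross-product at 10 rows before any other work happens.
const candidateIDs = commentIDs.filter( ( id ) => ( id > cutoffID ) );

console.log( cutoffID );            // 2
console.log( candidateIDs.length ); // 10
```

The range condition (id > cutoffID) is what lets MySQL go back to walking the primary key instead of driving from the blog_entry table.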
If we run an EXPLAIN on this new query, we get the following:
mysql> EXPLAIN SELECT .... ;
+----+-------------+-------+------------+--------+------------------------------+---------+---------+------------------------+------+----------+----------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+------------------------------+---------+---------+------------------------+------+----------+----------------------------------+
| 1 | SIMPLE | c | NULL | range | PRIMARY,IX_join,IX_by_member | PRIMARY | 4 | NULL | 10 | 100.00 | Using where; Backward index scan |
| 1 | SIMPLE | b | NULL | eq_ref | PRIMARY | PRIMARY | 4 | bennadel.c.blogEntryID | 1 | 50.00 | Using where |
| 1 | SIMPLE | m | NULL | eq_ref | PRIMARY,IX_info | PRIMARY | 4 | bennadel.c.memberID | 1 | 100.00 | NULL |
+----+-------------+-------+------------+--------+------------------------------+---------+---------+------------------------+------+----------+----------------------------------+
3 rows in set, 1 warning (0.01 sec)
As you can see, we're back to performing a "Backward index scan" on the primary key of the blog_comment table, which filters the cross-product down to 10 records. Only this time, it's performing a "range" look-up with our pre-calculated @cutoffID value.
That said, this query and the previous query are not equivalent. In the first version of this query, if I were to mark the most recent blog post as inactive, the query would simply move on to the active blog posts and return the 10 most recent comments. However, in this latter query, if I were to mark the most recent blog post as inactive, I would likely lose results since my @cutoffID doesn't take into account any inactive blog posts.
To balance out performance and robustness, I could do something like limit the @cutoffID to the 100th or 200th comment ID and then add the LIMIT back into the final query. This would build in some wiggle-room for the isActive=1 JOIN condition.
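That wiggle-room idea, in sketch form. The function name, the denormalized entryIsActive flag, and the wiggle value are all made up for illustration - the real version would stay in SQL:

```javascript
// Made-up shape: comments ordered by id ascending, each flagged with whether
// its parent blog entry is active (a denormalized stand-in for the JOIN).
function recentActiveComments( allComments, wiggle, limit ) {

	// Cut off at the (wiggle + 1)th-from-last ID instead of the 11th-from-last.
	const cutoffIndex = ( allComments.length - 1 - wiggle );
	const cutoffID = ( cutoffIndex >= 0 )
		? allComments[ cutoffIndex ].id
		: 0;

	return allComments
		.filter( ( c ) => ( c.id > cutoffID ) && c.entryIsActive ) // JOIN + isActive.
		.sort( ( a, b ) => ( b.id - a.id ) )                       // ORDER BY id DESC.
		.slice( 0, limit );                                        // LIMIT restored.

}
```

With a wiggle of 100, the restored LIMIT still fills up with 10 comments so long as fewer than 90-or-so of the last 100 comments belong to inactive posts.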
But, since I almost never mark a blog post as inactive, especially after it has received comments, I'm not even going to worry about that edge-case at this time.
Ultimately, I'm the one responsible for running EXPLAIN on my SQL queries before I deploy them to production. So, this was just sloppiness on my part. But, it's pretty cool to see how the MySQL query optimizer was able to save me from myself in at least one of the full table scan scenarios.
Reader Comments
In general, I despise sql syntax and I've always found it confusing. I find it easier to decipher full blown assembler, than SQL syntax. I guess that makes me odd. So unfortunately, the sql in your post looks like:
But maybe that's why I still can't find work 🤔
@Anthony,
It's just arc-tangents and assembler code for you, huh? 😆 But, seriously, SQL definitely rubs some people the wrong way. I know there are people who love document databases; but, when I look at the syntax for dealing with those APIs, they feel soo wonky to me. I guess so much of it is just what you're used to looking at.