Showing The Affects Of Database Index Structure On Membership Table JOIN Performance In MySQL 5.6.37
At InVision, a couple of the engineers on my team are building a new feature that will entail a "membership" pattern of data organization. Meaning, there will be "containers"; and then, "items" within those "containers". Yesterday, we began to noodle on the index design of those database tables; and, while I discussed this type of relationship in my Not-So-Dark Art of Designing Database Indexes article, I did leave the low-level details a bit hand-wavy. As such, I thought it would be fun to take a closer look at how a "membership" style table can be indexed; and, what the EXPLAIN
output of the table is along the way in MySQL 5.6.37.
To create a demo context for this, imagine that I have three tables:
club
club_membership
user
The club
table is the "container" in which the user
records will be "contained". And, the club_membership
table is the "join" table (so to speak) that defines exactly which user
records live inside which club
records. The tables - prior to any relevant indexing - look like this:
CREATE TABLE `club` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(50) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `club_membership` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`clubID` int(10) unsigned NOT NULL,
`userID` int(10) unsigned NOT NULL,
`startedAt` datetime NOT NULL,
`endedAt` datetime DEFAULT NULL,
`isFavorite` tinyint(1) unsigned NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `user` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(50) NOT NULL,
`email` varchar(75) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `email` (`email`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
As you can see, the club_membership
table associates club
and user
records using the date-delimiters, startedAt
and endedAt
. For a club membership to be active, the endedAt
column will be set to NULL
.
The club_membership
table will be used to facilitate three basic query patterns without our demo application:
Show me all active members that are in a given club.
Show me all clubs in which a given user is an active member.
Check to see if a given user is an active member of a given club.
Since none of these questions can be answered using the club_membership
's primary key, it means that we will have to use secondary indexes to drive these look-ups. Let's look at each of these different use-cases.
Show Me All Active Members That Are In A Given Club
To represent this use-case, I'm going to use the following SQL query:
EXPLAIN
SELECT
u.id,
u.name
FROM
club_membership cm
INNER JOIN
club c
ON
(
cm.clubID = 72 -- The club (container) in question.
AND
cm.endedAt IS NULL -- Active memberships.
AND
c.id = cm.clubID
)
INNER JOIN
user u
ON
u.id = cm.userID -- The users (items) within the club (container).
;
As I've written before, I structure my JOIN
queries using an exacting methodology that places the "what I have" data on right-side of all comparisons and the "what I need" data on the left-side of all comparisons. I highly, highly, highly recommend this approach, as it brings a consistency to your SQL queries that enhances the readability; and, as we're about to see, makes designing indexes much more obvious.
If we run the above EXPLAIN
statement before we add any indexes to the club_membership
table, we get the following output:
mysql> EXPLAIN -- .... (truncated) .... no indexes yet.
-> ;
+----+-------------+-------+--------+---------------+---------+---------+-----------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+-----------------------+-------+-------------+
| 1 | SIMPLE | c | const | PRIMARY | PRIMARY | 4 | const | 1 | Using index |
| 1 | SIMPLE | cm | ALL | NULL | NULL | NULL | NULL | 19224 | Using where |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | invisionapp.cm.userID | 1 | NULL |
+----+-------------+-------+--------+---------------+---------+---------+-----------------------+-------+-------------+
3 rows in set (0.00 sec)
As you can see, in order to execute this query, the database would need to scan 19,224 records from the club_membership
table. There are no keys available for this query. And, the Using where
in the Extra column indicates that it will be doing a full-table scan.
Obviously, we almost never want a full-table scan; so, let's start adding an index. As I said above, my JOIN
statements are structured using a "what I have" / "what I need" organization. And, the first "what I have" is the id
of the club
we are looking-up. So, let's create an index that starts with the "what I need", clubID
:
ALTER TABLE `club_membership`
ADD INDEX `IX_byClub` (`clubID`) USING BTREE
;
With this index, if we re-run the EXPLAIN
we get the following output:
mysql> EXPLAIN -- .... (truncated) .... index: (`clubID`).
-> ;
+----+-------------+-------+--------+---------------+-----------+---------+-----------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+-----------+---------+-----------------------+------+-------------+
| 1 | SIMPLE | c | const | PRIMARY | PRIMARY | 4 | const | 1 | Using index |
| 1 | SIMPLE | cm | ref | IX_byClub | IX_byClub | 4 | const | 18 | Using where |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | invisionapp.cm.userID | 1 | NULL |
+----+-------------+-------+--------+---------------+-----------+---------+-----------------------+------+-------------+
3 rows in set (0.00 sec)
Awesome! This time, the database is using our key, IX_byClub
, and the number of rows that it will have to scan dropped from 19,224 records down to 18 records! That's a massive improvement. But, the Extra column still has Using where
, which means that the database is still reading both indexed data (from the key IX_byClub
) and non-indexed data in order to satisfy the INNER JOIN
on club_membership
.
The next "what I have" / "what I need" question in the JOIN
is for endedAt IS NULL
. So, let's try adding endedAt
to the index:
ALTER TABLE `club_membership`
DROP INDEX `IX_byClub`,
ADD INDEX `IX_byClub` (`clubID`, `endedAt`) USING BTREE
;
With this index, if we re-run the EXPLAIN
we get the following output:
mysql> EXPLAIN -- .... (truncated) .... index: (`clubID`, `endedAt`).
-> ;
+----+-------------+-------+--------+---------------+-----------+---------+-----------------------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+-----------+---------+-----------------------+------+-----------------------+
| 1 | SIMPLE | c | const | PRIMARY | PRIMARY | 4 | const | 1 | Using index |
| 1 | SIMPLE | cm | ref | IX_byClub | IX_byClub | 10 | const,const | 7 | Using index condition |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | invisionapp.cm.userID | 1 | NULL |
+----+-------------+-------+--------+---------------+-----------+---------+-----------------------+------+-----------------------+
3 rows in set (0.00 sec)
Getting better! Now that we've added endedAt
to the index, the MySQL database will only have to scan 7-records instead of 18-records. This is because we've built the "active membership" concept into the index.
Our Extra column has also changed to Using index condition
. This means that the MySQL database is using the index to filter the tables in the first INNER JOIN
; but then, has to read the entire row in order to process the second INNER JOIN
. This isn't inherently bad - our performance at this point will be pretty solid. But, we can do better!
NOTE: I am not good at clearly articulating what the Extra column values mean. This is because I don't have a solid mental model for it. And, I usually have to look the values up when I'm confused. Unfortunately, the MySQL documentation regarding the Extra column rarely adds much clarity :D As such, please bare-with my less-than-stellar explanation of how our Extra column readings evolve through this article.
At this point, our index contains all of the "what I need" / "what I have" conditions to satisfy the first INNER JOIN
. As such, let's move onto the second INNER JOIN
to the user
table. To satisfy this JOIN, we need the userID
from the club_membership
table. As such, let's add it to the index:
ALTER TABLE `club_membership`
DROP INDEX `IX_byClub`,
ADD INDEX `IX_byClub` (`clubID`, `endedAt`, `userID`) USING BTREE
;
Now, with this compound, secondary index in place, if we rerun the EXPLAIN
for our query, we get the following output:
mysql> EXPLAIN -- .... (truncated) .... index: (`clubID`, `endedAt`, `userID`).
-> ;
+----+-------------+-------+--------+---------------+-----------+---------+-----------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+-----------+---------+-----------------------+------+--------------------------+
| 1 | SIMPLE | c | const | PRIMARY | PRIMARY | 4 | const | 1 | Using index |
| 1 | SIMPLE | cm | ref | IX_byClub | IX_byClub | 10 | const,const | 7 | Using where; Using index |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | invisionapp.cm.userID | 1 | NULL |
+----+-------------+-------+--------+---------------+-----------+---------+-----------------------+------+--------------------------+
3 rows in set (0.00 sec)
We're still scanning 7-records; but, notice that our Extra column has changed from Using index condition
to Using where; Using index
. What this means is that the MySQL database is now able to pull all needed information out of the index itself. In other words, we've created a covering index for the club_membership
table in this particular use-case / query pattern.
A covering index is always a good thing for read-performance. However, it can be bad for write-performance since a more complex index requires more maintenance as more data is written to the table. That said, since this is one of the primary use-cases that this table, I'm more than happy to create covering indexes.
Show Me All Clubs In Which A Given User Is An Active Member
Now that we've mastered the previous use-case, let's move on to identifying the clubs in which a given user has an active membership. To represent this use-case, I'm going to use the following SQL query:
EXPLAIN
SELECT
c.id,
c.name
FROM
club_membership cm
INNER JOIN
club c
ON
(
cm.userID = 24 -- The user (item) in question.
AND
cm.endedAt IS NULL -- Active memberships.
AND
c.id = cm.clubID -- The clubs (containers) for this user (item).
)
;
If we run EXPLAIN
on this query - leaving in-place the index from our previous use-case - we get the following output:
mysql> EXPLAIN -- .... (truncated) .... no RELEVANT indexes yet.
-> ;
+----+-------------+-------+------+---------------+-----------+---------+------------------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+-----------+---------+------------------------------+------+--------------------------+
| 1 | SIMPLE | c | ALL | PRIMARY | NULL | NULL | NULL | 1000 | NULL |
| 1 | SIMPLE | cm | ref | IX_byClub | IX_byClub | 14 | invisionapp.c.id,const,const | 1 | Using where; Using index |
+----+-------------+-------+------+---------------+-----------+---------+------------------------------+------+--------------------------+
2 rows in set (0.00 sec)
This output is a little confusing because the Using where; Using index
Extra might lead you to think that we're using a covering index; but, if you look at the club
table, you'll see that we're reading in 1,000-records. This is a full-table scan (I put 1,000 records into the club
table in order to explore this topic). Essentially, the query-optimizer is reversing our INNER JOIN
look-up in order to try and use an existing index; however, since none of the indexes were designed for this use-case, the result is terrible.
So, going back to the "what I have" / "what I need" mentality for INNER JOIN
structuring, we can see that the first condition is the userID
that we're trying to look-up. So, let's add that to an index:
ALTER TABLE `club_membership`
ADD INDEX `IX_byUser` (`userID`) USING BTREE
;
And, with this index, if we re-run the EXPLAIN
we get the following output:
mysql> EXPLAIN -- .... (truncated) .... index: (`userID`).
-> ;
+----+-------------+-------+--------+---------------------+-----------+---------+-----------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------------+-----------+---------+-----------------------+------+-------------+
| 1 | SIMPLE | cm | ref | IX_byClub,IX_byUser | IX_byUser | 4 | const | 672 | Using where |
| 1 | SIMPLE | c | eq_ref | PRIMARY | PRIMARY | 4 | invisionapp.cm.clubID | 1 | NULL |
+----+-------------+-------+--------+---------------------+-----------+---------+-----------------------+------+-------------+
2 rows in set (0.00 sec)
This has gotten a little bit better, dropping the table scan from 1,000-records down to 672-records. And, you can see the query optimizer is using the index - IX_byUser
- that we just added. But, the Extra column shows Using where
, which means that we're reading in both indexed and non-indexed data.
To improve upon this, let's add endedAt
to the index, which is the next column in the "what I have" / "what I need" conditions:
ALTER TABLE `club_membership`
DROP INDEX `IX_byUser`,
ADD INDEX `IX_byUser` (`userID`, `endedAt`) USING BTREE
;
With this index, if we re-run the EXPLAIN
we get the following output:
mysql> EXPLAIN -- .... (truncated) .... index: (`userID`, `endedAt`).
-> ;
+----+-------------+-------+--------+---------------------+-----------+---------+-----------------------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------------+-----------+---------+-----------------------+------+-----------------------+
| 1 | SIMPLE | cm | ref | IX_byClub,IX_byUser | IX_byUser | 10 | const,const | 314 | Using index condition |
| 1 | SIMPLE | c | eq_ref | PRIMARY | PRIMARY | 4 | invisionapp.cm.clubID | 1 | NULL |
+----+-------------+-------+--------+---------------------+-----------+---------+-----------------------+------+-----------------------+
2 rows in set (0.00 sec)
At this point, our index structure has dropped the row-read count from 672-records to 314-records, which represents the active club memberships for the given user. But, our Extra column says Using index condition
, which means that the MySQL database is using the index to drive the filtering; but then, needs to read-in the full table row in order to process the full INNER JOIN
.
To improve upon this, we can add our final "what I have" / "what I need" condition to the index: clubID
.
ALTER TABLE `club_membership`
DROP INDEX `IX_byUser`,
ADD INDEX `IX_byUser` (`userID`, `endedAt`, `clubID`) USING BTREE
;
And with this index, if we re-run the EXPLAIN
we get the following output:
mysql> EXPLAIN -- .... (truncated) .... index: (`userID`, `endedAt`, `clubID`).
-> ;
+----+-------------+-------+--------+---------------------+-----------+---------+-----------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------------+-----------+---------+-----------------------+------+--------------------------+
| 1 | SIMPLE | cm | ref | IX_byClub,IX_byUser | IX_byUser | 10 | const,const | 315 | Using where; Using index |
| 1 | SIMPLE | c | eq_ref | PRIMARY | PRIMARY | 4 | invisionapp.cm.clubID | 1 | NULL |
+----+-------------+-------+--------+---------------------+-----------+---------+-----------------------+------+--------------------------+
2 rows in set (0.00 sec)
We're still scanning 315-records, which is the number of active club memberships belonging to this user; but, notice that our Extra column has changed from Using index condition
to Using where; Using index
. This means that the MySQL database is now able to pull all needed information out of the index. Which, again, means that we've created a covering index for the club_membership
table in this particular use-case / query pattern.
And again, a covering index - for a primary use case of the table - is awesome!
Check To See If A Given User Is An Active Member Of A Given Club
Our final use-case for the club_membership
table is checking to see if a given user has an active membership to a given club. A query like this would normally be used to assert some sort of permission on behalf of the user; or, act as a guard to some sort of membership mutation.
To represent this use-case, I'm going to use the following SQL query:
EXPLAIN
SELECT
1
FROM
club_membership cm
WHERE
cm.clubID = 72 -- The club (container) in question.
AND
cm.userID = 24 -- The user (item) in question.
AND
cm.endedAt IS NULL -- Active memberships.
;
Notice that I am pulling back 1
in the SELECT
statement. In this case, that's because I don't really care about the data itself - only that the data exists or doesn't exist indicating an active club membership or an ended / non-existent club membership, respectively.
Now, if we run the EXPLAIN
on this SQL query, we get the following output:
mysql> EXPLAIN -- .... (truncated) .... only previous indexes.
-> ;
+----+-------------+-------+------+---------------------+-----------+---------+-------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------------+-----------+---------+-------------------+------+--------------------------+
| 1 | SIMPLE | cm | ref | IX_byClub,IX_byUser | IX_byClub | 14 | const,const,const | 1 | Using where; Using index |
+----+-------------+-------+------+---------------------+-----------+---------+-------------------+------+--------------------------+
1 row in set (0.00 sec)
Woot woot! As you can see, this particular use-case is already handled by the IX_byClub
index, which makes sense since this index has all of the columns that we're looking for - remember what we did above:
ADD INDEX `IX_byClub` (`clubID`, `endedAt`, `userID`) USING BTREE
The difference with this query is that we're not using the userID
to power a second INNER JOIN
. Instead, we're using it to filter within the first INNER JOIN
. And, if you look at the Extra column, we have Using where; Using index
which, again, means that we have a covering index. Or, in other words, all of the data that the MySQL database needs to satisfy this query comes directly out of the index - the database doesn't have to read any table rows for this use-case with these indexes.
Writing SQL Is Thrilling
Sometimes, there's nothing quite as satisfying as creating database tables and designing indexes that power your application. Hopefully this exploration has helped illustrate how the structure of an index lines-up with the structure of your SQL queries; and, how you might leverage intentful conditions of your INNER JOIN
clauses in order to help identify which columns need to be placed in your database indexes.
Index design can be scary; but, if you remember that indexes are meant to evolve along-side your application and its query-patterns, you can start to let go of some of that stress.
Want to use code from this post? Check out the license.
Reader Comments
I learn so much about EXPLAIN through your articles. These are the finer points of SQL which are easy to ignore, but so useful to bring into the light. Thanks Ben!
@Chris,
The
EXPLAIN
command is awesome; but, I will freely admit that it is generally confusing; and, that I am often confused by it :D The high-level points that I look for are that the tables are:The Extra column is additional information, but definitely ups-the-level of fuzziness in how I think about the query. Here's the docs for
Explain
:https://dev.mysql.com/doc/refman/8.0/en/explain-output.html
... one look at the docs and I'm more confused than I was when I started :D
Ben. Interesting stuff. Just out of interest, why did you not write the 2nd Query like:
I usually, only add ID comparisons to my ON clause?
@Charles,
It's a personal preference. I like to do as much of my filtering in the same place that I reference the table. By moving
WHERE
conditions into myON
clauses, I can see the table and the filtering of that table in the same physical location within the SQL statement.I am sure that under-the-hood the query planner is doing the same thing in both cases; meaning, I doubt there is any technical difference between the two approaches. It's all just mental.
One thing that I like about having the two "aspects" in the same place is that I think it forces me to think more clearly about limiting the cross-product of the various tables. When I think about a query in two difference phases (ie, "join" and then "filter"), I don't think as clearly about how I want to reduce the number of rows that get scanned.
On the other hand, when I think about each table and the filtering at the same time, it just makes it more clear in my head that I am reducing the cross-product as much as possible.
To be honest, I thought it was only possible to compare the Primary Key & Foreign Key in the ON clause. I didn't realise you could add any kind of comparison?
So, this has been a valuable lesson...
@Charles,
Right? SQL is super flexible. It's kind of an amazing language. No wonder it has lasted so long. And, some of the DB systems have all kinds of crazy extensions that I can't even wrap my brain around.
Yes. And MySQL Server is super reliable.
Yesterday I set up a new VPS and decided I couldn't be bothered to do a full MySQL install.
So, I downloaded MySQL 5.5 Server. No workbench. 30MB download. It took about 10 seconds to install and is working perfectly!
At work, we use MSSQL and the install is about 2GB. And that is without MSSQL Management Studio, which is another gigabyte! It is seriously top heavy.
Give me MySQL any day!
@Charles,
I used to use MSSQLServer back in the day (when I worked for an MS shop). I remember the management studio being a pretty solid application. Though, at one point, they updated it and it got a loooooot more complicated.
I've heard people rave about PostgreSQL, but I've not used it personally.