Creating A Composite Index Using Ancestral Keys In A Hierarchical Database Table Design
For the most part, when designing database tables that exist in some sort of an hierarchical relationship, the "child" table will contain a key reference to the "parent" table. So, for example, a "list item" table would contain a listID
that references its "container" list. This works well for small hierarchies; but, as the complexity of my applications has grown, what I've discovered is that it would often be beneficial for a table to reference not just its parent key but, rather, the entire set of slow-changing ancestral primary keys. And, it would be helpful to have all of these keys wrapped up in a single, composite index.
To explore this concept, consider InVision. InVision's platform revolves, in part, around the concept of Prototypes. Enterprises have prototypes; prototypes have screens; screens have conversations; conversations have comments; and comments have "unread flags" for users.
Several of these relationships will never change. Meaning, a comment will never be moved to a different screen; and, a screen will never be moved to a different prototype. But, it is possible to transfer a prototype from one enterprise to another. That said, these transfers happen infrequently, so I am including the enterprise-prototype relationship in the list of "slow-changing" relationships.
Now, currently, all of these relationships are modeled with a single "parent" key column in each "child" table (more or less):
prototype
->companyID
screen
->prototypeID
conversation
->screenID
comment
->conversationID
unread_flag
-> (commentID
,userID
)
This works well for simple operations. But, it starts to get complex when larger, more sweeping operations need to be performed. For example, if I delete a screen
, I have to delete all of the conversation
, comment
, and unread_flag
records that pertain to that screen. This gets a bit hairy, especially since I don't use triggers to perform cascading deletes.
ASIDE: I try to avoid triggers as much as possible because it is a non-obvious way to alter data. Meaning, a developer - new to a particular area of the application - would have to know that "business logic" around a particular operation is spread across both the application code and the database schema design. This is asking for trouble, in my opinion; and, is why I keep all "business logic" in the application code.
What could be possible awesome is if each of these tables contained a composite index that referenced the entire hierarchy of its slow-changing parent-child relationships:
prototype
-> (companyID
)screen
-> (companyID
,prototypeID
)conversation
-> (companyID
,prototypeID
,screenID
)comment
-> (companyID
,prototypeID
,screenID
,conversationID
)
As you can see, each table contains not only its parent key, but rather, the entire set of ancestral keys. To be clear, this is a trade-off. Having a more robust composite index design makes the indices more expensive to store and maintain. It also makes some of the queries more complex. But, it would make some types of queries much easier. For example, if you needed to delete an enterprise, you could quickly and easily delete all records owned by that enterprise with nothing more than the companyID
value.
Now, you may notice that I don't have unread_flag
in the list of tables above. That's because this one is a bit tricky when you think about data-access patterns. If you recall from my post on the art of database index design, indices are built to support the data-access patterns of the application. As such, they need to be designed based on how they will be consumed. And while an "unread flag" is owned by a screen that is ultimately owned by an enterprise, "unread flags" will only ever be consumed by a user.
What this means is that it probably makes sense to put the userID
column at the front of the hierarchical, composite key. While the lower-level data is all owned by the "enterprise", the "unread flag" itself is conceptually owned by the "user":
unread_flag
-> (userID
,companyID
,prototypeID
,screenID
,conversationID
,commentID
)
Of course, this unread_flag
table is also going to be consumed in "delete" operations as well. And, the deletes are usually coming from the other direction. Meaning, deleting a comment
, a conversation
, a screen
, or a prototype
would all require the deleting of unread_flag
records as well. So, in this case, it may make sense to have two composite keys:
unread_flag
-> (userID
,companyID
,prototypeID
,screenID
,conversationID
,commentID
)unread_flag
-> (companyID
,prototypeID
,screenID
,conversationID
,commentID
)
Notice that the second one doesn't include the userID
at the end because it doesn't need to - any userID
-based query should be handled by the first composite key.
This unread_flag
table is a great illustration of how database index design is both about data-access patterns and performance trade-offs. These two indices are mostly overlapping in terms of what they have to store (and maintain). But, they are overlapping in a way that promotes a variety of queries that will be performed by the application.
Database index design is always a matter of trade-offs. Period.
That said, this type of database index design isn't just a trade-off at the storage level, it's also a trade-off at the query level. Since database indices can only be consumed using a left-prefix, in order to use a lower-level identifier, you also have to supply all of the higher-level identifiers in the same query.
What this means is that traditionally simple queries like this:
SELECT
cv.*
FROM
conversation cv
WHERE
cv.screenID = ? -- Assuming this table has an index on `screenID`.
... now become more complex queries like this:
SELECT
cv.*
FROM
conversation cv
WHERE
cv.companyID = ?
AND
cv.prototypeID = ?
AND
cv.screenID = ?
In order for the composite index to be consumed, the entire left-prefix of the index has to be provided in the query. This adds a certain level-of-effort; but, should be fairly manageable. After all, if a user is digging into the lower-level areas of data, they are usually doing so in a context in which the higher-level identifiers have already been selected.
But, to reiterate, database index design is always a matter of trade-offs.
While some SELECT
queries get more complex in this approach, many DELETE
queries become so simple they can't contain errors. For example, if we needed to delete a prototype
, we could run this set of queries:
DELETE FROM `unread_flag` WHERE `companyID` = ? AND `prototypeID` = ?;
DELETE FROM `comment` WHERE `companyID` = ? AND `prototypeID` = ?;
DELETE FROM `conversation` WHERE `companyID` = ? AND `prototypeID` = ?;
DELETE FROM `screen` WHERE `companyID` = ? AND `prototypeID` = ?;
DELETE FROM `prototype` WHERE id = ?;
NOTE: I wouldn't do this in a single query. I am only writing this as a series of
DELETE
statements for the demo. In all likelihood, each of these queries would be owned by a different entity Gateway within the application code.
As you can see, with these composite keys, whole sets of complex, nested hierarchical data can be deleted with very simple queries.
The goal of this index design approach is to make read-access faster at the expense of making write-access slower. And while write-access is slower (due to index maintenance), many of the deletion and aggregation algorithms become much more straightforward because they have a powerful and flexible left-prefix to work with.
This concept isn't something that I've had much of an opportunity to test, given the brown-field nature of most of my work. But, in the small areas of the application that I've begun to apply this, it seems to provide a net-positive outcome. It is definitely an approach that I am eager to start implementing more often.
Want to use code from this post? Check out the license.
Reader Comments
@All,
I wanted to follow-up with another quick post about index design, this time using a range condition:
www.bennadel.com/blog/3644-considering-index-design-when-using-a-nullable-datetime-column-to-record-a-scheduled-action.htm
The fun thing about a range condition is that MySQL will only consume a composite index up to and including the range condition. Which means, you probably have to organize your composite index in the reverse order from which you think about it.
Ben. This is a great idea. I never thought about building an index this way. Just out of interest, have you done any speed comparison tests between non indexed, normally indexed and ancestral key indexed read access? I would be very interested to see kind of speed savings that this approach provides.
@Charles,
I don't have any speed-tests currently. But, what I can say is that having to dig down through multiple tables to access related data can definitely be very slow. I see that in my work. Take, for example, the
unread_flag
table. Right now, in my work database, that table only relates touser
andcomment
. Which means, if I need to know how many "unread comments" a user has for a given Prototype, I have to perform aJOIN
like this:That is not optimal :D Now, compare that to being able to just doing:
Not only does this only reference a single table, but the index that it is consuming is a covering index. Meaning, all of the data that I need for this query is contained entirely within the index itself. Which means, this has to do no file I/O -- it just reads entirely from the index.
I have to believe this will be faster always. For this particular query.
That said, every query is different, so I can't say. But, given this
unread_flag
index, I can easily ask the following questions without having to join to any tables:Essentially, each of these questions just consumes more of the left-prefix of the index. And, that's why it is so flexible.
But again, I have no hard evidence -- just a gut feeling :D