SQL Join Table vs. SQL Entity Table And When To Switch
I have to say that I have never been too educated when it comes to Database architecture. I have picked up tips here and there and found patterns that I really like, but overall, no one has ever sat me down and taught me the tried and true methods of database development. Occassionally, this comes to light. About a year ago, I was on the phone with Frank Yang, CTO/CIO of RecycleBank, discussing changes that we needed to make to their systems database. I kept referring to one of the new tables we were designing as a "join table". Frank found this very confusing because it contained more than foreign keys. As he explained, industry standards dictate that Join tables only contains the foreign keys of the tables being joined. If you have anything more than that, you no longer have a join table, you have an entity table.
It was very hard for me to accept this fact, and we argued over this for some time before coming to a conclusion (of which I cannot remember). Over time, however, I have slowly begun to see Frank's point. Once you do have more than just foreign keys in a table, you really are modelling a totally different kind of relationship between two entities. And, that relationship, often times, should become an entity in and of itself.
Let's walk through a thought experiment so I can think out loud and try to explain myself. Let's pretend we the following two tables:
boy
-------------------
id - int (pkey)
name - varchar
age - tinyint
girl
-------------------
id - int (pkey)
name - varchar
age - tinyint
Pretty simple. Sure, this could be one table with a gender column, but for our demonstration, we need to tables.
Now that we have these two entity tables, let's join together the boys and girls. My initial reaction would be to create a join table with the keys of the two joined entities:
boy_girl_jn
-------------------
boy_id - int
girl_id - int
This is a "standard" join table. All it does is join two entity tables in a many to many relationship. Notice also that my naming convention dictates that all join table names consist of the name of the two entity tables in alphabetical order followed by, "_jn", which gives us "boy_girl_jn".
So far so good. But these foreign keys don't really tell us much about this relationship at all. It merely tells us that N boys are related to N girls in some way. Let's go ahead now and say that this relationship can be either romantic or platonic. We can define this using a boolean, is_romantic column:
boy_girl_jn
-------------------
boy_id - int
girl_id - int
is_romantic - tinyint
Now our join table has one additional, tiny piece of information. No problem - this doesn't really affect our join too much. But now, let's come to accept the sad realization that relationships don't always last; friends become enemies, enemies become friends, dogs and cats living together - it's the circle of life. In a join table, if a relationship ends, we delete the join records; however, these are human relationships and we don't simply want to delete them if they end - we want to record their start and finish for historical purposes. How do we deal with that? Easy, let's add a start date and an end date:
boy_girl_jn
-------------------
boy_id - int
girl_id - int
is_romantic - tinyint
date_started - datetime
date_ended - datetime (allows null)
With these two new columns, date_started and date_ended, we start to hit a problem - the existence of a "join" record no longer indicates the current relationship between the two entities involved; there might very well be a join record for a boy and a girl that have no relationship currently. While this is ok from a data standpoint - we have all the necessary fields - you can see that intent of a table has shifted dramatically. We are no longer just creating a many to many relationship between two entities, we are staring to model a third entity all together - the human relationship.
It's time that we convert out join table to an entity table. Let's take boy_girl_jn and create "relationship":
relationship
-------------------
id - int (pkey)
boy_id - int
girl_id - int
is_romantic - tinyint
date_started - datetime
date_ended - datetime (allows null)
I don't know about you, maybe this is just me, but now that we have an entity table rather than a join table, my mental view of it has changed completely. Something about the fact that it is an entity table totally frees me up to think about it in a much more rich and robust way. Now, I can see other attributes that I might want to add, such as which party started and ended the relationship:
relationship
-------------------
id - int (pkey)
boy_id - int
girl_id - int
is_romantic - tinyint
initiated_by_gender - char( M / F )
ended_by_gender - char( M / F )
date_started - datetime
date_ended - datetime (allows null)
And what about marriage? Did any of these romantic relationships lead to marriage? We can add a column for that to:
relationship
-------------------
id - int (pkey)
boy_id - int
girl_id - int
is_romantic - tinyint
initiated_by_gender - char( M / F )
ended_by_gender - char( M / F )
date_started - datetime
date_ended - datetime (allows null)
date_married - datetime (allows null)
You can just go on from there, adding a description, a rating, etc.; anything else you can think of to define a human relationship.
And again, there is nothing that says we could not have technically just put all of these columns in our original boy_girl_jn table; we're not really talking about technicalities here - we're talking philosophy. As we added more than just the foreign keys to our join table, the intent and the meaning of our table took a dramatic turn. We really were starting to model another entity rather than a table relationship.
And what about recurring relationships? With a human relationship, just because it ends, it doesn't mean that it won't start again - I am sure we have all been there at some point. I was in a relationship once that had two "romantic" phases separated by a "friend" phase. This could be recorded as three separate relationships with different start and end dates. When you get something like this happening, you can really see that the concept of a "join" table has been totally eradicated.
Thanks for letting me think out loud. I hope that you got something out of this.
Reader Comments
But Ben, your relationship table isn't politically correct. What if the relationship needed to be boy_id and boy_id? "Not that there's anything wrong with that."
@Matt,
"Should we even be talking about this?" :)
It's funny you mention that as I thought about it as I was writing the post. I felt like I just wanted to keep it simple.
@Matt
PERSON_ID would be the most PC.
;^)
@Michael,
I always have difficulty figuring out what the actual columns should be called. In this case, we would have two person_id values for the halfs of the relationship. I end up doing things like:
person_id_1
person_id_2
But I always hate that. I never feels descriptive to me. Any suggestions?
I reckon this is when you start to make words up..
relationshipper1
relationshipper2
easy! :)
Mat
Shouldn't you actually be thinking about a join table + an entity table?
eg:
boys [entity]
girls [entity]
relationship_details [entity]
boys_girls_relationship_details_jn
Nothing stopping you from linking more than one table, after all...
@Jason,
I had not thought of that. However, since the relationship details table cannot link to more than one "relationship", I am not sure you will gain anything by having the join table.
Well I'm just glad that you implicity acknowledged that no relationship ends on mutual terms! People always say it was a mutual thing, but someone always wants it to end more than the other!!
@Pete,
That reminds me of that Seinfeld episode where Jerry and Genine Garofalo agree to break up at the exact same time; he claims its the first mutual break up in history :)
Ben,
Not sure about that. Couldn't people potentially have multiple types of "relationship" with people? Consider the way Facebook/LinkedIn/etc work - you can be friends with someone, a coworker, gone to school together, etc.
Has there ever been an attempt to model life in a db?
A real model would be infinitely big, but maybe a simplified model? Not sure what you would use it for but would be interesting to see how it worked.
Mat
To make it politically correct, you could take the PersonIDs out of the relationship table completely, and instead have a join table (don't you love how this comes full circle?).
create table Person (PersonID int, Name char(1), Sex bit)
create table Relationship (RelationshipID int, IsRomantic bit)
create table RelationshipPeople (PersonID int, RelationshipID in)
So now I can have two boys or two girls in a relationship. I can also have threesomes and so on. Well obviously, *I* can't have threesomes because I'm a geek who would bother to model this in SQL. Also, a person can be in multiple relationships.
@Chris,
That last line was too funny :) I like what you've done there. That's a cool solution.
Ben,
Not sure about that. Couldn't people potentially have multiple types of "relationship" with people? Consider the way Facebook/LinkedIn/etc work - you can be friends with someone, a coworker, gone to school together, etc.
I disagree with you on this (and agree with your initial train of thought). There is no fundamental difference between a relationship table and entity table, in much the same way there is no difference between a circle and an ellipse - one is simply a specific instance of the other. In the geometric example, a circle is an ellipse where the foci just happen to be collocated; in the database example, a relationship table is an entity table that just happens to have no additional columns.
Every relationship (be they one-to-one, one-to-many or many-to-many) can be garnished with additional details if one so chooses. In the case of one-to-one and one-to-many relationships, we might create a table between the existing two and add the details there. The only difference with a many-to-many is that the table already exists.
Hi Ben.. Love your passion for CF, but when designing databases all tables are entities. With relational modeling, all you have are entities with relationships depicted with cardinality. With dimensional modeling you have fact tables, dimensions and factless fact tables (just keys) or intermediate table to explain the Many to many relationship. I think the terminology really should be intermediate table when dealing with m:m relationships (foreign keys only). Thanks, Tim