How To Store Arbitrary And Transient Attributes With Your User Data
Ok, so this is less of a "how to" lesson and more of a "how do I?" conversation. I'm sure we've all run into this kind of problem before. You need to show some info-box or some widget to a user; and, when the user closes it, you (and your web application) need to remember that "this" user closed "that" info-box such that you don't show the same info to the same user again. Often times, once the user closes the info-box, that [flag] data has no real future value; and that makes me angry because I have to store it. It makes me wish that I had a better strategy for storing that data. So, I wanted to brainstorm some possible approaches.
Context: You need to store some "hasSeenIntroVideo" flag for each user.
Storing Additional Data In A Non-Expiring Cookie
This is probably the first kind of approach that I ever went with. Need to store a conditional user flag? Throw it in a cookie and forget about it. This is actually a really nice approach, to some degree, because it requires very little change in your application logic. No database changes; no data access changes; just a simple cookie check.
The problem with using cookies is that they cannot be relied upon. The user can clear them. The user can jump on a different machine. The user can enter "Incognito" mode in the browser. The user can run out of cookie space (theoretically). When you rely on cookies, it's actually quite easy to show a user some information that they've already seen.
Adding Additional Columns To Your User Table
Once I realized that the cookie approach doesn't work well enough, the next lowest-hanging fruit option is to simply tack on the needed Boolean column in the user database table. The first time you do this in your application, it feels relatively painless. Sure, you have to update the database schema and the data-access layer; but, it's not that much code to touch.
The pain starts to set in when you add more columns over a longer period of time. At some point, it may not even be a simple ALTER statement - at some point, you may not be able to "lock" the table up while adding the new column. When this happens, you have to start building "shadow" tables with the new schema. This is a lot of work.
And, as the work of adding a new column increases, the emotional strain of storing "useless" data starts to set in. You mean it's gonna take me five hours to implement a stupid flag so that the Product team can show a Happy Holidays modal to the user? Come on!
Adding A Hash Column To Your User Table
One time, when a table got too big to easily tack on a new column, I tried to future-proof the approach by adding a "hash" column. By this, I mean that I added a "text" column that would store a JSON (JavaScript Object Notation) value. This way, new flags could be added to this hash without having to change the structure of the table.
While this approach worked, it never sat right with me, philosophically; I feel like it corrupted the concept of "relational" database schemas. Philosophy aside, this approach also requires more code because you now have to serialize and deserialize the hash as it moves into and out of the database. Also, you can't [easily] query the database table based on various flags stored within the hash.
Treating Transient Data As First-Class Citizens
The next thing I want to try, though I haven't actually done this yet, is to promote these transient bits of data into actual first-class citizens. By that, I mean that I want to think of them as holistic units of information rather than as "attributes" of something else. So, instead of tacking on a "hasSeenIntroVideo" column onto the "user" table, I'd like to create a separate "intro_video_viewing" table that has a schema like:
- userID (PKey)
- durationInSeconds (tinyint)
- viewedAt (date/time)
Now, instead of setting a column value in the "user" table, I'll actually insert a new record into the "intro_video_viewing" table. Do I really need the last two columns? Probably not. But, adding them helps me transcend the "column" mental model and start seeing this transient data as something more full-bodied and robust.
Of course, this probably requires more code than any of the other approaches as this new model needs to be pulled through every layer of the application. That said, the emotional payoff is that I can now keep my "user" table in a much more clean, more consistent state. And, in time, as this information becomes irrelevant, I can simply DROP the table (and remove the corresponding code).
What About "Useless" Data That Can Never Be Forgotten?
The problem with all of these approaches, is that some data can never be forgotten. And yet, at the same time, it feels like it holds no value. "Intro" and "tour" data is a perfect example of this. Even after a user has been in your system for months or years, you still have to remember to NOT show them the intro video. This means that the "has seen" data need to be persisted. This is super lame and makes me sad.
Perhaps the best approach, and likely the most complicated, is to use the transient data in conjunction with the user data to conditionally query the database. For example, only worry about showing the "intro video" if the user signed up in the last 7 days. And, if the user has been around for more than 7 days, don't bother even checking. This way, the transient data table can be pruned without providing the user with a redundant experience.
Anyway, just some thoughts on the matter. If anyone has any thoughts or advice on the topic, I would love to hear it.
Reader Comments
One thought is: "You're close" with "Adding Additional Columns To Your User Table". Instead of that, though, why not add a separate table, linked by the UserID as a foreign key. Then you keep your primary user data clean, but can throw any old column in. If you're using MSSQL, look into the SPARSE columns. I'm really liking them for places where I can't use a NoSQL solution, which would also be really good for this scenario.
Ooops. Re-reading, I see you have a similar option next. My eyes must've missed it!
@Will,
I'll take a look at the Sparse column stuff. I'm on MySQL, but reading it may be inspirational; and, MySQL may have related constructs.
I think the reason I have been so hesitant with the additional table approach is simply that it makes the app more complex, and/or required a LEFT OUTER JOIN (which I generally try to avoid). That said, I think it's just time to pull my big-boy pants on :)
You're still thinking horizontally when talking about making an "intro_video_viewing" table... what happens as you add more info about viewing the video? More columns. What happens when you need more info about some other aspect of your site? More tables with more columns.
Try this:
Users table (UserID, stuff you need)
UserDataTypes table (UserDataTypeID, label, type [like string, number, etc])
UserData table (UserID, UserDataTypeID, value)
Then just make a stored proc to pull back specific bits of data by UserDataTypes label and cast it to the right type. As you add new data types, you have an expanding capacity to track data and you never have to add more tables.
I don't know that specific user activities deserve their own DB table. With this design, you'd inevitably end up with "remarket_video_viewing" and "email_popup_viewing" tables as well. You'd find yourself with and endless list of "_viewing" tables. Might be better to refine those actions into a set of "user_event" and "user_event_type" tables, but maybe I'm misreading the intent.
Optionally, since you seem to be indicating the user would be logged in, so you'd know who they are, you could load their "extra" data into session storage, and just flush/commit any changes to these areas (session.userFlags, whatever) in onSessionEnd(), or something similar.
Then you could have separate routines for loading said "crazy, mismatched data" without interfering with your primary user save/fetch code. A listener/observer pattern might be an interesting way to go for login/logout.
@Jon, @Aaron,
I think you guys are saying something similar (to each other), and I'm intrigued. If I am understanding you correctly (@Jon, sorry I had some trouble grasping what you said), it sounds like you're suggesting having a simple "Event" type table that I can use to store arbitrary flags, like some long key-value store.
user_event:
-----------
* userID (ex. 4)
* eventType (ex. hasSeenIntroVideo)
* occurredAt (ex 2014/01/15)
Then, I could use a multi-col index to quickly look this stuff up.
If that's what you mean, this is a super interesting idea! One that I had not considered at all. This would allow for me to easily add new events at any time.
Am I understanding?
I'm fine with the "Hash" approach. Fine enough that I've taken this approach a few times. If you view that JSON value as a discrete value, rather than as a set of data points, then it really doesn't violate relational normalization rules.
Yes- whichever method you use, there are compromises to be made. I've found storing JSON to be the least objectionable compromise. And not difficult to implement or change. In fact, this approach once saved me from a massive overrun on a project. Rather than redesigning the relational database structure and then the data layer of the application, I changed the JSON structure.
@Ben,
@Jon & @Aaron are definitely on the right path, also, don't be afraid to use a Left [Outer] Join. I would tend to call the table "user_preference" and create it like this.
user_preferences
--------------
* userId
* preference_type string or id - (this could include the IntroVideo, but give the flexibility in the setting)
* setting char(1) - (for the video, it could be multivalue, like remind-me, never-show, watched, etc., i would probably use a single character or id with a lookup table of values)
* create_date
* update_date
This way you check to see if the user has a preference and if not, set a default of "not set". The design is flexible and can be used in various ways. I would create an asynchronous updater for this table, as the user could update a preference numerous times, so you wouldn't want to hold up any other user action, just to save this.
Anyway, a table is the right model. How often and when to load it will become your next hurdle as a user's preferences could change mid-session. Also, since you're using AngularJS, you could setup an event loop that upon save, pulls the data back and stores it as it own Json object. This way, you aren't overloading the User object and if the Preference object is empty, the user get's defaults.
Hope this helps.
@Ben,
Exactly. It's a more generic design, and scales pretty well.
As a slight modification, you might add a "user_event_type" table, and store it's id on your "user_event" table.
user_event:
-----------
* userID (ex. 4)
* eventTypeID (ex. 2)
* occurredAt (ex 2014/01/15)
user_event_type:
-----------
* eventTypeID (ex. 2)
* eventTypeID (ex. hasSeenIntroVideo)
I don't know that you get any actual benefits in this, but it does feel "more normal".
Ooops, meant to have column 2 called "eventTypeName"...you get the point
This is all really compelling. I'm actually pretty jazzed about this idea and we have something [at InVision] that could put this into use right-away. Currently discussing your ideas with the team.
@Brandon,
I've just always wished my OUTER joins were INNER joins :D
What you guys are basically suggesting is the perfect use case for Redis. Redis is basically a super fast key-value store.
A lot of backends actually use Redis to store their persistent user sessions, so it's well suited for storing persistent user data of any kind.
There are a few adapters for CF and Redis.
@Ben + @Aaron,
Yeah, Aaron's example is pretty much what I'm saying, although his version is restricted to recording timings, where instead of having an something like "occurredAt" with a type of "date", I have a "value" column with whatever type is "any". This allows the table to hold any user data that is arbitrary... such as preferences for colors, time they viewed events, how they last rated the application, etc. If the idea of having a column with multiple variable types unnerves you (and it does some people) then you can extend this by having a table for each type you may want to store in your "value" field such as UserDataNumeric, UserDataDate, etc. that all have the same fields, just with different column types for "value". This does add a few more tables, but makes the "value" field less mutable. The downside is then that your CRUD code around these variables becomes a lot more complex, since you have to join and delete across many more tables and pick which one gets INSERTs based on the type of data.
Example from a past MSSQL project:
User
----
ID, bigint auto increment
... (other user stuff)
... (standard created/updated/deleted fields)
Preference
ID, bigint auto increment
Label, nvarchar
DataType, nvarchar (I actually abstracted this as another table)
... (standard created/updated/deleted fields)
UserPreferences
UserID, bigint, FK User.ID
PreferenceID, bigint, FK Preference.ID
Value, sql_variant (value of the user preference)
... (standard created/updated/deleted fields)
This setup meant that a simple query like:
SELECT
c.[Label] AS [Preference],
c.[DataType] as [Type],
b.[Value]
FROM User AS a
LEFT OUTER JOIN UserPreference AS b
INNER JOIN Preference AS c
ON c.ID = b.PreferenceID
/* And not deleted, etc */
ON b.UserID = a.ID
/* And not deleted, etc */
WHERE
a.ID = {ID}
/* And not deleted, etc */
Or something like that...
This is on a similar track to what @Jon and @Aaron are doing. One of the things that I've done in the past is very similar to their approaches, particularly because it is infinitely expandable.
What did I do differently? I created a view that performed a CROSS JOIN between the user and the data types, added a LEFT JOIN with the value table, then did a PIVOT so each data type is now a column for the user row. This makes it _look_ like the user table has these additional columns, but really it's implemented as a many-to-many in the database.
@Jon,
Ah, Ok, I see what you're saying now. That would definitely make it more flexible. Right now, or at least up until now, we've only have had to store boolean flags for these kinds of things; so, it didn't occur to me that the variation would be helpful. Awesome feedback!
@Paul,
That's an idea I'm wrestling with mentally - when / how often do I want this data? That was what is so appealing about the "tack it onto the user table" approach - it all comes back as one record - easy-peasy. But, some of this data will only ever be used occasionally, so now I'm thinking it might get it's own access methods.
@Jonathan,
I don't know much about Redis right now, but I think we're starting to use in one area of our app. Looking forward to learning more about it.
@Paul,
I've experimented with that in the past and found using PIVOT on the fly to be a dog in performance. Ended up using triggers to populate a denormalized version of the table to address performance and query complexity issues.
How do you approach those problem?
@Ben
I just think clogging up your MySQL database with a key-value store might not be the best solution. This data is not exactly critical (if you were previously willing to use a cookie), probably won't be used in a relational manor, and is infinitely expandable.
Redis couldn't be simpler. It sits in memory and backs itself up to disk. So it's forever persistent but SUPERFAST. There's a Java library called JEDIS which has been implimented in CF here: https://github.com/MWers/cfredis/
Consider that perhaps you want to store how much a video someone has watched and the resume when they come back to your page.
From ColdFusion (assuming you've got Redis library) you just do
Now to read that data you just do
This is WAY faster than a query because it pulls directly from memory and it's extremely convenient if you're using angular because you are one line away from returning that JSON to the front end. No fussing with cfquery.
Oh, and if you don't want to deal with an entire map you can also use
that sets a specific value within a map. And
You should see the benchmarks on this stuff, it's insane. Even with 100,000 keys (users) filled with maps it will probably be less than a millisecond in most cases.
I think what's being described here is the entity attribute value (EAV) data model. You can read about it here:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
You will find many opinions on EAV, some good, some bad. I've found it to be very useful, especially in the use case that you've described Ben.
@Zack,
I haven't had the opportunity to see how well it worked on large databases, unfortunately. I don't know how you would address that. If it's stored as a view, as I had suggested, it would probably be difficult to employ filters prior to doing the PIVOT. However, if you write a stored procedure, you could employ the filter before the PIVOT. Thoughts?
As stated above a Key/Value store like Redis might be a perfect fit. That would be my vote :)
IF all your values are boolean, you may consider having an integer (or larger type) column and use bit switches for the fields. Querying is supposed to be very fast, and when you run out of bits you can add another column.
Downside: not easily human readable.
@Ayelet,
We played around with some bit-masking for certain things. For example, someone needs to go through a "Wizard" or sorts and we need to see which steps they have viewed - we could store each "bit" as a step. That said, I think each example of that was eventually simplified to be a simple on/off.
@Jonathan, @John,
We may actually end up using some Redis at work, so hopefully I can get some experience with it. I'm definitely itching to play around more with key-value store stuff. I've used ColdFusion's built-in cacheGet/cachePut stuff; the problem with it, however, is that it's harder to use when you have multiple web-nodes.
I'm sure smarter people can get the underlying cache management to point to something else; but, I'm not close to being there yet.