Ask Ben: Pulling Unique SQL Records That Match ALL Join Conditions

Posted April 23, 2009 at 10:00 AM

Tags: ColdFusion, SQL, Ask Ben

The reader question is far too long to put here, but the jist of it was that the reader had a primary table that joined to several "property" tables in a One-to-Many kind of relationship. These properties could be selected in a search form and the reader wanted to be able to return all primary table records that matched at least every selected property. To demonstrate the way that I handle this, I set up a simple SQL database with the following tables:

girl
-----------------
id (int)
name (varchar)

trait
-----------------
id (int)
name (varchar)

girl_trait_jn
-----------------
girl_id (int)
trait_id (int)

So that we are all on the same page, I made this video walk through:

 
 
 
 
 
 
 
 
 
 

This is a simple, one page form that posts to itself:

 Launch code in new window » Download code as text file »

  • <!--- Param form value. --->
  • <cfparam
  • name="FORM.list_trait_id"
  • type="regex"
  • pattern="[\d,]*"
  • default=""
  • />
  •  
  •  
  • <!--- Check to see if we have any trait. --->
  • <cfif Len( FORM.list_trait_id )>
  •  
  • <!--- Query for girls. --->
  • <cfquery name="qGirl" datasource="ben">
  • SELECT
  • g.id,
  • g.name
  • FROM
  • girl g
  • INNER JOIN
  • (
  •  
  • <!---
  • Get only girl IDs from the JOIN table where
  • ALL of the trait IDs match. Because we know
  • that the JOIN records are unique, we know
  • that the COUNT(*) on the GROUP BY must equal
  • the length of the trait ID list.
  • --->
  • SELECT
  • gtjn.girl_id
  • FROM
  • girl_trait_jn gtjn
  • WHERE
  • gtjn.trait_id IN ( <cfqueryparam value="#FORM.list_trait_id#" cfsqltype="cf_sql_integer" list="true" /> )
  • GROUP BY
  • gtjn.girl_id
  • HAVING
  • COUNT( * ) = <cfqueryparam value="#ListLen( FORM.list_trait_id )#" cfsqltype="cf_sql_integer" />
  •  
  • ) AS filter
  • ON
  • g.id = filter.girl_id
  • ORDER BY
  • g.name ASC
  • </cfquery>
  •  
  • </cfif>
  •  
  •  
  • <!--- Query for all traits. --->
  • <cfquery name="qTrait" datasource="ben">
  • SELECT
  • id,
  • name
  • FROM
  • trait
  • ORDER BY
  • name ASC
  • </cfquery>
  •  
  •  
  • <cfoutput>
  •  
  • <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  • <html>
  • <head>
  • <title>Select Records My ID Set Matching</title>
  • </head>
  • <body>
  •  
  • <h1>
  • Select Traits
  • </h1>
  •  
  • <form action="#CGI.script_name#" method="post">
  •  
  • <p>
  • <cfloop query="qTrait">
  •  
  • <label>
  •  
  • <input
  • type="checkbox"
  • name="list_trait_id"
  • value="#qTrait.id#"
  • <cfif ListFind( FORM.list_trait_id, qTrait.id )>
  • checked="true"
  • </cfif>
  • />
  •  
  • #qTrait.name#
  •  
  • </label>
  • <br />
  •  
  • </cfloop>
  • </p>
  •  
  • <p>
  • <input type="submit" value="Get Girls!" />
  • </p>
  •  
  • </form>
  •  
  •  
  • <!--- Check to see if we have a search. --->
  • <cfif Len( FORM.list_trait_id )>
  •  
  • <h2>
  • Matching Girls
  • </h2>
  •  
  • <ul>
  • <cfloop query="qGirl">
  •  
  • <li>
  • #qGirl.name#
  • </li>
  •  
  • </cfloop>
  • </ul>
  •  
  • </cfif>
  •  
  • </body>
  • </html>
  •  
  • </cfoutput>

The technique that I use leverages the fact that the join table has unique records, meaning that the combinations of girl_id and trait_id are all unique. Because of this, I know that I can filter the join table using the list of selected trait IDs and then GROUP it by the girl_id. If the COUNT(*) of the resultant GROUP is equal to the list length of the selected traits, I know that the current girl matches all necessary traits. If the GROUP COUNT(*) is less than the length of the trait ID list, then I know that at least one of the selected traits was not found in the join table for the given girl.

The benefit of doing it this way is that you only have to filter the join table once for the query and it allows you to do some filtering early on. If we put any of this in the WHERE clause, it would mean we'd have to be running queries for every girl returned. Yes, the WHERE-style execution might make better use of indexes, but I feel that the sheer number of sub-queries required outweighs that; of course, that will probably depend on the number of joins and the expected result set size.

There are a number of ways to solve this problem and this is the one that I have found to be the most easy to write, read, and maintain. Of course, I expect Rick Osborne to come in at some point and comment on how this is completely inefficient; but, until that happens, the above is my best advice.

Download Code Snippet ZIP File

Post Comment  |  Ask Ben  |  Permalink  |  Print Page



Learning ColdFusion 9 - ColdFusion 9 tutorials, samples, examples, demos

Reader Comments

Apr 23, 2009 at 4:58 PM // reply »
3 Comments

Thanks for the insight Ben!

We have an advanced search page we're building for a client, and were thinking of ways of doing something similar to your example. Your solution's nice and streamlined. I look forward to hearing what others have to say.

Your blog's a huge help and great training reference. Thanks again.


Apr 23, 2009 at 5:11 PM // reply »
6,516 Comments

@Steve,

Always glad to help out. I hope we get some good conversation going on this.


Apr 24, 2009 at 9:34 AM // reply »
3 Comments

Ben,

you really rock !! Whatever I get to do in CF, i search in your site first than in Google and i find a solution for sure.

I was working on a search with ranking based on relevance, and there you go. You readily had my solution, though my search was not for girls ;)


Apr 24, 2009 at 10:22 AM // reply »
6,516 Comments

@Sanjeev,

Awesome! I'm always pumped up to know that this stuff is helpful!


Apr 24, 2009 at 1:53 PM // reply »
3 Comments

Really Cool Ben


Jun 22, 2009 at 2:01 PM // reply »
5 Comments

Absolutely wonderful! I searched online for 3 days trying to figure this out - trying combinations of ANDs, CONTAINS, etc.

I highly recommend visitors watch the short video, as I didn't realize the importance of the COUNT(*) initially and excluded it.

It's very clever to pull in a list matching "any" (IN clause) and then filter it by checking to see if the list of "any" items equals the amount (COUNT) of pre selected options - in this case "traits"

BRAVO!


Jun 22, 2009 at 2:03 PM // reply »
5 Comments

I forgot to mention that I did not get a unique result-set, I did have to use a DISTINCT keyword on the first SELECT

SQL Server 2005


Jun 22, 2009 at 2:07 PM // reply »
5 Comments

I am providing my statement as thanks :)
I have a tblDocument, tblTag, and a tblDocument_Tag linking them - be sure to adjust the COUNT when you add, remove values to match

SELECT DISTINCT doctag.dt_tagKey
FROM tblDocument_Tag AS doctag
INNER JOIN
(
SELECT dt_tagKey
FROM tblDocument_Tag AS tmp
WHERE (dt_docKey IN (17, 18, 44, 21)
)
GROUP BY dt_tagKey
HAVING (COUNT(*) = 4)) AS filter
ON doctag.dt_tagKey = filter.dt_tagKey


Jun 22, 2009 at 2:22 PM // reply »
6,516 Comments

@Ryan,

Always glad to help! Seems like you really got some good use out of this.


Jun 22, 2009 at 5:55 PM // reply »
5 Comments

Crap! I just rewatched the video at home (as opposed to work) and discovered that the GROUP BY needs to be inside the "inline\temp" selection, no wonder it worked perfectly in SQL Manager Studio yet returned seemingly random results in my application :)

I can't gaurentee the validity of this statement, (because I'm at home) but I believe using my dataset from before the CORRECT statement would be:

SELECT DISTINCT doctag.dt_tagKey
FROM tblDocument_Tag AS doctag
INNER JOIN
(
SELECT dt_tagKey
FROM tblDocument_Tag AS tmp
WHERE (dt_docKey IN (17, 18, 44, 21)
GROUP BY dt_tagKey
HAVING (COUNT(*) = 4)) AS filter
)
ON doctag.dt_tagKey = filter.dt_tagKey

and then to go one step further and actually get all the necessary "Tag" information by packaging this SQL inside one more, which is simple, since you already have a reference to the tag you want, so you don't need any fancy filtration tricks, since the only data available to you now...is the correct data.

SELECT *
FROM tblTAG
WHERE (tagKey IN
(
SELECT DISTINCT doctag.dt_tagKey
FROM tblDocument_Tag AS doctag
INNER JOIN
(
SELECT dt_tagKey
FROM tblDocument_Tag AS tmp
WHERE (dt_docKey IN (17, 18, 44, 21)
GROUP BY dt_tagKey
HAVING (COUNT(*) = 4)) AS filter
)
ON doctag.dt_tagKey = filter.dt_tagKey
)
)

This *should* return all the tags which have document(s) in common with ALL the existing\selected Tags - thereby allowing you to filter out documents based on which tags you've selected - so if you chose: "vegetable", you would get a list of veggies, and if you chose "red" you would retrieve a list of all the "vegetables" that are "red" - i.e. Tomatoes, Radishes, and Red Beans.

if you were then to select an additional tag: "fruit" you would see only "Tomatoes" since that one could be logically filed under "Vegetable", "Red" and "Fruit"

Am I right Ben?


Jun 23, 2009 at 8:42 AM // reply »
6,516 Comments

@Ryan,

You have some parenthesis in places that I don't fully understand (wrapping the WHERE clause to rename as filter). I am not sure what that is doing. My gut was thinking you are renaming the deeply nested SELECT result as filter, but the parens are confusing me.

I don't quite follow what the middle SELECT is doing. All you are gonna do is end up joining one tag back to itself. I am not sure the middle SELECT is adding anything.


Jun 23, 2009 at 6:28 PM // reply »
5 Comments

This is what ended up working for me, I'll explain how I see it, and if you can tell me where my logic is incorrect I would really appreciate
it...first the lump:

Application files Documents by "Tags" SO theres a Document, Tag, and Document_Tag table joining them

I want the user to be able to select a Tag and have it populate which tags have a document in common with ALL previously selected Tag (a similar, easier query fetches the documents)

SELECT t.*
FROM tblTag AS t
INNER JOIN
(
SELECT DISTINCT dt_tagKey
FROM tblDocument_Tag AS tmp2
WHERE (dt_docKey IN
(
SELECT dt_docKey FROM tblDocument_Tag AS tmp
WHERE (dt_tagKey IN
(4)
)
GROUP BY dt_docKey
HAVING (COUNT(*) = 1)
)
)
) AS filter ON t.tagKey = filter.dt_tagKey
WHERE (tagKey <> 4)

------

So, first I get the keys of the documents that are connected to Tag #4

-QUERY A-
SELECT dt_docKey FROM tblDocument_Tag AS tmp
WHERE (dt_tagKey IN (4))
^-List,of,Tags

Then using that record set I grab all the tags which have a connected doc that matches ALL the selected tags (thanks to the COUNT TRICK)-

-QUERY B-
SELECT DISTINCT dt_tagKey
FROM tblDocument_Tag AS tmp2
WHERE (dt_docKey IN
(

-QUERY A-

)
GROUP BY dt_docKey
HAVING (COUNT(*) = 1)
^-amount of selected tags

Now I should have all the Keys of Tags that have a document associated with all the previously selected tags (filtering down results)

Finnaly, I use the INNER JOIN to relate the keys to the actual Tag Table and grab everything I want to know about them, excluding the Tag I Just clicked, and ALL previous Tags By adding AND <>'s after the initial WHERE

SELECT t.*
FROM tblTag AS t
INNER JOIN
(

-QUERY B-

) AS filter ON t.tagKey = filter.dt_tagKey
WHERE (tagKey <> 4)

...AND (tagKey <> 16)
AND (tagKey <> 22)
AND (tagKey <> 19)...

It appears to work, except that it doesn't remove tags that have the exact SAME documents, so you can click more Tags and the results don't change - I assume the solution would be to test and see if the matching amount of docs in a Tag is equal to what is already displayed, and then just exclude that tag from being printed

As for all the extra parenthesis around WHERE's and AND's, these were auto inserted by SQL Server Management Studio, when I was developing the query probably their own take on "correct" T-SQL

If I excluded them, it would probably work - but I wonder if the database would parse the preferred formating on execution of the query (in my application, because it certainly does in the SQL pane in SQL SMS, and if that would cause some sort of performance hit?

Any help clearing up any misnomers I have would be appreciated.


Post Comment  |  Ask Ben

Recent Blog Comments
Nov 21, 2009 at 11:03 AM
Groovy Operator Overloading Does Not Work In The ColdFusion Context
Hi Ben, Thanks for this informative post. Now I am reading ur old posts too ... read »
Nov 21, 2009 at 10:56 AM
HostMySite.com Has The Best ColdFusion Hosting
@Mehul, Yes very nice people, however several downtimes per day which was not acceptable. Hence we had to move out. I am glad you are having good luck with them so far. ... read »
Nov 20, 2009 at 11:32 PM
Five Months Without Hungarian Notation And I'm Loving It
I've used headless camel case for years for not only ColdFusion variables, but also SQL tables and fields... pretty much everything involving code. I also subscribe to the "don't abbreviate and clea ... read »
Nov 20, 2009 at 11:00 PM
Five Months Without Hungarian Notation And I'm Loving It
@Marcel, Yeah, I always err on the side of longer but more readable variable names. As for the camel casing of CF methods and the headless camel casing of custom items, I get around this by always ... read »
Nov 20, 2009 at 10:56 PM
Five Months Without Hungarian Notation And I'm Loving It
I use the following and love it: my.namespace.MyComponents.functionMethodsOrUDF() CONSTANT_VALUES_OR_PROPERTIES One thing I always try is to CamelCaseBuiltInColdFusionFunctions() so others can tell ... read »
Nov 20, 2009 at 5:38 PM
Learning ColdFusion 8: CFImage Part I - Reading And Writing Images
Hi Ben, Great article. I've been looking around to see if ColdFusion image engine can programatically create the following "wrap around" effect: http://www.creativepro.com/article/photoshop-s-she ... read »
Nov 20, 2009 at 5:35 PM
Maintaining ColdFusion Sessions Across SMS Text Message Requests Without Cookies
@Dave: I talked to Gert he suggested: <cfhttp method="get" url="http://{some cf website}" result="stuff" addtoken="yes" /> Note the addition of cfhttp attribute addtoken. That should persist y ... read »