Ask Ben: Querying Groups With Conditional Elements

By Ben Nadel

Published 2008-01-13 in Ask Ben, ColdFusion, SQL — Comments (2)

I hav a situation in my academic project, and was struggling to get a proper approach towards the solution, then i decided to ask 'kind-hearted' ppl for help ... while browsing net for forums wid posts with similar situation, i 'stumbled' upon you site. it's a nice and great job u r doin here ben. thanks a ton... do consider to direct me towards the solution for the following situation.

The following table reflects the One-to-Many relationship between TransID and TagID
--------------------  
|TransID | TagID |  
-------------------- 
| 1 | 1 |  
| 1 | 2 |  
| 2 | 1 |  
| 2 | 2 |  
| 2 | 3 |  
| 3 | 3 |  
| 3 | 4 |  
--------------------
i would need to know how exactly i should be aproaching the problem to find out solution for following situations:
STRICT MODE:  
------------  
  
\* TransID with TagID \(1 AND 4 AND Not 2\)  
Result : i should get only TransID:3 as the result. (ALL Conditions shud return true, so in the interface the user would have selected Having 1,4 and not having 2).
Relaxed Mode:  
-------------  
  
\* TransID with TagID \(1 OR 4 OR Not 2\)  
Result : i should get TransID:1,2,3 as the result, since each of the TransID satisfies one or other condition. (ANY Condition returning true the TransID qualifies to be returned, and in the interface the user would hav selected Having 1,4 and not Having 2)

Before we get into this, let's start out by building a SQL temp table into which I can insert the test data:

<!---
	Create the SQL build scripts so that we can reuse it
	within the different CFQuery tags.
--->
<cfsavecontent variable="strBuildSQL">
	DECLARE @data TABLE (
		trans_id INT,
		tag_id INT
	);


	<!--- Populate data. --->
	INSERT INTO @data
	(
		trans_id,
		tag_id
	)(
		SELECT 1, 1 UNION ALL
		SELECT 1, 2 UNION ALL
		SELECT 2, 1 UNION ALL
		SELECT 2, 2 UNION ALL
		SELECT 2, 3 UNION ALL
		SELECT 3, 3 UNION ALL
		SELECT 3, 4
	);
</cfsavecontent>

Now that we have that in place, our @data table should mimic what the user has described (although I have changed the column names to line up more appropriately with my own database naming conventions).

Ok, so let's take a look at the problem. For the Strict mode part, the first thing that jumps to mind is grouping. We want to check out the group of TransIDs and make sure that each group contains either TagID 1 or 4, but definitely not TagID 2. To examine groups in SQL Server, we can use the GROUP BY clause. Then, to examine the contents of each group, we can leverage the fact that the COUNT() aggregate can handle CASE statements. If you remember my exploration of the COUNT() / CASE scenario, you will remember that the COUNT() aggregate does not include NULL values in its tally. Therefore, if we can conditionally include values, we can use the COUNT() of each group to narrow down the selection.

Taking this understanding and applying it to the problem at hand, we want to make sure that each TransID group has at least one count of the TagID 1 or 4 and a zero count of the TagID 2:

<!--- Query for TransID values. --->
<cfquery name="qStrict" datasource="#REQUEST.DSN.Source#">
	<!--- Build SQL table and populate. --->
	#PreserveSingleQuotes( strBuildSQL )#


	<!--- Get the appropriate trans IDs. --->
	SELECT
		d.trans_id
	FROM
		@data d
	GROUP BY
		d.trans_id
	HAVING

		<!---
			Make sure that the group has either
			tag ID 1 or 4.
		--->
		COUNT(
			CASE
				WHEN
					d.tag_id IN ( 1,4 )
				THEN
					d.tag_id
				ELSE
					NULL
			END
		) > 0
	AND
		<!---
			Make sure that the group does not also contain
			the tag id, 2.
		--->
		COUNT(
			CASE
				WHEN
					d.tag_id = 2
				THEN
					d.tag_id
				ELSE
					NULL
			END
		) = 0
	;
</cfquery>


<!--- Output out results. --->
TransID: #ValueList( qStrict.trans_id, ', ' )#

Running this SQL, we get the following output:

TransID: 3

This works, but it looks really complicated. We can start to simplify the query if we look at the problem in a slightly different way. We can think of the 1 and 4 requirements as filter and then the NOT 2 as a secondary layer of filter. Let's run a query that includes just the first part of that filtering:

<!--- Query for TransID values. --->
<cfquery name="qStrict" datasource="#REQUEST.DSN.Source#">
	<!--- Build SQL table and populate. --->
	#PreserveSingleQuotes( strBuildSQL )#


	<!--- Get the appropriate trans IDs. --->
	SELECT
		d.trans_id
	FROM
		@data d
	WHERE
		d.tag_id IN ( 1,4 )
	GROUP BY
		d.trans_id
	;
</cfquery>


<!--- Output out results. --->
TransID: #ValueList( qStrict.trans_id, ', ' )#

This gives us the following output:

TransID: 1, 2, 3

Good, the first part works. Now, we want to apply the second filter, not having TagID 2. We can't just add another simple WHERE clause for != 2 since any TagID that is 1 or 4 is ALSO not equal to 2. What we need to do is apply the not 2 to each group of TransIDs. To do this, we are going to add a NOT EXISTS clause:

<!--- Query for TransID values. --->
<cfquery name="qStrict" datasource="#REQUEST.DSN.Source#">
	<!--- Build SQL table and populate. --->
	#PreserveSingleQuotes( strBuildSQL )#


	<!--- Get the appropriate trans IDs. --->
	SELECT
		d.trans_id
	FROM
		@data d
	WHERE
		d.tag_id IN ( 1,4 )
	AND
		<!---
			Make sure that this TransID value does not
			exists in a group that ALSO contains a TagId, 2.
		--->
		NOT EXISTS
		(
			SELECT
				1
			FROM
				@data d2
			WHERE
				d2.trans_id = d.trans_id
			AND
				d2.tag_id = 2
		)
	GROUP BY
		d.trans_id
	;
</cfquery>


<!--- Output out results. --->
TransID: #ValueList( qStrict.trans_id, ', ' )#

Running the above code, we get the following output:

TransID: 3

This version works as well. To me, this is somewhat easier to read than the leveraged COUNT() aggregate; however, I am not sure exactly which one is going to be faster. I am sure that it will depend somewhat on the size of the table and on the indexes that are in place.

Now that we have the Strict Matching out of the way, let's take a look at the Relaxed Matching mode. Relaxed matching is much easier as it doesn't really apply to the group level, but rather to the individual records being examined. As such, we can simply use a straightforward WHERE clause to limit each returned TransID and then group by the TransID to make sure we don't get duplicate values:

<!--- Query for TransID values. --->
<cfquery name="qRelaxed" datasource="#REQUEST.DSN.Source#">
	<!--- Build SQL table and populate. --->
	#PreserveSingleQuotes( strBuildSQL )#


	<!--- Get the appropriate trans IDs. --->
	SELECT
		d.trans_id
	FROM
		@data d
	WHERE
		d.tag_id = 1
	OR
		d.tag_id = 4
	OR
		d.tag_id != 2
	GROUP BY
		d.trans_id
	;
</cfquery>


<!--- Output out results. --->
TransID: #ValueList( qRelaxed.trans_id, ', ' )#

Running the above code, we get the following output:

TransID: 1, 2, 3

Works fine and as you can see, relaxed mode matching is much easier and faster to perform.

I hope that helps. Please feel free to follow up with comments below.

Want to use code from this post? Check out the license.

Short link: https://bennadel.com/1121

Reader Comments

Vasanth Jan 15, 2008 at 10:11 AM

1 Comments

Very thanks for the solution. please consider the situation,

if the 'Strict mode' is made more strict,

i.e when TransID with TagID (1 AND 4 AND Not 2) means :
TransIDs are expected to have both TagID 1 AND 4 (in the solution you coded, using TagID (1,4), will result in '1 OR 4')

so only when user is searching for TransID with TagID (1 AND 2 AND NOT 3) only the TransID: 1 satisfies the condition and should be returned ...

i am using a query like

SELECT TransID
FROM Trans_Tags
WHERE TransID IN (SELECT TransID in Trans_Tags WHERE TagID = 1) AND
WHERE TransID IN (SELECT TransID in Trans_Tags WHERE TagID = 2)
AND
WHERE TransID NOT IN (SELECT TransID in Trans_Tags WHERE TagID = 3)

will this query run efficiently in a huge tables with say a million records ?! or could you suggest me any better approach ?!

thanks for your efforts.

Ben Nadel Jan 25, 2008 at 9:47 AM

15,902 Comments

@Vasanth,

I have been told that it is more efficient to use the EXISTS construct rather that the IN construct. Also, if the tables are indexed on the proper columns, then the look up should be fairly fast.

Oh my chickens, this post is old!

Hit me up on Twitter if you want to discuss it further.