Ask Ben: Pulling SQL Records Based On A List Of IDs
Looping over a list - hey and thank you for all your help your the best!!!! I have a list I created 6,7,4,29,32,12 etc ... #theList# ... Each number is a record ID from the database. Now I want to loop over the list and pull a field from the record based on the ID. I cant get it man.
When I first got this email, I thought I would just whip up a quick example and send it on its way. But, then I got to thinking about it; there are several ways to accomplish this task, each of which has its own pros and cons and consists of a different level of complexity. And so, I thought I would cover a few different methods so people can see different ways to attack the same problem.
I think the solution that comes most naturally to people, esepcially those who are newer to programming with a database, is the one in which we iterate over the ID list and then for each list iteration, we make a call to the database for the corresponding record:
<!---
Create a list of ID's who's fields we want to
pull out of the database.
--->
<cfset lstID = "6,7,4,29,32,12" />
<!---
Loop over the list so that we can get the record
for each ID.
--->
<cfloop
index="intID"
list="#lstID#"
delimiters=",">
<!--- Query for record. --->
<cfquery name="qBlog" datasource="#REQUEST.DSN.Source#">
SELECT
b.id,
b.name,
b.date_posted
FROM
blog_entry b
WHERE
b.id = <cfqueryparam value="#intID#" cfsqltype="cf_sql_integer" />
</cfquery>
<!--- Output the row ID and name. --->
<p>
[ #qBlog.id# ]
#qBlog.name#
</p>
</cfloop>
Running the above code, we get the following output:
[ 6 ] FireFox and XStandard CSS File Linkage
[ 7 ] Over-Complicating Getting the Rest of List
[ 4 ] Trouble with XStandard and FireFox
[ 29 ] Getting Back Into The Gym
[ 32 ] Getting Group By Count In Left Outer Join
[ 12 ] Thanks to the Texas Web Developers
As you can see, all 6 IDs resulted in a record pulled from the database. Additionally, because we were looping over the list, each record is pulled in the same order in which the IDs appeared in the list.
This solution is good because it is simple and very easy to understand. It's also a tiny amount of code. This is all very appealing at first; but, as you start to evaluate code performance, you will notice that database calls are expensive and here, we are making a separate database calls for every single ID in our list. For a list of 6 IDs, this might not be so bad, but as our ID list grows, this is going to become more noticeable.
To overcome the multiple database calls problem, the next solution combines all the database calls into one query that returns a record for each ID in the list. The easiest way to accomplish this is to use the SQL IN clause. IN takes a comma delimited list and is used for existence checking.
<!---
Create a list of ID's who's fields we want to
pull out of the database.
--->
<cfset lstID = "6,7,4,29,32,12" />
<!--- Query for record. --->
<cfquery name="qBlog" datasource="#REQUEST.DSN.Source#">
SELECT
b.id,
b.name,
b.date_posted
FROM
blog_entry b
WHERE
b.id IN
(
<!---
Pass in the ID list as a listed query param.
The LIST attribute will cause this to act as
a comma delimited list in which each value
gets bound to an item in the list.
--->
<cfqueryparam value="#lstID#" cfsqltype="cf_sql_integer" list="true" />
)
ORDER BY
(
CASE
b.id
<!---
Loop over the ID list and ensure that the
records are ID-ordered in the same order
they are in the list.
--->
<cfloop
index="intIndex"
from="1"
to="#ListLen( lstID )#">
WHEN
#ListGetAt( lstID, intIndex )#
THEN
#intIndex#
</cfloop>
ELSE
-1
END
) ASC
</cfquery>
<!--- Loop over returned records. --->
<cfloop query="qBlog">
<!--- Output the row ID and name. --->
<p>
[ #qBlog.id# ]
#qBlog.name#
</p>
</cfloop>
The problem with the IN clause is that it only helps to filter the result set and has no impact on the order in which the IDs are returned (by default, the records are returned in the same order in which the SQL server reads them from disk). And, since I am assuming ID order is important (since the reader's ID list was NOT in an ascending order), we need to create a fairly complicated ORDER BY clause to get the records returned in the proper order.
If the ORDER BY statement is confusing, basically what it is doing is checking the record ID against the list of IDs and is telling the SQL server to order the record based on the index of the given ID in the ID list.
Running the above code, you will see that we get the same output:
[ 6 ] FireFox and XStandard CSS File Linkage
[ 7 ] Over-Complicating Getting the Rest of List
[ 4 ] Trouble with XStandard and FireFox
[ 29 ] Getting Back Into The Gym
[ 32 ] Getting Group By Count In Left Outer Join
[ 12 ] Thanks to the Texas Web Developers
This solution is good because we are getting all of the same information as we were in the first one, but instead of 6 separate database calls, we now only have one. This will be a nice performance boost. The caveat here is that when we output the data, we are looping over a query not our ID list. The con here is that the meat of the query is way more complex that it was in the first solution. This can be distracting to the overall meaning of the query and can make debugging a pain in the butt. Plus, as the ID list gets longer and longer (not that that is going to happen), the SQL IN clause becomes less and less efficient. And of course, there is much more code here to deal with in general.
The next solution attempts to overcome the cons of the previous solution by moving the complex logic out of the main query using an intermediary, in-memory table. Using just a single query, we are going to create a temporary table that has both the ID and the position (sort index) of all the IDs in the list. Our main query is then going to just join to that temp table.
<!---
Create a list of ID's who's fields we want to
pull out of the database.
--->
<cfset lstID = "6,7,4,29,32,12" />
<!---
Query for all the records that correspond to
an ID in our ID list.
--->
<cfquery name="qBlog" datasource="#REQUEST.DSN.Source#">
<!---
Because we are not just dealing with a list of IDs
but also with their order, I think the easist thing
to do would be to create an in-memory table that has
both the ID value and the sort position. This table
will be called "valid" as in "valid IDs".
--->
DECLARE @valid TABLE (
id INT,
sort INT
);
<!---
Now that we have our table declared, let's loop over
the IDs and populate it.
--->
INSERT INTO @valid (
id,
sort
)(
<cfloop
index="intIndex"
from="1"
to="#ListLen( lstID )#"
step="1">
SELECT
#ListGetAt( lstID, intIndex )#,
#intIndex#
UNION ALL
</cfloop>
<!--- This it to handle the last UNION ALL. --->
SELECT 0,0
);
<!---
ASSERT: The temp table @valid now has all the IDs from
the original list with a SORT column value that
reflects the position of the ID in the original list.
--->
<!---
Now, select all records where we have a given ID.
Since we are assuming the list has no duplicates,
we can accomplish this one-to-one relationship via
an INNER JOIN to our valid ID table.
--->
SELECT
b.id,
b.name,
b.date_posted
FROM
blog_entry b
INNER JOIN
@valid v
ON
b.id = v.id
ORDER BY
v.sort ASC
</cfquery>
<!--- Loop over the returned records. --->
<cfloop query="qBlog">
<!--- Output the row ID and name. --->
<p>
[ #qBlog.id# ]
#qBlog.name#
</p>
</cfloop>
Notice here that the bulk of the logic now goes into populating the temporary table. The main query, the one that returns the ID-based records, is now much more simple. The ORDER BY clause merely uses the sort column in the temp table. The only complex thing about it is that we are filtering the records based on the INNER JOIN to the temp table rather than using any WHERE clause.
Running the above code, you will see that we get the same output:
[ 6 ] FireFox and XStandard CSS File Linkage
[ 7 ] Over-Complicating Getting the Rest of List
[ 4 ] Trouble with XStandard and FireFox
[ 29 ] Getting Back Into The Gym
[ 32 ] Getting Group By Count In Left Outer Join
[ 12 ] Thanks to the Texas Web Developers
The SQL used here is much more advanced in general, but if you assume that the temp table is being populated properly, the chances of making an error in the main query are very low. Also, since we are using a JOIN rather than in IN clause, I believe (although I am not certain) that this solution will outperform the previous one as the list gets bigger.
I think the main cons with this solution are simply that there is a good deal of code to deal with and the SQL being used is someone advanced. And, for a very small list, this might just be overkill.
Each of these solutions, I think, has it's place depending on the context in which it will be used. I am not gonna say that one of these clearly beats the other. People with more experience may be able to speak more definitively about the performance difference between solution 2 and solution 3.
Hope this helps in some way.
EDIT: Suggested by Dustin - ColdFusion Query of Query Solution
Seeing as I love ColdFusion query of queries, I can't believe I totally forgot about this as a possible solution. Here, we are getting all the possible values in an up-front query as we did in solution 2. But this time we don't care about ordering, we just want the pool of records. Then, we can loop over the list and re-query our record set for the appropriate values (notice that when outputting the data, we are referring to the subquery, not the first query).
<!---
Create a list of ID's who's fields we want to
pull out of the database.
--->
<cfset lstID = "6,7,4,29,32,12" />
<!--- Query for record. --->
<cfquery name="qBlog" datasource="#REQUEST.DSN.Source#">
SELECT
b.id,
b.name,
b.date_posted
FROM
blog_entry b
WHERE
b.id IN
(
<!---
Pass in the ID list as a listed query param.
The LIST attribute will cause this to act as
a comma delimited list in which each value
gets bound to an item in the list.
--->
<cfqueryparam value="#lstID#" cfsqltype="cf_sql_integer" list="true" />
)
</cfquery>
<!--- Loop over the list. --->
<cfloop
index="intID"
list="#lstID#"
delimiters=",">
<!---
Query for local record using a ColdFusion Query of
Queries. While this is another SQL call, it does
not communicate with the database.
--->
<cfquery name="qBlogSub" dbtype="query">
SELECT
*
FROM
qBlog
WHERE
id = <cfqueryparam value="#intID#" cfsqltype="cf_sql_integer" />
</cfquery>
<!--- Output the row ID and name. --->
<p>
[ #qBlogSub.id# ]
#qBlogSub.name#
</p>
</cfloop>
Some people get mixed results with ColdFusion query of queries in terms of performance. What you save in the cost of communicating with the database server you pay for in the fact that ColdFusion query of queries performs SQL calls slower than the Database. It becomes a bit of a tweaking and balancing act to see what performs better. I am sure that this is heavily influenced by the load on the server and the number of concurrent connections allowed to the database. In general though, I would certainly suggest erring on the side of FEWER database calls.
EDIT: Suggested by Rick Osborne - Inline Temp Table
Rick Osborne suggested doing a modification of the temp table. In his example (reproduced below), he makes his initial FROM statement an inline temp table comprised of those nested SELECT statements. This is basically what I was doing in my third example; the difference here is that Rick's does not require any additional variables - it is just part of the main query.
<!---
Build an array of IDs rather than a list of IDs (We
can iterate over an array faster than we can a list
and it is more index-friendly).
--->
<cfset arrID = ListToArray( "6,7,4,29,32,12" ) />
<!--- Query for all records. --->
<cfquery name="qBlog" datasource="#REQUEST.DSN.Source#">
SELECT
ids.id,
b.name,
b.date_posted
FROM
<!---
Our intial table will be a dynamic table created
from our sub-select. This is like creating an
in-memory table except we don't need to declare
any variables.
--->
(
<cfloop
index="intIndex"
from="1"
to="#ArrayLen( arrID )#"
step="1">
<cfif (intIndex NEQ 1)>
UNION ALL
</cfif>
SELECT
<cfqueryparam value="#arrID[ intIndex ]#" cfsqltype="cf_sql_integer" /> AS id,
<cfqueryparam value="#intIndex#" cfsqltype="cf_sql_integer" /> AS sort
</cfloop>
) AS ids
LEFT OUTER JOIN
blog_entry AS b
ON
b.id = ids.id
ORDER BY
ids.sort ASC
</cfquery>
<!--- Loop over the returned records. --->
<cfloop query="qBlog">
<!--- Output the row ID and name. --->
<p>
[ #qBlog.id# ]
#qBlog.name#
</p>
</cfloop>
Also notice that Rick went with an array of IDs rather than a list. This is a wise move for our purposes. It is faster to iterate over an array than it is a list. And, when using arrays, I think indexes feel more natural and perform quicker.
Want to use code from this post? Check out the license.
Reader Comments
Another alternative to this would be to do your second example, but instead of looping the query, loop the list and do a query of query pulling the record you need for that iteration. That should keep easy to read, give fairly good performance, and you avoid the complex query.
@Dustin,
Excellent suggestion. I can't believe I totally spaced on the Query of Queries as a possible solution (I love QoQ!). I will try to update this at lunch time.
When I could trust my input data, I have used dynamic SQL queries with an IN statement:
WHERE b.id IN (<cfoutput>#lstID#</cfoutput>)
The IN statement essentially just becomes a series of OR statements in the query.
Most RDBMSes enforce a maximum list size (probably around 256) and you might need to quote the elements in some lists (but it works as-is for numerics).
@Adam,
While I'm not entirely sure about the maximum number of items you can put in the IN statement. I've never had an error tossed from this and the docs for T-SQL states " IN (,...n)" which leads me to believe there isn't a limit (at least for MS-SQL).
That aside you should really use the <cfqueryparam list="yes"...>, this way you get the benefit of using bind variables. This will help prevent any injected SQL from being executed in your list and it will also speed the transaction considerably. As an added bonus you don't have to worry about quoting the values in your list. :D
@Dusin,
I am not sure that I have ever gotten a an error with IN and just #lstID#. However, I have gotten an error when I use CFQueryParam list=yes AND have a huge list. I think there is an upper limit to how many variables can be bound. But, this is not a limitation of the IN () clause, but rather the SQL server in general? Not sure...
Well that *did* sound like something interesting to explore. So I decided to do some testing and I did find out something interesting. If you try to do a QoQ with cfqueryparam and a 4000 item list it crashes CF.
Ooops....
Ha ha ha, nice :) I think if you try the same thing except directly in a SQL server query, it runs out of stack memory or comes back with an "unknown exception" or something (I can't quite remember what the deal is).
Very usefull information! Thank you very much!
For completeness, I should point out:
<cfset arrID=ListToArray(lstID)>
<cfquery name="qBlog" datasource="#REQUEST.DSN.Source#">
SELECT ids.id, b.name, b.date_posted
FROM (
<cfloop from="1" to="#ArrayLen(lstID)#" index="i">
<cfif i GT 1>UNION ALL</cfif>
SELECT <cfqueryparam cfsqltype="CF_SQL_INTEGER" value="#arrID[i]#"> AS id, #i# AS orderID
</cfloop>
) AS ids
LEFT OUTER JOIN blog_entry AS b ON (b.id = ids.id)
ORDER BY ids.orderID
</cfquery>
This has the upshot of not only being a single query, but also returning empty rows if you pass in a bogus ID. (In case you want to output some kind of error.) If you don't want to do that, you can always make it an INNER JOIN instead of a LEFT OUTER JOIN.
It is essentially the same as your second example with the temporary table, but it doesn't create a temporary table. So it would be more portable to smaller DBMSes.
-R
P.S. - If you are on DB2, which doesn't like SELECT without FROM, you can add "FROM sysibm.sysdummy1".
Thanks Ben this was what I finally did to make it work, my solution wasnt to complex.
<cfoutput>
<table cellpadding="0" cellspacing="0" width="800" align="center">
<cfloop list="#theList#" index="i">
<cfquery datasource="cvccvc" name="theOne">
SELECT *
FROM Thes
Where WheelID = #i#
</cfquery>
<tr>
<td>#theOne.Bio#</td>
</tr>
</cfloop>
</table>
</cfoutput>
@Dustin, @Rick,
I have added your two solutions to the post. Thanks for filling in the gaps.
@Wrighter,
As long as your stuff is performing well, I think that's a fine way to go. If you find that it causes an issue later, we can always optimize.
With SQL Server 2005 I believe the limit is 2100 query parameters.
just if anyone was interested, quoted from dbforums:
Actually, this was true for versions earlier than 7.0. Refer to the topic Maximum Capacity Specifications in Books Online. The maximum batch size is listed as 65,536 * Network Packet Size. The default packet size is 4K, so that's pretty big.
http://www.dbforums.com/archive/index.php/t-366606.html
There's another way of doing this in ColdFusion (and probably other languages). You can retrieve a specific cell from your query by using the format QueryName['ColumnName'][RowNumber].
So after your query you might do something like:
<cfloop list="#LstID#" index="i">
<cfset DesiredQueryRow = ListFind(ValueList(qBlog.ID), i) />
<cfif DesiredQueryRow>
<!--- Matching record returned... --->
<p>
[ #qBlog['id'][DesiredQueryRow]# ]
#qBlog['name'][DesiredQueryRow]#
</p>
</cfif>
</cfloop>
When I go looking for ColdFusion answers, I always end up here.
Ben, thanks again for another great explanation and example.