For a current project I needed to create a table that contained a list of exclusions that should be filtered out from another table when a specific set of criteria were met. This is a pretty common scenario where you have a large table of data and you want to be able to provide the ability through a user interface for a user to filter the data using an exclusion list.
I have access to an amazing SQL guru, Peter Loos, who was able to help me brainstorm the best approach for altering existing stored procedures to include a filter for information included in the new exclusion table I added for this feature. During the course of speaking with Peter I learned the following:
!= operator is being deprecated, we need to use <> from here on out.
TSQL NOT IN performance sucks; a JOIN should be used in place of NOT IN if at all possible.
My initial solution was do modify the existing stored procedures which returned the list that I needed to filter with my exclusion table by adding a WHERE clause that included a NOT IN (SELECT…) statement. After speaking with Peter it became clear that NOT IN could cause performance problems down the road, so I went down the path of figuring out how to do this with a JOIN statement.
Example Exclusion Table
In this example of how to setup an exclusion table in SQL, we will use an address book example. Lets say I have an address book of all of the people I know and their contacts like the following:
AddressBook Table
SQL Table Definition
CREATE TABLE #AddressBook(AddressBookId INT PRIMARY KEY
,FullName varchar(100) NULL
,Phone varchar(20) NULL
,Location varchar(200) NULL)
INSERT INTO #AddressBook VALUES (1, 'Jack Bauer', '853-234-5968', 'Los Angeles, CA')
INSERT INTO #AddressBook VALUES (2, 'John Stewart', '451-856-4468', 'New York, NY')
INSERT INTO #AddressBook VALUES (3, 'Jim Cramer', '452-938-7228', 'New York, NY')
INSERT INTO #AddressBook VALUES (4, 'Stephen Colbert', '451-467-3333', 'New York, NY')
INSERT INTO #AddressBook VALUES (5, 'Bernie Madoff', '609-611-4468', 'Federal Prison, PA')
SELECT * FROM #AddressBook
Resultset
NotMyFriends Table
Now lets say I want to filter my address book down to a list of just people I consider my friends, so I create an exclusion table and enter those contacts that I DON’T consider to be friends.
SQL Table Definition
CREATE TABLE #NotMyFriends(NotMyFriendsId INT PRIMARY KEY
,AddressBookId INT)INSERT INTO #NotMyFriends VALUES (1,3)
INSERT INTO #NotMyFriends VALUES (2,5)
SELECT * FROM #NotMyFriends
Resultset
Filtered Resultset
So the end goal of creating the NotMyFriends exclusion table is to be able to filter the AddressBook table, removing anyone listed in the NotMyFriends table so my final resultset is a list of contact information for just my friends.
NOT IN Solution
The first way I thought of doing this was to use a NOT IN statement, which filtered the results just like I wanted. The only problem with this solution (as I learned from Peter Loos) was the NOT IN statement performance sucks.
SQL NOT IN Query
SELECT * FROM #AddressBook AS MyFriends
WHERE MyFriends.AddressBookId NOT IN (SELECT AddressBookId from #NotMyFriends)
Resultset
JOIN Solution
So now I wanted the exact same results as I got using the NOT IN statement, but I want to use a JOIN instead to improve the performance of my query.
SQL JOIN Query
SELECT * FROM #AddressBook AS MyFriends
LEFT JOIN #NotMyFriends
ON MyFriends.AddressBookId = #NotMyFriends.AddressBookIdResultset
We are almost there, except the filter didn’t work. The LEFT JOIN combined our full table of addresses & our exclusion table list, but it didn’t filter the results. Looking at the results above we need to add a WHERE clause to our LEFT JOIN.
SQL JOIN Query with WHERE Clause
SELECT *
FROM #AddressBook AS MyFriendsLEFT JOIN #NotMyFriendsON MyFriends.AddressBookId = #NotMyFriends.AddressBookIdWHERE #NotMyFriends.AddressBookId IS NULL
Query Script Example Download
As you can see, the final SQL JOIN query that included the WHERE clause gets us the filtered results we wanted from the exclusion table contents and performs well to.
I created a sample TSQL script that uses temp tables to demonstrate the queries I posted above. You can download it here:
Aaron Schnieder
http://www.churchofficeonline.com
2 comments:
Provided I'm reading it right, there's an easier way: NOT EXISTS. Performance-wise it should perform like that of the join, rather than that of NOT IN.
select * from a where NOT EXISTS (select * from b where a.key = b.key)
I believe this is called a Set Difference algorithm.
I've seen several approaches, but your approach, the JOIN + IS NULL on the LEFT JOINed table, is my preference. I suspect your approach would be faster than the NOT EXISTS sub-query approach as it doesn't require a per row subquery and NOT EXISTS can be very slow on large tables, though I haven't run any performance tests myself.
On a related note, you can also get the set difference of two list with Linq using Except().
Post a Comment