Tuesday, March 17, 2009

Creating a SQL Exclusion Table Using LEFT JOIN

For a current project I needed to create a table that contained a list of exclusions that should be filtered out from another table when a specific set of criteria were met. This is a pretty common scenario where you have a large table of data and you want to be able to provide the ability through a user interface for a user to filter the data using an exclusion list.

I have access to an amazing SQL guru, Peter Loos, who was able to help me brainstorm the best approach for altering existing stored procedures to include a filter for information included in the new exclusion table I added for this feature. During the course of speaking with Peter I learned the following:

!= operator is being deprecated, we need to use <> from here on out.

TSQL NOT IN performance sucks; a JOIN should be used in place of NOT IN if at all possible.

My initial solution was do modify the existing stored procedures which returned the list that I needed to filter with my exclusion table by adding a WHERE clause that included a NOT IN (SELECT…) statement. After speaking with Peter it became clear that NOT IN could cause performance problems down the road, so I went down the path of figuring out how to do this with a JOIN statement.

Example Exclusion Table

In this example of how to setup an exclusion table in SQL, we will use an address book example. Lets say I have an address book of all of the people I know and their contacts like the following:

AddressBook Table

SQL Table Definition


                            ,FullName varchar(100) NULL

                            ,Phone varchar(20)    NULL

                            ,Location varchar(200) NULL)

INSERT INTO #AddressBook VALUES (1, 'Jack Bauer', '853-234-5968', 'Los Angeles, CA')

INSERT INTO #AddressBook VALUES (2, 'John Stewart', '451-856-4468', 'New York, NY')

INSERT INTO #AddressBook VALUES (3, 'Jim Cramer', '452-938-7228', 'New York, NY')

INSERT INTO #AddressBook VALUES (4, 'Stephen Colbert', '451-467-3333', 'New York, NY')

INSERT INTO #AddressBook VALUES (5, 'Bernie Madoff', '609-611-4468', 'Federal Prison, PA')


SELECT * FROM #AddressBook



NotMyFriends Table

Now lets say I want to filter my address book down to a list of just people I consider my friends, so I create an exclusion table and enter those contacts that I DON’T consider to be friends.

SQL Table Definition


                            ,AddressBookId INT)

INSERT INTO #NotMyFriends VALUES (1,3)

INSERT INTO #NotMyFriends VALUES (2,5)


SELECT * FROM #NotMyFriends



Filtered Resultset

So the end goal of creating the NotMyFriends exclusion table is to be able to filter the AddressBook table, removing anyone listed in the NotMyFriends table so my final resultset is a list of contact information for just my friends.

NOT IN Solution

The first way I thought of doing this was to use a NOT IN statement, which filtered the results just like I wanted. The only problem with this solution (as I learned from Peter Loos) was the NOT IN statement performance sucks.



FROM #AddressBook AS MyFriends

WHERE MyFriends.AddressBookId NOT IN (SELECT AddressBookId from #NotMyFriends)



JOIN Solution

So now I wanted the exact same results as I got using the NOT IN statement, but I want to use a JOIN instead to improve the performance of my query.



FROM #AddressBook AS MyFriends

LEFT JOIN #NotMyFriends

ON MyFriends.AddressBookId = #NotMyFriends.AddressBookId



We are almost there, except the filter didn’t work. The LEFT JOIN combined our full table of addresses & our exclusion table list, but it didn’t filter the results. Looking at the results above we need to add a WHERE clause to our LEFT JOIN.

SQL JOIN Query with WHERE Clause


FROM #AddressBook AS MyFriends

LEFT JOIN #NotMyFriends

ON MyFriends.AddressBookId = #NotMyFriends.AddressBookId

WHERE #NotMyFriends.AddressBookId IS NULL



Query Script Example Download

As you can see, the final SQL JOIN query that included the WHERE clause gets us the filtered results we wanted from the exclusion table contents and performs well to.

I created a sample TSQL script that uses temp tables to demonstrate the queries I posted above. You can download it here:

Aaron Schnieder



bourgon said...

Provided I'm reading it right, there's an easier way: NOT EXISTS. Performance-wise it should perform like that of the join, rather than that of NOT IN.

select * from a where NOT EXISTS (select * from b where a.key = b.key)

Adam Anderson said...

I believe this is called a Set Difference algorithm.

I've seen several approaches, but your approach, the JOIN + IS NULL on the LEFT JOINed table, is my preference. I suspect your approach would be faster than the NOT EXISTS sub-query approach as it doesn't require a per row subquery and NOT EXISTS can be very slow on large tables, though I haven't run any performance tests myself.

On a related note, you can also get the set difference of two list with Linq using Except().

Anonymous said...

Great Post! Im 90% what I was trying to do. Basically I have
a master [AddressBook] table that
5 people use. So I have a [Users] table. I also have a master [Exclusion] table. The [AddressBook] table has no reference to an individual user, all users would see all address.
If any user desires not to see an
individual name, I want to put that in the [Exclusion] table.