Join Gerry O'Brien for an in-depth discussion in this video Filter duplicates with DISTINCT, part of Querying Microsoft SQL Server 2012.
- View Offline
Often times, even in a single table, duplicate values may exist in the records and in the results returned. The most commonly, you will find this one querying multiple tables with joins and we'll cover that in a chapter on querying multiple tables. Lets focus on trying to filter duplicates with the DISTINCT keyword in SQL server. Now, in a normalized database, that is one that follows rules of normalization, duplicate rows in a single table should not exist due to use of a component known as a primary key.
Now as I said we'll discuss normalization more when we get into the chapter on querying multiple tables. So we don't need to dig into it too deeply just yet, but know that for the most part you might end up with duplicate values in a returned set of records if you're only choosing a subset of the fields. So in other words, if we were to return all of the fields in the database or return all records and just use these select stars in example We wouldn't necessarily notice the duplicate values.
But when we start dropping down in the number of columns as an example that we return, the potential for bringing or seeing the duplicate values really comes to light. And so, in order to better understand how DISTINCT could help us with this. Let's take a look at a couple of samples in SQL server. So assume that your manager has asked you to investigate how many different types of persons exist in the database. So your first search to return all the records in the table will present you with all the fields that exist.
So select star from person.person is a query that we execute. And we notice that from the columns of the fields that we have in our table. We have a Business Entity ID, a Person Type, and then we have other different columns. For Title, First Name, et cetera. Now what we're attempting to do is find out how many different Person Types. So this column is the important aspect. How many different types exist. And the, the problem is that this query doesn't really tell us that right now. And by selecting Star from Person and getting all of the column values.
We see that we've returned 19,972 rows. For you to have to sift through all of those different person types and count the different ones you see, that's a task that not everybody is really interested in doing. So you think well okay, lets change my thought process then. I'm only interested in the person type. So why don't I just return person type? The problem is, because we've filtered our fields down to only the one person type, we're now returning that person type again.
We still have 19,972 rows. So we haven't really gotten to the crux of the problem or what it is we're specifically asking for. We see that we have the Person Type column return but the problem is there's just still way to many rows for us to filter through. We see duplicates. So, we see more than duplicates, actually, we've got multiple entries of EM for the employee type. So, how do we get around that? Well, SQL provides us with the DISTINCT keyword. Let's execute this query and see what happens.
Oh, now we only have six rows. So what the DISTINCT Clause is telling SQL Server is, it's basically saying, I want you to look in the Person Type field in the Person table. And I want you to return Person Type. But only DISTINCT values. On other words, don't give me all of the rows. I only want to see the DISTINCT Person Type entries. So SQL then actually goes in and says I see an IN. I see an EM, an SP, an SC, a VC, and a GC.
Those exist multiple times in the table, but you've told me you only want to see DISTINCT. Or, in other words, one single entry of each type that exists in that table. So this is where DISTINCT helps you reduce those duplicate values that you're actually seeing. With the return results set from your single table. Again, where duplicates exist. So, what happens if we add more fields to that query? So, in other words, we just didn't want person type but we wanted to add in maybe the email promotion that we specify for those, right.
So, you know, management might have come to you and said, hey, we send out email promotions to different types of people in the database. Can you tell me what those email promotions are and how do we send them to those different types of people. Sure we can generate that query, let's execute it and we see that we have person type and email promotion return for columns or in the attributes from our table and we see that we have a person type column the VC, GC, SC, EM, SP, IN, and wait a minute now.
I'm now seeing EM here again. Didn't you say that DISTINCT eliminates duplicate values? So if we look down we see that we see GC and VC a couple times as well as EM. So why is it then, that we're actually seeing duplicate values when we have specified the DISTINCT keyword on the person type. And this is where a true understanding of the data that you have in your database really plays a factor in how you understand the use of some of these keywords and the results that you're seeing returned. Because we're asking for two fields here now, what SQL is doing is it says I will return the DISTINCT person type for you.
But you've also added in an extra field so realistically what I'm going to do is return the DISTINCT person type per email promotion. What does that look like? Well, as an example, let's drop out the DISTINCT key word here for just one second. Before we execute the query note that we had 18 rows return on those two values with the DISTINCT keyword. Let's go ahead and execute this portion of the query one more time. We have 19,972 rows returned. So in other words, all of the rows that were in the person table were still returned.
So there was no DISTINCT filtering built in to this at all. We just simply said give me these two columns. It's a normal query so it doesn't change anything. But when we add DISTINCT back here again, SQL then takes a look at the total query statement that we have executed and it returns the DISTINCT person type by email promotion. Let's look at the results and see what that means. So we have the VC for Email promotion zero, if the DISTINCT keyword is functioning correctly, we should not see another VC and zero email promotion email promotion combination.
Here's a VC and a one, here's a VC and a two, and that's it. And that's it. So we have no more VC's in there. So in other words we have three different email promotion types, a zero, a two, and a one. We will see a representation if it exists in the data, of each one of these person types but unique across email promotion type. So DISTINCT doesn't always give you only a single return value. It depends on how many columns you have selected as to how DISTINCT is applied across those.
Later on, when we start focusing on Grouping By, you'll see how this will actually be even more visually apparent because we can group by e-mail, promotional, person type and it makes your results look a little bit better. We have one more thing we want to look at before we move on from DISTINCT. What if I decided I wanted to get a DISTINCT email promotion and not the person type? Well, SQL server complains and tells us that there's an incorrect syntax near DISTINCT. And if we ignored it and tried to execute the query, it's still going to give us the same error message.
Why is it doing that? Simply because DISTINCT must exist on the first field that you specify in your query. So it works fine here but we can't place it in any place else. So again, thinking about what we've just looked at the DISTINCT clause helps you to filter or reduce the number of duplicate records returned. Notice I say reduce because it won't necessarily eliminate them as we saw in that last query that we executed. DISTINCT still doesn't guarantee that no duplicates will be returned.
if the data dictates otherwise, and in this case is the inclusion of multiple fields. So again, kind of play around with DISTINCT, see how it breaks out the values. Sometimes it requires you to maybe dig a little deeper and go in and look at the data stored in the database. Just to verify that it's working correctly, but don't expect there to never be duplicates coming back with this state depending on the use case scenario.
- Writing SELECT queries
- Querying multiple tables
- Filtering text and duplicates
- Sorting and grouping query results
- Using SQL Server's built-in functions
- Writing subqueries
- Using common table expressions
- Programming with T-SQL
- Interpreting query performance data