From the course: SAS® 9.4 Cert Prep: Part 03 Exploring and Validating Data

Demo: Identifying and removing duplicate values - SAS Tutorial

From the course: SAS® 9.4 Cert Prep: Part 03 Exploring and Validating Data

Demo: Identifying and removing duplicate values

- [Instructor] Let's use the NODUPRECS, and NODUPKEY options, in PROC SORT, to identify and remove duplicates. We'll start by looking at the storm detail SAS table. So we called this is the data set that includes multiple rows per storm. A few things to know about the data. We have measurements every six hours. It's possible that there may be some rows within this data, that are entirely duplicated. In that situation we'd like to remove those extra rows, that are entirely duplicated, all the way across for each column. We'd also like to filter this data, in a way that we include one row per storm. Each storm is uniquely identified by the season, the basin, and the name. It's possible that names are reused over the different seasons. So by identifying each unique storm, we can then keep only the first row within each storm, and output that to a different table. We can use PROC SORT to accomplish both of these tasks. I'll close the storm detail table, and come to my program. So in this first PROC SORT step, we're sorting that storm detailed table, and created a copy using the out equal option, that will be a temporary table named storm underscore clean. In this first step, let's eliminate any entirely duplicated rows. So we use the NODUPRECS option, and if there are any duplicate rows that are removed, I'd like to write them to an output table. So I'll add the DUPOUT option, and name the table storm underscore dups. Now remember the NODUPRECS option, removes adjacent rows that are entirely duplicated. We have to make sure those duplicate rows are next to each other. And the way we do that is sorting by all of the columns. So on the by statement, I'll use underscore all underscore, as a shortcut. I'll run that first PROC SORT step. I'm looking to start with, at the storm clean table. So this has all of the duplicated rows removed. I can also examine the Storm Dups table, and I that there were 214 entirely duplicated rows, that were removed. So storm clean is in a much better shape. Let's go back to the code. Let's tale a look at what's in step two. So notice in step two we're also reading that storm detail table. This time, creating out output table named Min Pressure. We're filtering the data, to include only those rows where pressure is not missing, and name is not missing. Notice how we're sorting the data in this instance. We're sorting by descending season. So from 2016 down to 1980, and them we'll sort within season, by ascending basin, name, and then pressure. Let's run that step and see how the data looks. So here's our 2016 season. We notice we have basin EP. And within basin EP, we're sorted by name. Now, at the very top, remember the last column we sorted by was pressure. So the first row for this unique storm, Agatha, in the Eastern Pacific basin, for the 2016 season, the minimum pressure was 1002. If I scroll down to the next storm, again, the first row represents the minimum pressure recorded for this storm, in 2016, EP, Blas. So what if I want to take this table, and keep only the first row for each unique storm? Which would represent the minimum pressure? We're interested in the lowest measurement for pressure for each storm, because that's an indication of the storm intensity. So back in the program, let's do the third PROC SORT step. We're going to sort the min pressure table, which was the output table from the previous PROC SORT step. This time I'll sort by descending season, basin, and name. Now, notice that matches the same sequence that we have in step two, minus the pressure column that we have at the end. But I would like to keep only the first occurrence for each unique value, of season, basin and name. And I can do that by adding the NODUPKEY option. And I'll run the third PROC SORT step. And notice we have one row per storm, and pressure represents the minimum pressure recorded, from that storm detail table. There are many different reasons why you might want to sort your data. It might be just for display. It might be for further analysis. You can see with PROC SORT you have many options, to decide which rows you want to include or exclude.

Contents