From the course: SAS Essential Training: 1 Descriptive Analysis for Healthcare Research

Recoding a grouped variable - SAS Tutorial

- [Instructor] Back in SAS, in the last chapter, we read in our big BRFSS data set and then carefully removed all the rows that were not part of our target population. Our next step is to follow our data dictionary, which is our plan for adding the grouping variables and indicator variables that we will go on to use in our descriptive and regression analyses. The first variable we will add is going to be called DIABETE4, so you will see I have the code file open, named 125_Create DIABETE4. I have included this code in your exercise files for this movie.
So this is the goal: create a recode of DIABETE3 in DIABETE4. In other words, we don't like the native coding of the native grouping variable, DIABETE3, so we want to create our own variable, DIABETE4, and collapse the categories the way we want them. Let's refresh our memories by opening up our data dictionary again. Here's the same data dictionary from earlier in the course. Here we are on the tab for DIABETE3, which documents how we are going to code both grouping and indicator variables based on the native variable DIABETE3. And here is the problem: the top two levels both mean yes, and we want them collapsed. We also want to put the two no levels together, and to collapse "don't know/not sure" with "refused." This is what I mean by turning DIABETE3 into DIABETE4. This is the coding we wish we had, so we will put it in DIABETE4.
Let's go back to SAS and do that. So here, we start by saying data r.BRFSS_g, because we are outputting BRFSS_g, and we say set r.BRFSS_f, because that's the current version of our transformed data set. When we write out BRFSS_g, it will have all the variables we kept with our keep statement earlier, plus the new, improved variable, DIABETE4. So within this data step, we will create DIABETE4. See?
Here's me starting the painstaking process of creating DIABETE4 and getting it coded properly based on the values in DIABETE3. I start by creating DIABETE4 and setting its value to 9, which means unknown, for every respondent, regardless of the record. You will see below that I then replace that 9 with the correct code based on the coding in DIABETE3.
The next line modifies the 9 in DIABETE4 based on criteria in DIABETE3. You'll recognize my use of if here: if DIABETE3 in (1, 2) then DIABETE4 = 1. So what does this mean? It means that if DIABETE3 has a value of either 1 or 2, then the 9 in DIABETE4 should be overwritten with a 1. Notice that the actual line of code starts with the if and ends with the semicolon after the 1. This is the next line that will be executed in the data step.
And then we move on. I used a different syntax in this line simply because I wanted to demonstrate using the pipe character for or. In the previous if statement, we used in to make a list of two values of DIABETE3 that would qualify DIABETE4 to receive a 1. Here, we have two values of DIABETE3 that would qualify DIABETE4 to be coded as a 2, but this time we are going to say if DIABETE3 = 3 or DIABETE3 = 4, then DIABETE4 = 2. Notice the parentheses around the whole statement. You need those.
So we create DIABETE4 by setting all records to 9, then we update the records with values indicating yes in DIABETE3 to 1, and those with values indicating no in DIABETE3 to 2. This is followed by a run command. Let's highlight and run the data step. Okay, now we are left wondering whether we coded our new variable, DIABETE4, correctly. Let's look at this PROC FREQ code down here.
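The on-screen code is not part of this transcript, so the following is only a sketch of the data step as it is narrated. It assumes the libref `r` is already assigned and that the data set names are spelled `BRFSS_f` and `BRFSS_g`, as they sound in the narration.

```sas
/* Sketch reconstructed from the narration; the libref r and the
   data set names BRFSS_f and BRFSS_g are assumed spellings. */
data r.BRFSS_g;
    set r.BRFSS_f;

    /* Start every record at 9 (unknown). */
    DIABETE4 = 9;

    /* The two "yes" levels of DIABETE3 collapse to 1, using IN. */
    if DIABETE3 in (1, 2) then DIABETE4 = 1;

    /* The two "no" levels collapse to 2, using the pipe for "or". */
    if (DIABETE3 = 3 | DIABETE3 = 4) then DIABETE4 = 2;
run;
```

Any record whose DIABETE3 value is not 1 through 4 keeps the starting value of 9, which is how "don't know/not sure" and "refused" end up grouped together. The outer parentheses in the second if statement are kept to match the narration; as far as SAS syntax goes, the condition should also run without them.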
Up until now, we've been doing one-way frequencies, but now we want to do a crosstab between DIABETE3 and DIABETE4 to see whether DIABETE4 was coded properly. See how we do this: we say table DIABETE3 * DIABETE4. We use an asterisk to indicate "by." The rest of the code is the same. Okay, let's highlight and run that.
Wow, too much information. What's in this table? Oh, here it says the first line in every cell is a frequency. Then we have the percent of the data set, the percent of the row, and the percent of the column. We only care about the frequency, so let's go back to the code and add options to make this easier to look at. See down here? I added another option, the list option, before the missing option. That will print the output as a list. Let's highlight this and run.
Much better. Okay, we can see DIABETE4 is correct. Those coded as 1 or 2 in DIABETE3 are indeed 1 in DIABETE4, and those coded 3 or 4 in DIABETE3 are now 2 in DIABETE4. You are already good at grouping variables, so you are ready for a challenge. Join me in the next movie, where we make some smoking variables.
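The crosstab check described in this movie might look roughly like the sketch below. The on-screen code is not shown in the transcript, so the data set name `r.BRFSS_g` is an assumed spelling taken from the narration, and any other statements in the instructor's PROC FREQ are left out.

```sas
/* Sketch of the crosstab check; r.BRFSS_g is an assumed spelling. */
proc freq data=r.BRFSS_g;
    /* The asterisk crosses DIABETE3 by DIABETE4.
       LIST prints one row per combination instead of a grid of cells;
       MISSING treats missing values as a level of their own. */
    tables DIABETE3 * DIABETE4 / list missing;
run;
```

In the list output, each value of DIABETE3 should appear with exactly one value of DIABETE4. A value of DIABETE3 paired with two different DIABETE4 values would point to a mistake in the recode.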
