From the course: SAS Programming for R Users, Part 2

The CORR and FREQ procedures - SAS Tutorial

- [Instructor] In this section, I'll introduce four different procedures to analyze variables and generate summary statistics. We'll reproduce the cor and cov functions in R, as well as the table function for frequency tables of classification variables. We'll also generate the qqnorm plot, which is not in PROC SGPLOT, and we'll compute summary statistics like mean, median, mode, range, and so on. These will be applied to an entire column, or variable, of your data set. Remember, we used functions in the DATA step, and they were only applied across rows. The procedures in this section operate on the entire variable.
First, PROC CORR does exactly what you'd expect: it makes a correlation matrix. In the VAR statement of PROC CORR, I'll just list all the variables I want in my correlation matrix. In this case, for the cars data set, I want horsepower, weight, and length. If I run this procedure, I'll get the following output. You'll notice, on the right, we have the correlation matrix, and there are two values in each cell. The first is the estimated correlation. The second is the p-value for the hypothesis test that the population correlation coefficient is zero. So, for example, my correlation coefficient between horsepower and weight is 0.63, and my p-value is less than 0.001, meaning it's highly significant.
Also by default, we get the simple statistics table. So we get the number of observations, N, the mean, standard deviation, sum, minimum, and maximum for those three variables as well. In just a few minutes, I'll show you a better way to customize the simple statistics that you want to see. If you tack on the COV option in the PROC CORR statement, then in addition to the previous tables we get the covariance matrix, which is the same as the cov function in R. There are many other options you can specify in these summary statistics procedures.
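As a minimal sketch, the two PROC CORR calls described above might look like the following. It assumes the cars data set is the sample table sashelp.cars that ships with SAS; the transcript only says "the cars data set," so substitute your own library and table name if yours lives elsewhere.

```sas
/* Assumption: the "cars" data set is sashelp.cars (not stated in the transcript). */

/* Correlation matrix plus the default simple statistics table */
proc corr data=sashelp.cars;
    var horsepower weight length;
run;

/* Same call with the COV option, which also prints the covariance matrix
   (the counterpart of cov() in R) */
proc corr data=sashelp.cars cov;
    var horsepower weight length;
run;
```

The only difference between the two steps is the COV option in the PROC CORR statement; the VAR statement is unchanged.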
Next, when we're working with categorical data, we'll use PROC FREQ to create frequency tables. Instead of the VAR statement, we'll use the TABLES statement, and simply list all the one-way frequency tables we want to generate. Here I'll generate two separate tables for origin and type, and if we run this procedure, we get the following tables. We, of course, get the frequency for each level of each variable, and by default we also get the percentage of observations in that level, as well as the cumulative frequency and cumulative percent. Now, if you want to reproduce your tables exactly like you'd see them in R, you can use options after the forward slash in the TABLES statement. Specifically, the NOCUM and NOPERCENT options get rid of those columns in the tables.
If you want to do a cross-tabulation, you simply cross your variables in the TABLES statement with the star operator. If I ran this procedure, I would get the following output. I'm crossing origin and type, and in each cell I have the frequency, percent, row percent, and column percent. So, for example, all three hybrid vehicles came from Asia, and that corresponded to only 0.7% of our data. On the bottom and far right of the table, we get the totals. Just like we saw before, we can control the output with some options. Specifically, we can suppress the row percentages, the column percentages, the overall percentages, and the frequencies, if we want. It's unlikely that you'll use the NOFREQ option, but you can if you'd like. So, here I'm reproducing the table function exactly like you'd see it in R. In my TABLES statement, I'm crossing origin and type. After the forward slash, I'm specifying NOROW, NOCOL, and NOPERCENT, so all I have in each cell are the frequencies.
Previously, with PROC SQL, we showed you how to print the unique levels of a variable, but perhaps there are hundreds, maybe even thousands, of levels in a specific variable.
What if I just want to print the number of levels in each variable? Well, I can use PROC FREQ, and in the PROC FREQ statement I'll use the NLEVELS option. That prints the following table, Number of Variable Levels. Origin, of course, has only three levels, and type has only six. And if you don't want to print the original frequency tables, you can use the NOPRINT option in the TABLES statement.
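The PROC FREQ steps described above can be sketched roughly as follows. As before, this assumes the cars data set is sashelp.cars, which the transcript does not state explicitly; the variable names origin and type come from the narration.

```sas
/* Assumption: the "cars" data set is sashelp.cars (not stated in the transcript). */

/* One-way frequency tables with the default columns:
   frequency, percent, cumulative frequency, cumulative percent */
proc freq data=sashelp.cars;
    tables origin type;
run;

/* Frequencies only, closer to what table() prints in R */
proc freq data=sashelp.cars;
    tables origin type / nocum nopercent;
run;

/* Cross-tabulation of origin by type, keeping only the cell frequencies */
proc freq data=sashelp.cars;
    tables origin*type / norow nocol nopercent;
run;

/* Number of levels per variable; NOPRINT suppresses the frequency tables */
proc freq data=sashelp.cars nlevels;
    tables origin type / noprint;
run;
```

Options after the forward slash apply to every table requested in that TABLES statement, so a second TABLES statement is the usual way to give different tables different options within one PROC FREQ step.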
