From the course: Descriptive Healthcare Analytics in R

Reviewing categorical variable distribution - R Tutorial


- [Narrator] Welcome to chapter four, section five, where we review our categorical variable distributions. In the last movie, I showed you the table shells I set up: one for our categorical outcome of asthma, and one for our continuous outcome of sleep duration. In this lecture, I'm going to show you how to look at the distribution of the asthma variable alone, which is univariate analysis. Then, I'm going to show you how that relates to the exposure, drinking alcohol, which would be a bivariate analysis.

Let's go over to R. Here is the beginning of our 200 analysis code. The first step is to read in our analytic data set. So, see, I used the read.csv command to read in the analytic data set file here, and I call it analytic. Let's highlight and do Control + R to run that line of code. Great, it read it.

Next, notice I use the simple table command to get the one-way frequency of ASTHMA4. But this time, I used the assignment arrow to make a data frame, named the lovely name of AsthmaFreq. As we analysts say, now we are getting freaky. But in all seriousness, this code just makes a data frame object out of our table results. After that, I can use write.csv to make a CSV file of that table that will be written out to our data folder. Then I can go to our data folder, open up the CSV, and copy out the numbers. But we can also see them here, because I included a line that just names the data frame, which makes it print in the console window. Let's run all this code.

But we are specifically concerned about the distribution of the categorical variable, so let me calculate a percentage. See the numbers from our table in the console? I can just do a division operation and create an object, which is just one value this time, like a list with only one thing in it, called PropAsthma. Let's run this and look at PropAsthma. Highlight and Control + R. Yeah, about 10% of our people have asthma. That's good for regression, which is the next course.
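The univariate steps just described can be sketched roughly as follows. This is a minimal illustration, not the course's actual code: the tiny data frame stands in for the real analytic file (which the course loads with read.csv), the 0/1 coding follows the narration, the output path is a temporary folder rather than the course's data folder, and the explicit as.data.frame call is one assumed way of turning the table into a data frame.

```r
# Hypothetical stand-in for the analytic data set (the course reads the real
# file with read.csv). ASTHMA4 is coded 0 = no asthma, 1 = asthma.
analytic <- data.frame(ASTHMA4 = c(rep(0, 18), rep(1, 2)))

# One-way frequency of ASTHMA4, stored as a data frame
AsthmaFreq <- as.data.frame(table(analytic$ASTHMA4))
names(AsthmaFreq) <- c("ASTHMA4", "Freq")

# Write the table out as a CSV (temporary folder here, as an assumption)
out_file <- file.path(tempdir(), "AsthmaFreq.csv")
write.csv(AsthmaFreq, out_file, row.names = FALSE)

# Naming the data frame on its own line prints it to the console
AsthmaFreq

# Proportion with asthma: count of 1s divided by the total count
PropAsthma <- AsthmaFreq$Freq[AsthmaFreq$ASTHMA4 == "1"] / sum(AsthmaFreq$Freq)
PropAsthma
```

Computing the proportion from the stored counts, rather than retyping numbers read off the console, keeps the result in step with the data if the file changes.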
You don't want the outcome to be too rare, and 10% or greater is good. You get iffy below that, and you should think about a different study design if that happens. But we are good to go.

Let's visualize what we just found. I created an Excel spreadsheet with two tabs to document my variable distribution results. These are private spreadsheets, not things you publish. Not like the Table 1s. These are just for your own consumption to help you out. See how I have this blank chart and the space to put our numerical results? I'll show you what to do.

Let's go to our data folder. Here's AsthmaFreq, the table we just wrote out. Let's open it. See how we have the number of zeros, which is no asthma, and the number of ones, which is having asthma? We can put these numbers on our Excel spreadsheet. This is everyone in the data set. We knew about this. We knew the prevalence of asthma was about 10% in this data set of veterans. But let's make a pie chart anyway, just to visualize it. I'm going to highlight just the frequencies and do Control + C for copy. We will now click on the top cell, next to No, and do Control + V to paste. That looks about right.

Okay, our next order of business is looking at asthma, the categorical outcome, against the exposure, which is alcohol group. So, we need a crosstab. Let's go back to R. Here we've got almost the same code, but we named it AsthmaAlcFreq, as we added the alcohol grouping variable in there. You'll see why we made all these grouping variables in the last chapter. It's to facilitate making Table 1. Let's highlight and Control + R to run this code. Great.

Okay, let's open up the CSV that was the output and then copy our info onto our spreadsheet to make a chart like we did last time. Here it is in our data folder. Let's open it up. See how we have the one, two, three across the top for our three levels of alcohol consumption? Let's just highlight the cells with the frequencies and do Control + C to copy.
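A crosstab like the one described above might look something like this sketch. The transcript does not give the name of the alcohol grouping variable, so ALCGRP is a hypothetical name, the twelve rows are made-up stand-in data, and the file is written to a temporary folder instead of the course's data folder.

```r
# Hypothetical stand-in data. ALCGRP is an assumed name for the three-level
# alcohol grouping variable (1, 2, 3); ASTHMA4 is coded 0/1.
analytic <- data.frame(
  ASTHMA4 = c(0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0),
  ALCGRP  = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
)

# Two-way frequency: asthma status in rows, alcohol group in columns
AsthmaAlcFreq <- table(analytic$ASTHMA4, analytic$ALCGRP,
                       dnn = c("ASTHMA4", "ALCGRP"))
AsthmaAlcFreq

# as.data.frame() on a two-way table returns long format (one row per cell);
# as.data.frame.matrix() keeps the wide layout with 1, 2, 3 across the top
AsthmaAlcWide <- as.data.frame.matrix(AsthmaAlcFreq)

# Write the wide table out as a CSV (temporary folder here, as an assumption)
out_file <- file.path(tempdir(), "AsthmaAlcFreq.csv")
write.csv(AsthmaAlcWide, out_file)
```

Keeping the alcohol groups across the top matches the layout the narrator copies into the spreadsheet, so the frequencies can be pasted as one block.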
Here we are back at our private visualization spreadsheet. Again, let's place our cursor in the upper left cell of the table and do Control + V. This will fill out our graph, and then we can scroll down and look more closely. I prepared this Excel sheet beforehand, but you can always make these graphs yourself in Excel or in any other graphing program. And you artists could probably make a way more beautiful one manually.

But let's review what this chart says. Well, it doesn't look good for our hypothesis so far. Look at how the non-drinkers seem to have lower rates of asthma than both the drinking groups. This always happens to me. Alcohol seems to often look good in these data sets when it shouldn't. You probably heard that on the news. It happens a lot to us. Okay, it's a dangerous variable, but good thing we looked at the raw bivariate distribution between exposure and outcome. Maybe they are related, but not the way we hypothesized.

Now we are done with our code for checking asthma, so you will see I saved this code as 200_Check asthma. So the steps we are taking now are to evaluate the distribution of our two outcome variables with our exposure variable. In this movie, we did it with asthma and alcohol. And in the next section, we will look at our continuous outcome, sleep duration, and alcohol.

In this section, we went over how to do a univariate and bivariate analysis of our categorical outcome, which is asthma. We outputted CSVs, opened them up, and copied and pasted the results into a private spreadsheet, which serves to document our outcome variable distributions. So let's move on to the next section, where we look at the distribution of our continuous variable, sleep duration.
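The by-group comparison read off the chart in this section could also be checked numerically in R. The sketch below is only an illustration: the counts are invented and are not the course's results, ALCGRP is a hypothetical variable name, and prop.table is an assumed alternative to the spreadsheet chart, not something the narrator uses.

```r
# Invented counts, NOT the course's results: rows are asthma status (0/1),
# columns are the three alcohol groups. ALCGRP is a hypothetical name.
AsthmaAlcFreq <- as.table(matrix(
  c(90, 10,   # group 1: no asthma, asthma
    80, 20,   # group 2
    85, 15),  # group 3
  nrow = 2,
  dimnames = list(ASTHMA4 = c("0", "1"), ALCGRP = c("1", "2", "3"))
))

# Column proportions: within each alcohol group, the share with each status
PropByAlc <- prop.table(AsthmaAlcFreq, margin = 2)

# Asthma rate per alcohol group (the row where ASTHMA4 is 1)
AsthmaRateByAlc <- PropByAlc["1", ]
round(AsthmaRateByAlc, 3)
```

Using margin = 2 makes each column sum to 1, so the groups can be compared directly even when they differ in size.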
