Join Barton Poulson for an in-depth discussion in this video Creating scatterplots, part of Learning R.
When you're looking at associations in your data, if you want to look at how two quantitative variables are associated with each other, the most common approach is to create a scatterplot. R gives you some interesting options for creating scatterplots and for looking at the associations in your data. For this one, I'm going to be using the Google Correlate data that I used in the last movie. I'm going to load it by running line 6. I'll create a data frame called Google by reading the CSV file, google_correlate.csv, which has a header.
There I have 51 observations. There's one line for each state, and D.C. We're going to look at the names of the variables that are in that data set. We can look at the structure too if we want, just to get an idea of what things look like. I'm going to make this bigger for just a moment. Okay, that's pretty busy. I'm going to just clear it out for right now. What I want to ask is whether there's an association between the percentage of people in the state with college degrees, and interest in data visualization as a search term on Google. What I'm going to do is create a scatterplot.
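As a rough sketch of the loading and inspection steps just described: the course file google_correlate.csv is not reproduced here, so this example writes a tiny stand-in CSV to a temporary file first. The three rows, their values, and the state column are made up for illustration; only the column names degree and data_viz come from the narration, and the lowercase object name google is an assumption.

```r
# Stand-in for google_correlate.csv: three rows of made-up values written to
# a temporary file so the example runs without the course files.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "state,degree,data_viz",
  "A,20.5,-0.8",
  "B,27.0,0.1",
  "C,35.2,1.3"
), csv_path)

# header = TRUE tells read.csv that the first row holds the variable names.
google <- read.csv(csv_path, header = TRUE)

names(google)  # names of the variables in the data set
str(google)    # structure: each variable with its type and first values
nrow(google)   # number of observations (51 in the course data, 3 here)
```

With the real file, the call would point read.csv at google_correlate.csv instead of the temporary path.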
The default plot works well. All I say is plot; that means scatterplot, and I give my variables for X and Y. I'm going to put degree on the X, and so I say, use degree from the data set Google, and then I'm going to put data_viz on Y. So, I run line 13, and there's my plot. You can see that there's a strong positive association. The higher the percentage of people with college degrees, the greater the interest in data visualization as a search topic. That's actually a really clear trend.
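The default plot call might look something like the sketch below. The data frame is simulated here (51 made-up rows with a built-in positive trend) purely so the example runs on its own; it is not the Google Correlate data, and the lowercase name google is an assumption.

```r
# Simulated stand-in data: 51 made-up observations, NOT the course data.
set.seed(1)
degree <- runif(51, min = 17, max = 50)
google <- data.frame(
  degree   = degree,
  data_viz = 0.08 * (degree - 30) + rnorm(51, sd = 0.5)
)

# When run as a script, draw to a null PDF device so no Rplots.pdf is written.
if (!interactive()) pdf(NULL)

# Two numeric vectors give a scatterplot: X first, then Y.
plot(google$degree, google$data_viz)
```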
On the other hand, I'm going to clean up this chart a little bit. This is lines 15 through 20. I'm going to do the plot again, except this time I'm going to put a title on the top; that's main. Then I'm going to put a label on the X axis, xlab, Population with College Degrees, and a label on the Y axis, ylab, Searches for Data Visualization. Pch here is for representing the points, and I'm going to be using choice number 20, which is a small solid dot. I'm going to color it in gray.
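A cleaned-up version of the plot, using the arguments named above, could be sketched as follows. The axis labels are the ones from the narration; the title text is a placeholder, since the narration does not state it, and the data are simulated stand-ins rather than the course data.

```r
# Simulated stand-in data: 51 made-up observations, NOT the course data.
set.seed(1)
degree <- runif(51, min = 17, max = 50)
google <- data.frame(
  degree   = degree,
  data_viz = 0.08 * (degree - 30) + rnorm(51, sd = 0.5)
)

# When run as a script, draw to a null PDF device so no Rplots.pdf is written.
if (!interactive()) pdf(NULL)

plot(google$degree, google$data_viz,
     main = "Data Visualization Searches by College Degrees",  # placeholder title
     xlab = "Population with College Degrees",
     ylab = "Searches for Data Visualization",
     pch  = 20,      # plotting character 20: a small solid dot
     col  = "gray")  # point color
```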
So, I'm going to highlight those six lines together, and run those. Now we have this scatterplot with light gray dots, where you can still see the pattern, but there's less sort of fluff to it. We have the title on the top, and we have the labels for each axis. Now I'm going to do one more thing. When you're looking at an association in the scatterplot, even though we have a strong positive pattern here, it's really nice to have regression lines. I can add a regression line with abline. I'm going to use a linear model, lm, and it's going to be based on the association where I'm trying to predict data_viz, and then the tilde means predicting it from degree, and I'm going to color that line red.
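Adding the regression line might look like this sketch. It uses simulated stand-in data, and it passes the data frame through the data argument of lm, which is one common way to write the formula; the course script may write the variables differently.

```r
# Simulated stand-in data: 51 made-up observations, NOT the course data.
set.seed(1)
degree <- runif(51, min = 17, max = 50)
google <- data.frame(
  degree   = degree,
  data_viz = 0.08 * (degree - 30) + rnorm(51, sd = 0.5)
)

# When run as a script, draw to a null PDF device so no Rplots.pdf is written.
if (!interactive()) pdf(NULL)

# abline adds to an existing plot, so draw the scatterplot first.
plot(google$degree, google$data_viz, pch = 20, col = "gray")

# Linear model: the outcome (Y) goes before the tilde, the predictor (X) after.
fit <- lm(data_viz ~ degree, data = google)

# abline reads the intercept and slope from the fitted model.
abline(fit, col = "red")
```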
So, I'm just going to run line 23, and this is going to layer on top of the plot that I have already. So, you can see that there's a strong positive association if we draw a straight line through it. On the other hand, not every association is linear, and sometimes it's helpful to use a line that matches the shape of the data. One of those options is called a Lowess smoother, and that's what I'm going to do in line 25. I'm going to add a line, and it's going to be Lowess, and I'm going to be using it for degree and data_viz.
Please note that the order of the two variables is different here. The top one for the regression line, I had to put the Y first, and then the X. This one, I put the X, and then the Y. Also, in the top one, I use the tilde to say that the Y is predicted by the X. This one is simply putting what they are with a comma in between. I'm going to make this Lowess line blue. So, I'm going to run line 25, and that puts it on top of the plot. A Lowess smoother is sort of a moving average, and you can see here that actually it doesn't deviate tremendously from the linear regression line.
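The difference in argument order between the two lines can be seen side by side in a sketch like the one below, again on simulated stand-in data rather than the course data.

```r
# Simulated stand-in data: 51 made-up observations, NOT the course data.
set.seed(1)
degree <- runif(51, min = 17, max = 50)
google <- data.frame(
  degree   = degree,
  data_viz = 0.08 * (degree - 30) + rnorm(51, sd = 0.5)
)

# When run as a script, draw to a null PDF device so no Rplots.pdf is written.
if (!interactive()) pdf(NULL)

plot(google$degree, google$data_viz, pch = 20, col = "gray")

# Regression line: formula interface, Y ~ X.
abline(lm(data_viz ~ degree, data = google), col = "red")

# Lowess smoother: plain arguments separated by a comma, X first and then Y.
smooth <- lowess(google$degree, google$data_viz)

# lines draws the smoothed curve on top of the existing plot.
lines(smooth, col = "blue")
```

Here lowess returns a list with x and y components, with x sorted in increasing order, which is why it can be handed straight to lines.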
What both of these do is they emphasize the strong positive association between the percentage of the population in the state who have college degrees, and the relative interest in searching for data visualization on Google. These are really good ways of looking at the association between two quantitative variables, and will lead into regression, which we're going to do in a later movie.
The course continues with examples on how to create charts and plots, check statistical assumptions and the reliability of your data, look for data outliers, and use other data analysis tools. Finally, learn how to get charts and tables out of R and share your results with presentations and web pages.
- What is R?
- Installing R
- Creating bar charts for categorical variables
- Building histograms
- Calculating frequencies and descriptives
- Computing new variables
- Creating scatterplots
- Comparing means