From the course: Data Visualization in R with ggplot2

Scatterplots

- [Instructor] Scatterplots are one of the most basic visualizations, and they're often quite useful when exploring data. Scatterplots begin with a simple X-Y coordinate grid, and then we plot points on that grid by specifying their X and Y coordinates. For example, the red point on the screen has coordinates (2, 1), meaning that it has a value of two on the x-axis and a value of one on the y-axis. Similarly, the blue point has coordinates (1, 3), meaning that it is at location one on the x-axis and location three on the y-axis.

In ggplot, we use the point geometry to create scatterplots using the geom_point function. At a minimum, we must tell geom_point where to get its X and Y values, but we can also specify other aesthetics, such as the shape, color, size, and transparency of the points on the scatterplot. Here's a reference table showing you the names of these aesthetics in ggplot. Most of them are intuitive, with the exception of remembering that the alpha aesthetic is used to set transparency.

Let's try working with these in R. If you have access to the exercise files for this course, you can load the starting code file for this video, which contains the code shown here to load the data set and produce the simple scatterplot from the last video. If you take a careful look at this code, you can see that the first section simply loads the data. I'm going to go ahead and run this. And then if I look at my ggplot call, the first line calls the ggplot function and specifies that the data I would like to use is the college data set. That's the data that I just loaded into the environment in the college table. If I run this line by itself, I simply get the blank plot that we saw in the last video. Now I'd like to go ahead and add to this a call to geom_point. This second line adds a point geometry to the grid, specifying that we should plot a point for each college using tuition as the x-axis value and average SAT score as the y-axis value.
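The two-line call described above can be sketched as follows. This is a minimal illustration, not the course's exercise file: the `college` data frame here is a small made-up stand-in, and the column names `tuition` and `sat_avg` are assumptions (the transcript only says "tuition" and "average SAT score").

```r
library(ggplot2)

# Made-up stand-in for the course's college data set.
# The values are illustrative only; the column names are assumed.
college <- data.frame(
  tuition    = c(8000, 9500, 12000, 30000, 42000, 48000),
  sat_avg    = c(1000, 1050, 1120, 1180, 1300, 1420),
  control    = c("Public", "Public", "Public", "Private", "Private", "Private"),
  undergrads = c(25000, 18000, 30000, 4000, 6500, 2000)
)

# First line: start a plot on the college data (on its own, a blank plot).
# Second line: add a point layer, mapping tuition to x and average SAT to y.
p <- ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg))

print(p)
```

Printing `ggplot(data = college)` without the `geom_point` layer draws only the empty plot area, which matches the blank plot mentioned in the narration.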
When I run this entire statement, I get a simple scatterplot. Now let's try going beyond this and adding another dimension to the plot. Suppose I would like to differentiate between public and private schools in my plot. That value, public or private, is stored in a variable called control, and I can differentiate these schools by changing the shape of the point that's on the plot. Right now they're all circles, but maybe I want to use a different shape for public and private schools. I can do this by adding another aesthetic to the geom_point mapping. I'm simply going to say shape equals control. This is saying add a shape aesthetic and change the shape of each point based upon the type of control of each institution in the data set. Now when I run this command, I get a slightly different plot. If you look carefully, you can see that we have both circles and triangles on this plot. The circles represent private institutions, and the triangles represent public institutions.

Now this is still pretty hard to see. I don't really like using the shape aesthetic here because the triangles and circles really blend together. I do get the sense that there are more circles on the right and more triangles on the left, but I can't see that very well. I think maybe color would be a better aesthetic to represent this difference. Instead of using shape to represent control, I'm going to change this and tell ggplot to use color to represent control. And now when I run this command, I get the same data, but I've used color, and I can see a much more striking difference showing me that private schools generally have higher tuition than public schools, and that as SAT scores increase, tuition tends to increase as well.

I can also alter the size of each of these points. Let's go ahead and do that to represent the number of students at each school, so that larger schools, schools with more undergraduate students, have larger points.
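The shape and color mappings discussed above might look like the sketch below. The `college` data frame is again a made-up stand-in with assumed column names `tuition` and `sat_avg`; only `control` is named in the narration.

```r
library(ggplot2)

# Made-up stand-in for the course's college data set (illustrative values).
college <- data.frame(
  tuition    = c(8000, 9500, 12000, 30000, 42000, 48000),
  sat_avg    = c(1000, 1050, 1120, 1180, 1300, 1420),
  control    = c("Public", "Public", "Public", "Private", "Private", "Private"),
  undergrads = c(25000, 18000, 30000, 4000, 6500, 2000)
)

# Version 1: point shape depends on public/private control.
p_shape <- ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, shape = control))

# Version 2: the same data, but color carries the public/private distinction.
p_color <- ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, color = control))

print(p_shape)
print(p_color)
```

Because `shape` and `color` sit inside `aes()`, ggplot2 picks one shape or color per value of `control` and adds a legend automatically.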
I can do that by just adding another aesthetic to the geom_point call, and what I'm going to do is say I would like the size of the point to be set according to the undergrads variable in the college data set. And now when I run this command, I get points of different sizes. Bigger schools have bigger points, and smaller schools have smaller points. Now this is hard to see because when I increase the size of these points, a lot of them overlap. So I've gone and covered up some points with other points. That makes the chart difficult to interpret.

That's where transparency can be very useful. Right now, all of the points are solid. If I add a little bit of transparency, I'll be able to see through the points and get a sense of how many are at each location on the graph. You might recall from earlier that I can control this using the alpha aesthetic. I'm going to go ahead and add that to my geom_point call, but I'm going to put it in a different place. I'm going to put it after the data mapping. I'm going to add here alpha equals one. Now, the reason I put it outside of that mapping is that I'm not changing the transparency based upon any data in the data set. I simply want all of the points to have the same transparency. And when I run this, I get the same plot, because alpha equals one gives me opaque points.

What I basically do when I set the alpha value is set the percentage of opacity that I want, so alpha equals one means I want my points to be 100% opaque, meaning they're 0% transparent. I can go ahead and set this to the other extreme and say I would like them to be 1% opaque, which means 99% transparent, by setting alpha equal to one over 100. Now when I run the plot, the points are basically invisible. You can see blurs where there are a lot of points overlapping with each other, but it's incredibly difficult to pick out individual points, so I'm going to keep playing with this. What if I set it to 10% opacity, one over 10?
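As a rough sketch of the call at this point in the walkthrough: size is mapped to the `undergrads` variable inside `aes()`, while alpha is a fixed setting placed outside the mapping. The data frame is a made-up stand-in, and `tuition` and `sat_avg` are assumed column names.

```r
library(ggplot2)

# Made-up stand-in for the course's college data set (illustrative values).
college <- data.frame(
  tuition    = c(8000, 9500, 12000, 30000, 42000, 48000),
  sat_avg    = c(1000, 1050, 1120, 1180, 1300, 1420),
  control    = c("Public", "Public", "Public", "Private", "Private", "Private"),
  undergrads = c(25000, 18000, 30000, 4000, 6500, 2000)
)

p <- ggplot(data = college) +
  geom_point(
    # Inside aes(): values that vary with the data.
    mapping = aes(x = tuition, y = sat_avg, color = control, size = undergrads),
    # Outside aes(): one constant applied to every point (10% opacity).
    alpha = 1/10
  )

print(p)
```

Writing `alpha = 1/10` inside `aes()` instead would treat it as a data value to be mapped through a scale rather than as a fixed setting, which is why the narration places it after the mapping.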
Now when I run it I can see the points better, but that might not be good enough for me. What about if I try 50% transparency? Now I can get a pretty good sense of where the points are, but I can also see through them and get a sense of density. This is just a matter of preference and of your particular data set. You can go ahead and modify the alpha value until you find one that works for you. You now have the skills that you need to create interesting scatterplots from a data set in R.
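One way to experiment with alpha, as suggested above, is to wrap the plot in a small helper and redraw it for each candidate value. The helper name `plot_with_alpha` is hypothetical, the data frame is a made-up stand-in, and `tuition` and `sat_avg` are assumed column names.

```r
library(ggplot2)

# Made-up stand-in for the course's college data set (illustrative values).
college <- data.frame(
  tuition    = c(8000, 9500, 12000, 30000, 42000, 48000),
  sat_avg    = c(1000, 1050, 1120, 1180, 1300, 1420),
  control    = c("Public", "Public", "Public", "Private", "Private", "Private"),
  undergrads = c(25000, 18000, 30000, 4000, 6500, 2000)
)

# Hypothetical helper: build the same scatterplot with a given opacity.
plot_with_alpha <- function(a) {
  ggplot(data = college) +
    geom_point(
      mapping = aes(x = tuition, y = sat_avg, color = control, size = undergrads),
      alpha = a
    )
}

# Draw the plot at each opacity tried in the video: 100%, 1%, 10%, and 50%.
for (a in c(1, 1/100, 1/10, 1/2)) {
  print(plot_with_alpha(a))
}
```

Inside a `for` loop the plot must be printed explicitly, since ggplot objects are only drawn automatically when they are the result of a top-level expression.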
