Learn how to create statistical plots, including histograms and scatter plots.
- [Instructor] Statistical plots allow viewers to identify outliers, visualize distributions, deduce variable types, and discover relationships and correlations between variables in a dataset. In this course I'm going to show you how to use statistical plots to visually detect outliers, deduce variable distribution and type, and uncover relationships and correlations between variables. Histograms are very simple plots that are used to show variable distribution. Scatterplots, on the other hand, are used to show relationships between variables.
Scatterplot matrices show correlations between variables, and box plots show variable spread and are useful for outlier detection. Let me show you how to create these in Python. In this demonstration, we're going to be using NumPy and the pandas library. From pandas we want to import the tool for scatterplot matrices. So we'll say from pandas.tools.plotting. This is the module that has the tool.
We'll say import scatter_matrix. And we're also going to be using Matplotlib and Seaborn like we have throughout this chapter. So we'll import those. Looks like I've forgotten an "a" there, so I need to fix that and then I can run this and get our libraries imported into our environment. And we'll set our parameters for the Jupyter Notebook, like usual. And let's start with histograms first. We're going to use the mtcars dataset that we've been using throughout this course.
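For reference, the import cell described above might look like the sketch below on a current pandas installation. The `pandas.tools.plotting` path is the one spoken in the recording; newer pandas releases expose `scatter_matrix` from `pandas.plotting` instead. The figure size and the Seaborn style are assumptions, since the transcript only says the notebook parameters are set "like usual".

```python
import numpy as np
import pandas as pd
# Newer pandas: scatter_matrix lives in pandas.plotting
# (the recording uses the older pandas.tools.plotting path).
from pandas.plotting import scatter_matrix

import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; in Jupyter use `%matplotlib inline` instead
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sb

# Assumed notebook parameters (not stated in the transcript)
sb.set_style("whitegrid")
rcParams["figure.figsize"] = (5, 4)
```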
So let's load that and then we're going to isolate the mpg variable. So we'll say mpg equals cars and then select the mpg label index. And let's create a histogram of this variable. To do that we call the plot method off of the mpg variable and then pass in the argument kind='hist'. And there we have it, a histogram. You could also generate a histogram by calling the hist function and passing in the mpg variable.
So we could go plt.hist and then say mpg and print that out. It creates the same histogram. It's just a different method for creating a histogram with Matplotlib. Let me show you how to create a histogram with Seaborn. Seaborn's really great for statistical plotting which is another reason that I made sure to introduce this into most segments in this course. To create a histogram in Seaborn you call the distplot function and you pass in the name of the variable you're interested in.
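The two Matplotlib-based histogram calls described above can be sketched as follows. The small inline `cars` frame is an illustrative stand-in for the mtcars file the course loads, not the full dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for the mtcars data (only the mpg column is needed here)
cars = pd.DataFrame({"mpg": [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4]})
mpg = cars["mpg"]

# Option 1: call the plot method off of the Series
ax = mpg.plot(kind="hist")

# Option 2: call Matplotlib's hist function directly, on a fresh figure
plt.figure()
counts, bin_edges, patches = plt.hist(mpg)
```

Both calls bin the same values, so the two histograms should look the same apart from styling.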
So, for here, that'll be mpg. And it's nice 'cause we've got a trend line here. It shows the distribution as a line. And Seaborn does that for us automatically. You can kind of see the differences, it's a lot nicer for statistical plotting. And to create a scatterplot using Matplotlib, let's just create a scatterplot from data that's in our cars dataset. So, we're going to call the plot method off of that dataset.
And we're going to pass in kind=scatter. That's how you tell Python you want a scatterplot. And for our X variable, let's use the hp column from the cars dataset. And we'll pass in Y equal to mpg. So the mpg variable is plotted along the Y axis. And then let's give it a color equal to dark gray. We create a list with a string that says darkgray.
And then we'll say we want the marker size to be 150. To do that we say s equal to 150. And we print it out, and now we can see mpg is plotted on the Y axis, hp on the X axis, and it looks like they have a linear relationship between them. To do this in Seaborn you use the regplot function. So you'd say sb.regplot and, similar to what we did above, you'd pass in X equal to hp and Y equal to mpg.
But, here's where things differ. You're going to say data equal to cars. This argument tells Seaborn what dataset you want plotted. And then scatter equal to True. And as you can see, Seaborn automatically creates a trend line. Now let's look at how to create a scatterplot matrix. Seaborn is really the easiest library for creating scatterplot matrices. Use the pairplot function to do that. You'd say sb.pairplot and pass in the cars dataframe.
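The pandas scatterplot call described above can be sketched as follows, using a small illustrative stand-in for the mtcars data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for the mtcars data
cars = pd.DataFrame({
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4],
    "hp": [110, 110, 93, 110, 175, 105, 245, 62],
})

# kind="scatter" requests a scatterplot; c is a one-element color list
# and s sets the marker size, as in the narration.
ax = cars.plot(kind="scatter", x="hp", y="mpg", c=["darkgray"], s=150)
```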
It takes a few seconds to plot out. Okay, so we have a scatterplot matrix. But it's a lot of information. With so many variables it's too small for us to really deduce much. We just plotted out two dimensions of data, the X and the Y. But I'm going to show you now how to add a third dimension of information to this chart by adding in categorical coloring. First let's make a subset. We'll call it cars_df and we're going to make it a dataframe. So, we'll call our dataframe constructor and we'll use the special indexer function that you saw earlier in this course, and we'll select the columns with the label one, three, four and six.
And then we want to access the values in those columns. So we'll say .values and let's name those columns in our new data frame. So we'll say columns is equal to and create a list that says mpg, displacement (disp), hp and wt. Let's also isolate a target variable. For our target we're going to choose the column called am, which is for automatic and manual transmission. So, we'll again use our special indexer.
But we're going to select the column that is number nine and access the values in that column by calling .values. And let's set some target names for that variable. Target_names and then say, it can be either a zero for automatic transmission or a one for manual. Now in order to add this third dimension of information, what we're going to do is we're going to create a new variable in the cars data frame and we're going to call it group. It's going to be a categorical variable.
Then we're going to tell Seaborn to go through that variable and color each point in the scatterplot matrix according to the value in that category. So in order to add the variable, we say cars_df and we'll call this new variable, group. And then we'll say we want this group variable to be a series object. So we call our series constructor and we want to build the series from data in the cars target variable that we just created.
But we want to pass in dtype equal to category to tell Python that this needs to be a categorical variable. And then we use Seaborn's pairplot function. So we say sb.pairplot and we pass in cars_df, the name of the dataframe we want plotted. Then we're going to pass in an argument, hue equal to group. And this tells Seaborn to pick the colors for the points based on the values in the group column.
And also I'm going to pass in palette equal to hls. This is a pre-built color palette. I chose hls from the documentation, but you can also look in the documentation for yourself and pick your own color scheme. We plot this out. This is much easier to read because it's a smaller amount of data. And let's look at what it's saying to us. So the red indicates zero for automatic transmission and blue is manual transmission.
And so what this plot is actually telling us is that cars that weigh more tend to have automatic transmission, and when they weigh less, they tend to have manual transmission. Also we can deduce that cars that have an automatic transmission get fewer miles per gallon and cars with a manual transmission get more miles per gallon. I also want to show you how to build box plots in Matplotlib. To do that, you just call the boxplot method and pass in arguments for the column and by.
I'll show you. So let's say we want to create a box plot from the cars data. So we call the box plot method off of the cars dataframe. And we say we want to have the column, column is equal to the mpg variable and we want it plotted by the am variable, automatic or manual transmission. Let's create a second box plot that plots weight versus transmission type.
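The two box plots described above can be sketched with the dataframe's `boxplot` method. The inline data is an illustrative stand-in for the mtcars file.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for the mtcars data (am: 0 = automatic, 1 = manual)
cars = pd.DataFrame({
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4],
    "wt": [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190],
    "am": [1, 1, 1, 0, 0, 0, 0, 0],
})

# One box per transmission type, first for mpg and then for weight
ax_mpg = cars.boxplot(column="mpg", by="am")
ax_wt = cars.boxplot(column="wt", by="am")
```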
And this is basically just showing us the same thing we saw in the scatterplot matrix. Cars that have automatic transmission get fewer miles per gallon than cars that have a manual transmission. And also, cars that have a manual transmission tend to weigh less than cars that have an automatic transmission. One last thing I want to show you is how to create box plots in Seaborn. For that you'd use the boxplot function. So it's sb.boxplot.
You're going to say, on our X axis we want to have the transmission type plotted out, so X equal to am, and Y equal to the mpg variable. We're pulling our data from the cars dataset. So we'll say data equal to cars and again, we're going to use the palette of hls. Nice. We see a box plot that Seaborn created with a pretty color scheme, and it's equivalent to this Matplotlib box plot except that it's just prettier.
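The Seaborn box plot described above can be sketched as follows, again with an illustrative stand-in for the mtcars data. Newer Seaborn releases may warn when `palette` is passed without `hue`; the call below keeps the arguments as narrated.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sb

# Illustrative stand-in for the mtcars data (am: 0 = automatic, 1 = manual)
cars = pd.DataFrame({
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4],
    "am": [1, 1, 1, 0, 0, 0, 0, 0],
})

# Transmission type on the X axis, mpg on the Y axis, colored with the hls palette
ax = sb.boxplot(x="am", y="mpg", data=cars, palette="hls")
```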
This ties up the data visualization section for this course. But pretty soon I'm going to show you how to use these statistical plots to detect outliers and uncover correlations, and stuff like that. So stay tuned.
- Getting started with Jupyter Notebooks
- Visualizing data: basic charts, time series, and statistical plots
- Preparing for analysis: treating missing values and data transformation
- Data analysis basics: arithmetic, summary statistics, and correlation analysis
- Outlier analysis: univariate, multivariate, and linear projection methods
- Introduction to machine learning
- Basic machine learning methods: linear and logistic regression, Naïve Bayes
- Reducing dataset dimensionality with PCA
- Clustering and classification: k-means, hierarchical, and k-NN
- Simulating a social network with NetworkX
- Creating Plot.ly charts
- Scraping the web with Beautiful Soup