Learn how to generate summary statistics.
- [Instructor] Descriptive statistics describe a variable's values and their spread. For example, imagine you work for a company that monitors patients' health in real time and you need to build a script that detects dangerous anomalies. You could process the incoming data from the monitoring device in microbatches and calculate the mean, max, and standard deviation of each batch. With these descriptive statistics, you can generate automatic alerts when the device produces unusually high or low data points, which may indicate a potentially dangerous health status for the monitored patient.
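To make the monitoring example a little more concrete, here is a minimal sketch of flagging unusual readings in one microbatch. The function name `flag_anomalies`, the two-standard-deviation threshold, and the heart-rate numbers are all assumptions made up for illustration; they are not part of the course material.

```python
import numpy as np

def flag_anomalies(batch, threshold=2.0):
    """Return readings more than `threshold` standard deviations from the batch mean.

    Hypothetical helper: the threshold of 2.0 is an illustrative choice.
    """
    values = np.asarray(batch, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation of this batch
    if std == 0:
        return []  # all readings identical, nothing stands out
    return [float(v) for v in values if abs(v - mean) > threshold * std]

# Made-up heart-rate readings for one microbatch.
heart_rates = [72, 75, 71, 74, 73, 70, 76, 140]
alerts = flag_anomalies(heart_rates)
print("batch mean:", np.mean(heart_rates))
print("batch max:", np.max(heart_rates))
print("flagged readings:", alerts)
```

A real alerting script would probably compare against a longer history or clinically chosen limits rather than a single small batch, but the mean, max, and standard deviation play the same role.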
There are two categories of descriptive statistics: the kind that describe the values of the observations in a variable, and the kind that describe a variable's spread. Descriptive statistics provide a quantitative summary of a variable and the data points that comprise it. You can use them to get an understanding of a variable and the attributes that it represents. Let's look at descriptive statistics that describe the observations in a variable. Those are sum, median, mean, and max.
There are also the descriptive statistics that describe a variable's spread. Those are standard deviation, variance, counts, and quartiles. If this doesn't make real sense to you right now, just hold on, because I'm going to show you an example in the coding demonstration and you'll get what I'm talking about. But before that, I want to explain to you a few uses for descriptive statistics. Like in the example I just gave you, you can use descriptive statistics to detect outliers. You can also use them for planning data preprocessing requirements, and for selecting features for use in machine learning.
I'm going to show you more about how to do that in the next few videos, so hold on. Okay, let's get into the coding demonstration now. Here we are now back in the Jupyter notebook, and the first thing we're going to do is import our libraries. So in this demonstration we're going to be using numpy and pandas, as we did in the last chapter, but we're also going to be using scipy, so we need to import the scipy library by saying import scipy, and then from scipy we want to import the stats module, import stats.
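The imports described above would look roughly like this. The `np` and `pd` aliases are the usual community convention and are an assumption here, since the narration doesn't spell them out.

```python
import numpy as np
import pandas as pd

import scipy
from scipy import stats

# Quick check that the libraries loaded.
print(np.__version__, pd.__version__, scipy.__version__)
```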
And also in this demonstration we're going to be using the cars data set that we used in the last chapter, so we want to import that, and we'll print out the first few records in that data frame. Here you can see the variables in the cars data frame. Now I'm going to show you how to use the sum method. The sum method adds up the numbers in the columns or in the rows of a data frame. By default, sum will add up the values and provide a total for each column, but if you pass in the axis=1 argument then it will add up the values along the data frame rows instead.
Let me show you. We'll say here cars and call the sum method off of it. So what this has returned is the total of the values in each of the columns in the data frame. If you wanted Python to add up the values by row instead, you just call the sum method off of the cars data frame and then pass in axis=1 and print that out, and what you see here is that for each of the 32 rows in the data frame Python has gone through and generated a total horizontally, and that's what these return values are here.
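Here is a small sketch of the two sum calls just described. The four-row `cars` frame below is a made-up stand-in with a few of the same column names, not the actual cars data set from the demonstration.

```python
import pandas as pd

# Hypothetical toy data, standing in for the real cars data frame.
cars = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 33.9],
    "cyl": [6, 4, 8, 4],
    "gear": [4, 4, 3, 4],
})

column_totals = cars.sum()     # default axis=0: one total per column
row_totals = cars.sum(axis=1)  # axis=1: one total per row

print(column_totals)
print(row_totals)
```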
Now let's move on to median. The median method finds and returns the median value, or the middle value, from the columns or rows of a data frame. When you tell Python to find the median of an entire data frame, it returns the median value for each column of that data frame. Let's find the median of the cars data frame. We'll do that by writing cars and calling the median method off of it. And here we have it. For each column in the cars data frame, Python has gone in and found the median value and returned that.
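As a sketch of the median call, again on a made-up toy frame rather than the real cars data:

```python
import pandas as pd

# Hypothetical toy data, standing in for the real cars data frame.
cars = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 33.9],
    "cyl": [6, 4, 8, 4],
    "gear": [4, 4, 3, 4],
})

# One median per column. With an even number of rows, pandas averages
# the two middle values.
column_medians = cars.median()
print(column_medians)
```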
And mean works pretty much the same way, except that it's going to return an average value from either the columns or the rows. So in order to calculate an average value for each of the columns in the data frame we just write cars and then call the mean method off of that, and what you see here is the average of the values in each column of the original data frame. It's also useful to be able to find the max value of a variable.
So in order to do that you can call the max method off of your data frame and it will return the maximum value for each of the variables in the data frame. I'll show you now just by getting cars, calling the max method off of it, and here we go. What this represents, say for mpg, let's look at that one: for the mpg variable, the maximum value is 33.9. And I'm going to refer back to that in a minute here.
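The mean and max calls can be sketched the same way. The toy frame is invented for illustration; only the 33.9 in the mpg column is borrowed from the value quoted above.

```python
import pandas as pd

# Hypothetical toy data, standing in for the real cars data frame.
cars = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 33.9],
    "cyl": [6, 4, 8, 4],
    "gear": [4, 4, 3, 4],
})

column_means = cars.mean()  # average of each column
column_maxes = cars.max()   # largest value in each column

print(column_means)
print(column_maxes)
```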
We'll just keep that in mind. If you wanted to be able to identify the row where this maximum value came from, you'd call the .idxmax method to see the index value of the row that contains the maximum value. It's easier for me to show you. So first we're going to isolate the mpg variable, and then I'll call the idxmax method off of it. Alright, mpg.idxmax, and it returns a value of 19. That value 19 represents the index number of the row where the max value was found in the mpg variable.
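A minimal sketch of isolating one variable and calling idxmax on it. The toy frame is made up, so the index label it returns is different from the 19 seen in the demonstration.

```python
import pandas as pd

# Hypothetical toy data with a default integer index (0, 1, 2, 3).
cars = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 33.9],
    "cyl": [6, 4, 8, 4],
})

mpg = cars["mpg"]           # isolate the mpg variable as a Series
max_row = mpg.idxmax()      # index label of the row holding the max
print(max_row)
print(cars.loc[max_row])    # the full row where the max was found
```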
For mpg, the max value is 33.9, and that's found in row 19. Now it's time to look at summary statistics that describe a variable's distribution. The .std method calculates the standard deviation of columns or rows in a data frame. If you call the .std method off of a data frame it will return the standard deviation for each column in that data frame. I'll show you here how to do that now. You just say cars.std and then run that, and this returns the standard deviation for each variable in our data frame.
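Here is a sketch of the std call on a made-up toy frame. One detail worth knowing is that pandas computes the sample standard deviation by default (`ddof=1`); passing `ddof=0` gives the population version instead.

```python
import pandas as pd

# Hypothetical toy data, standing in for the real cars data frame.
cars = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 33.9],
    "cyl": [6, 4, 8, 4],
})

column_std = cars.std()                   # sample std (ddof=1) per column
column_std_population = cars.std(ddof=0)  # population std per column

print(column_std)
print(column_std_population)
```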
The .var method calculates the variance in columns or rows of a data frame. If you call the .var method off of a data frame object, by default it will return the variance for each column in that data frame. In order to do that we'll just write cars and then call the .var method off of the cars data frame. There's also the value_counts method. This method counts up the unique values in an array or a series object. It shows you how many times each unique value occurs in a data set.
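Going back to the .var call for a moment, a sketch on a made-up toy frame looks like this. The variance is simply the square of the standard deviation, which the last lines check.

```python
import pandas as pd

# Hypothetical toy data, standing in for the real cars data frame.
cars = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 33.9],
    "cyl": [6, 4, 8, 4],
})

column_var = cars.var()  # sample variance (ddof=1) per column
print(column_var)

# Variance and standard deviation describe the same spread:
# var == std ** 2 for every column.
difference = (column_var - cars.std() ** 2).abs()
print(difference.max() < 1e-9)
```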
Let's look at the gear variable in particular. I'll isolate that. And then let's see how many unique values are in the gear variable. We'll write gear, and then call the value_counts method off of that. Okay, so what you're seeing here is that the gear variable has three unique values. Those are three, four, and five. On the right side, you can see the count for each of those values. If you wanted to take a broader perspective, what this is really saying is in the cars data set there are 15 cars with three gears, 12 cars with four gears, and five cars with five gears.
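To sketch value_counts without the original file, the series below is rebuilt from the counts quoted above (15, 12, and 5). The order of the values in it is arbitrary and is an assumption; only the totals come from the demonstration.

```python
import pandas as pd

# Reconstructed from the counts mentioned in the narration; row order is made up.
gear = pd.Series([3] * 15 + [4] * 12 + [5] * 5, name="gear")

gear_counts = gear.value_counts()  # unique values as the index, counts as the values
print(gear_counts)
print("total cars:", gear_counts.sum())
```

The result is sorted from the most common value to the least common, so the three-gear cars come first.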
And I want to show you real fast an easy-peasy way to get an entire statistical description of a data set. This is with the describe method. It returns a full statistical description of each variable. To use it we'll just say cars and then call the describe method off of that, and here we have it. For each variable we get a count, mean, standard deviation, and min value. Here are the quartiles, the 25th, 50th, and 75th percentiles. These are a measure of the distribution of the variable, and here's the max.
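A sketch of describe on a made-up toy frame. The last step, subtracting the 25th percentile row from the 75th, is an addition of mine to show how the interquartile range can be derived from the output; it is not something describe reports directly.

```python
import pandas as pd

# Hypothetical toy data, standing in for the real cars data frame.
cars = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 33.9],
    "cyl": [6, 4, 8, 4],
    "gear": [4, 4, 3, 4],
})

summary = cars.describe()
print(summary)

# Rows of the summary: count, mean, std, min, 25%, 50%, 75%, max.
print(list(summary.index))

# Interquartile range per column, derived from the quartile rows.
iqr = summary.loc["75%"] - summary.loc["25%"]
print(iqr)
```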
Now that you know how to summarize numerical variables, let's move on to summarizing categorical ones.
- Getting started with Jupyter Notebooks
- Visualizing data: basic charts, time series, and statistical plots
- Preparing for analysis: treating missing values and data transformation
- Data analysis basics: arithmetic, summary statistics, and correlation analysis
- Outlier analysis: univariate, multivariate, and linear projection methods
- Introduction to machine learning
- Basic machine learning methods: linear and logistic regression, Naïve Bayes
- Reducing dataset dimensionality with PCA
- Clustering and classification: k-means, hierarchical, and k-NN
- Simulating a social network with NetworkX
- Creating Plot.ly charts
- Scraping the web with Beautiful Soup