Learn how to filter and select data.
- [Instructor] The first thing we're going to talk about today is filtering and selecting data with Pandas. But before going into the lesson, I want to just give you a brief recap of the Pandas library and some essential basics about using Python. Pandas is useful for its fast data cleansing preparation and analysis capabilities. It's easy to use for data visualization and machine learning. Pandas is actually built on top of the NumPy library and it makes it easy to work with arrays and matrices.
Arrays and matrices are called Series and DataFrames in Pandas. Throughout this course, we're going to consistently be using the NumPy and Pandas library. I want to give you a brief refresher on indexing and what is an index because it's really important for this lesson. An index is a list of integers or labels you use to uniquely identify rows or columns. In this course, we're going to be indexing using square brackets or the special field .ix indexer.
I also want to give you a brief recap of what a DataFrame object is. A DataFrame object is pretty much like a spreadsheet of rows and columns. The rows and columns individually are actually series objects in the Pandas library. And DataFrames are index-able. A series object is a single row or column and it's always indexed. In the following coding demonstration, we're going to be using comparison operators. So I wanted to give you this table.
You can review it on your own just to re-familiarize yourself with using comparison operators in Python. The two I wanted to point out are simply the greater than or less than symbols because we'll use those in the following coding demonstration. Next up, I'm going to show you how to do plain indexing, data slicing and arithmetic comparisons using Python and Pandas. Alright, let's get started. The first thing we need to do is import our libraries.
So we're going to import NumPy as np and import Pandas as pd. You also want to make sure to import series and DataFrame from Pandas. So we just write Pandas from Pandas. Import series, DataFrame. Back in the what you should know video, I showed you how to download and install Anaconda package. And within the Anaconda package, you got NumPy and Pandas.
What you're doing here is actually importing those libraries into the iPython environment. But if you didn't get a chance, you might want to go back and watch the video so you can see how to download and install Anaconda. And we just run this code. I like to use the shift+enter command and now we've imported our libraries. If we're selecting and retrieving data, we have to have a data object from which to select and retrieve. So let's create that now. We'll call it series_obj for series object.
And then we'll call the series constructor. I'm going to pass in a NumPy function called NumPy arrange and what I'm doing is telling NumPy to create a series of eight values from zero to seven. Let's also add an index by passing in the index argument. And what we'll do is we'll pass in a list of eight label indexes and we'll call those row one, row two and so on for each of the eight values in our series.
Let me just clean up these parentheses here. Okay, so we have now eight index labels and we'll print the object and see what comes out. Alright, great, so we have a series object with eight elements, eight numbers and it's indexed. When you write square brackets with an index value inside of them, this tells Python to select and retrieve all records at that index. You can write an index value in two forms.
Label index or integer index. In this demonstration I will show you how to use a label index to select a row. So let's just write our series object we just created. Write some square brackets. It's going to be for indexing. And say row seven. We're asking it select and retrieve the row that's labeled row seven. And as you can see here we get back value six. That makes perfect sense because that's the row that's indexed as row seven. Indexing works the same way of integer indexes.
You just pass in the integer value of the row you want to select and return and the execute the code to return those records. One thing I want to point out here is although we created label indexes for our series object, numerically the index value is still zero for the first row, one for the second row and so on. So in order to retrieve records based on the integer index value, we can just pass those values into the indexer.
Let me show you how here. We'll write series object and then we'll pass in zero and seven. We're saying we want the rows that have the integer index zero and seven. Let's see what it returns. Ya, perfect. It returned row one and row eight. This is the zeroth element in the series object. And this is the seventh element in the series object. That's why these two records were the ones that were returned.
You can also use Pandas special indexer method .ix to select values from within a DataFrame. First we'll create a DataFrame and then I'll show you how to use the special indexer. To create the DataFrame, we're going to use a random number generator from NumPy. So to do that, first we need to set the seed of the random number generator. We do this so that you get the exact same random numbers as you're going to see here in the demonstration. And then we'll create an object. We're going to call it DF_obj for a DataFrame object.
And call the DataFrame constructor and then we're going to use the random number generator to generate a series of 36 random numbers. To do that, we write np.random.rand and we pass in a value of 36 numbers. And then we're going to call the reshape method off of this, and what reshape does is it puts these numbers into a matrix with six rows and six columns.
And then what I want to do here is also create and index for the DataFrame. So we pass in an index argument and write a list with row label the indexes we want to use so row one through row six since we've got six rows. And let's also create some column indexes. So we'll say columns= and we'll create a list and add some column label indexes. Column one, two and remove the space there.
Let's print it out. DF_obj. See what that looks like. Okay great. So when you print this on your home machine, you should get the exact same series of numbers because you set your seed. And now we have a DataFrame. When you call the .ix special indexer in passing the set of row and column indexes, this tells Python to select and retrieve only the specific rows and columns. Here's how. We'll write the name of our DataFrame object and then we'll call the .ix special indexer off of it.
We'll setup brackets and then I'll say okay, I want only row two and row five and as far as columns we want column five and column two. We run that. We see that it's selected and retrieved the values that were indexed by the label indexes row two and row five and column five and column two.
If you go back and you look at the records here you can see row two at column two has got 4.37 and then down here what's been returned at row two column two is also 0.437. Now let's move on to data slicing. You can use slicing to select and return a slice of several values from a data set. Slicing uses index values so you can use the same square brackets when doing data slicing.
How slicing differs, however, is that with slicing you pass in two index values that are separated by a colon. The index value on the left side of the colon should be the first value you want to select. On the right side of the colon, you write the index value for the last value you want to retrieve. When you execute the code, the indexer then simply finds the first record and the last record and returns every record in between them. It's really simple. Let me show you how to do that. We'll use our series object and in between the square brackets we just write row three so that's our first record we want to return.
Return row seven. Don't forget your colon here because that tells Python to select row three, row seven, and everything in between. When we run this, you see we get row three, row seven and everything in between. Now we're going to talk about comparison operators and scalar values. Just in case you don't know that a scalar value is, it's basically just a single numerical value. You can use comparison operators like greater than or less than to return true/false values for all records to indicate how each element compares to a scalar value.
Going back to our DataFrame object, let's write an expression to see whether each value is less than point two. We do that by writing the name of the object, DF_obj and then we just write less than .2 and execute the code. Okay, let's check one of the value's that's returned. Here row one, column six returned a true value. That's saying that this element is less than point two. Let's check.
Here we go, row one, column six. It is less than point two. Alright, great. So that makes sense. Moving on to filtering with scalars, you can also use scalars and comparison operators to create a filter that limits what values are returned by the Python indexer. Let's tell Python to return all rows that have values greater than six. We'll use our series object here, so we'll write the name of out object series_obj and then put in some square brackets for the indexer.
And then inside them let's write series_obj and then we'll set that greater than six. What we're really doing here is we're telling Python to find all rows that have a value greater than six. When we write this expression inside of square brackets, that are written off of a series object, we tell Python to select and retrieve the rows that satisfy that expression. So let's run this and see what we get back. Okay, it's returning row eight.
And looking back at our series object, row eight has a value of seven. Seven is greater than six. That's why this was the row that was retrieved. Next we're going to talk about setting values with scalars. So I'll scroll up here a little bit. Setting is where you select all records associated with specific index values and then set those values equal to a scalar value. In this example, let's select row one, five and eight and set those equal to the value eight.
To do that we write the name of the object we want to work with. So that's series_obj. Create our square brackets and then name the rows. What we're doing here is we're specifying the label indexes for the rows we want to retrieve. So we're going to pass in row one, row five and row eight. And now we're going to say set those rows equal to eight. What this actually does is set the values in each of those rows equal to eight.
And we'll print this out. You see we get row one, row five and row eight. The elements at each of those rows has been set equal to eight. Filtering and selecting using Pandas is one of the most fundamental things you'll do in data analysis. Make sure you know how to use indexing to select and retrieve records.
- Getting started with Jupyter Notebooks
- Visualizing data: basic charts, time series, and statistical plots
- Preparing for analysis: treating missing values and data transformation
- Data analysis basics: arithmetic, summary statistics, and correlation analysis
- Outlier analysis: univariate, multivariate, and linear projection methods
- Introduction to machine learning
- Basic machine learning methods: linear and logistic regression, Naïve Bayes
- Reducing dataset dimensionality with PCA
- Clustering and classification: k-means, hierarchical, and k-NN
- Simulating a social network with NetworkX
- Creating Plot.ly charts
- Scraping the web with Beautiful Soup