Learn how to clean data.
- [Voiceover] It's really important to remove duplicates from your data set in order to preserve the data set's accuracy and avoid producing incorrect or misleading statistics. For example, imagine you're analyzing retail sales data, and shopaholic Sally came in three times and used three different credit cards to make purchases, but provided the cashier the same zip code, three two eight oh three, for each sale. Just based on the card numbers, Sally looks like three different customers, all from the three two eight oh three zip code.
If you fail to examine other attributes of the customer so that you can identify and remove duplicates, shopaholic Sally's records would skew the results of any customer demographic analysis, because Sally would be counted as three people rather than one. To market to three two eight oh three customers effectively, you need to understand their characteristics. Don't let duplicate records skew your analysis. It's time for me to show you how to actually remove duplicates from your data set.
In this demonstration, you're going to need to import numpy and pandas. So we'll say import numpy as np, and import pandas as pd, and also be sure to import your Series and DataFrame from the pandas library. Paste that in, and then run it. Okay, now we have the libraries we need. Okay, before I can show you how to remove duplicates, we need to have an object from which to remove them.
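The import cell described above might look like this (a minimal sketch of the standard setup):

```python
# Standard setup for this demonstration
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
```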
So, what I'm going to do is create a little data frame object, using the DataFrame constructor, and I'll just pass in a dictionary; I'll copy and paste it in. Basically, what this is doing is creating a data frame with three columns. The column names here are the dictionary keys, and the column values are the lists appearing on the right side of the colons. Let's print that, so you can see better what I mean. Okay, so we have a data frame here, and it's got duplicate rows, as you can see, so what we need to do is remove those rows.
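The transcript doesn't show the exact dictionary, but a reconstruction consistent with the duplicate pattern described later might look like this (the column names and values here are assumptions, not the author's verbatim code):

```python
from pandas import DataFrame

# Hypothetical data, chosen so the duplicate rows match
# the results described in the walkthrough
df_obj = DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                    'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                    'column 3': ['A', 'A', 'B', 'B', 'C', 'C', 'C']})
print(df_obj)
```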
That's where the dot duplicated method comes in. The .duplicated method searches the data frame, starting from the first row, and then moves down. For each original, non-duplicate row it finds, the method returns a false value. As it moves down the data frame, when it encounters a row that was found earlier in the data frame, it returns a true value, indicating that the row is a duplicate. So let me show you how to do this in practice. We'll just write the name of our data frame object, df underscore obj, and then call the duplicated method off of it. Really simple, huh? And then, when we execute this code, it returns a set of Boolean values, true or false.
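Assuming the reconstructed example data sketched earlier, the call and its Boolean output would look like:

```python
from pandas import DataFrame

# Hypothetical example data (an assumption, not the author's exact values)
df_obj = DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                    'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                    'column 3': ['A', 'A', 'B', 'B', 'C', 'C', 'C']})

# True marks a row that repeats an earlier row;
# False marks a first occurrence
print(df_obj.duplicated())
```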
So, looking at our results here, we see that a false value was returned for row one; that makes sense, since there are no rows that came before it. But let's look at a row that returned a value of true, row six. If we look at row four, we can see that row six is a duplicate of it. Row four returned a value of false, in other words, not a duplicate. That's because row four was the first row to contain that exact combination of values. Any subsequent rows that have the same combination of values will be counted as duplicates and return a true value.
Now that we've found the duplicate records, let's look at how we can drop them. To drop duplicate records, you can use the drop underscore duplicates method. The drop_duplicates method searches the data frame and drops any rows that are duplicates of rows that came before. To use this method, you just write the name of your pandas object and then call the drop underscore duplicates method off of it. Execute the code, and you can see here, for our data frame object, rows two, four, six, and seven are dropped. And does that make sense? Let's look back up at our data frame.
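On the same assumed example data, the call would be:

```python
from pandas import DataFrame

# Hypothetical example data (an assumption, not the author's exact values)
df_obj = DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                    'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                    'column 3': ['A', 'A', 'B', 'B', 'C', 'C', 'C']})

# Keeps only the first occurrence of each row and
# drops every later repeat
print(df_obj.drop_duplicates())
```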
So, row two, or the row with a series index value of one, was dropped, and that makes sense, 'cause it's a duplicate of the first row in the data frame. And then the next row that was dropped was row four, which makes sense because it's a duplicate of row three, and so on. So it looks like, yes, absolutely, all of our duplicate rows have been dropped from our data frame. I also want to show you how to drop records based on column values. In order to do that, I want to make a small change to our data frame.
So let's go back up and copy the code that we used to create the data frame, and I'm just going to change this letter here from a C to a D for the purpose of our demonstration. So we print it out; beautiful, we've got that. Now we're going to use the drop_duplicates method to drop rows from this data frame based on values in a column. When you pass in a label index, the drop_duplicates method searches only that column, and for each duplicate it finds in that column, it drops the entire row. If we pass in the column three label index, we'd predict that the rows with a series index value of one, three, and six would be dropped.
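Continuing with the assumed example data, changing one C to a D in column three (here, at index five, which is consistent with the index values the narration predicts will be dropped) would look like:

```python
from pandas import DataFrame

# Same hypothetical data as before, but with the 'C' at index 5
# of 'column 3' changed to 'D' for the demonstration
df_obj = DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                    'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                    'column 3': ['A', 'A', 'B', 'B', 'C', 'D', 'C']})
print(df_obj)
```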
That's because they contain duplicate records in column three of the data frame. And I just want to show you real quick how to do this in Python: you just name your object, call the drop_duplicates method off of it, and then pass in the name of the column index that you're interested in searching. So this tells Python to look at column three and, for each duplicate that's found along column three, drop that row.
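Using the modified example data, passing the column label restricts the duplicate check to that one column:

```python
from pandas import DataFrame

# Hypothetical modified data (an assumption, not the author's exact values)
df_obj = DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                    'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                    'column 3': ['A', 'A', 'B', 'B', 'C', 'D', 'C']})

# Deduplicate on column 3 only: drops any row whose column 3 value
# appeared in an earlier row, regardless of the other columns
print(df_obj.drop_duplicates(['column 3']))
```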
And just as we predicted, it dropped the rows that had the series index values one, three, and six, now we have no duplicates in column three. Now that I've shown you how to drop duplicates from your data, I just want to highlight the point that it's really important that you check your data for duplicates, and remove them if you find them. Now, it's time to move on, to data concatenation and transformation.