Learn about cleaning data in pandas, iterating over rows, selecting subsets of columns and rows, renaming columns, creating new columns, melting DataFrames, working with pandas dates, and joining tables.
- [Instructor] In this video, I will show you some fundamentals of cleaning a data set using Pandas. I will follow the excellent article, Tidy Data, by Hadley Whickham. Whickam again stresses the importance of keys variable organization. And also of dividing multiple observation into separate rows. That's much easier to handle once you're coding. It's really easiest to show you this in practice. We're going to use one of Whickam's example files, billboard.csv.
First, however, we import all the Python modules that we need. Then we load the CSV file. Loading the file, you'll see Unicode decoding error. This file was not saved with a default Python encoding, which is UTF-8. In Unix or OS X, we can easily find out the encoding with a command line utility, file. So, we invoke it with an exclamation mark.
In Windows, you may have to open the file with an editor. ISO-8859 is actually called Latin-1 in Python. So, let's try loading the file again by specifying the encoding. Everything went well this time. In Pandas, we can use head to see the first few lines of a table.
These rankings of songs are from year 2000. For each song on the Billboard, we get lots of information. Including the artist, the track name, the track length, the genre, the date when the song entered the Billboard, the date it peaked, and its ranking for 76 weeks. Let's look at the name of the columns. And let me correct the name of the object. We can do many things with this data set. For instance, we can plot the evolution of the ranking for any given song.
I'll just call the plot function from Matplotlib on the first song. And columns ranging from the first to the 76th. Matplotlib is usually pretty good at setting up a plot based on the index of a Pandas series. But in this case, it couldn't.
So, we're going to tell plot explicitly to use a range of one through 76. So, here's what the song did. It started really high then climbed to the top within eight weeks. Let's plot several songs at once. We can iterate over rows with the DataFrame method iterrows, which yields both the index and the content of the row.
We select the columns we wish to have, as before. Let's have a single color for our songs. And some transparency. We see that most songs don't last very long in the rankings. Let's get to cleaning. One problem with this DataFrame is that the rankings are not very usable for analysis because they're divided over multiple columns.
It would make more sense to have each ranking in a separate row with the week number as a variable. Technically, this is called melting each row into multiple ones, each of which represent a single ranking. For simplicity, I will work with a subset of the columns. So, let's create a new DataFrame, bshort, by using brackets to select a subset of columns.
Let's have a look. We should also rename the columns to better, simpler, and more consistent variable names. We can do that by assigning directly to the columns attribute. So, we'll be more consistent in the column names for weeks. And again, let's have a look. Next we get to melting using the Pandas melt method.
Melt requires that we specify which should be the identifier variables that are repeated for several rows. In our case, we need artist, track, time, and date.entered. Next we tell melt which should be the values, or observation columns. In this case, week one, week two, and week three. Next what should be the name of the column that holds the type of observation.
It would be week. And what should be the name of the column that holds the value of the observation. It should be rank. Let's try. To see that this works, it's better to select a specific song. We can do that with the Pandas method query which takes a query as a string in something like a natural language specification. For instance, track equals Liar.
Very good. We can do actually even better by converting the week to a number, which you can do using the apply method on the Pandas series for the column week. Apply takes a Python function. And we're going to define it on-the-fly using lambda. We need the third character in each string and we need to turn that into a number, an integer. Very good. Now, we can assign this series to the column week itself.
Next we should also turn dates into proper datetime objects. We've seen we can do this using the Pandas to_datetime method. And assign the results to the column date.entered. Once we have true dates, we can do date arithmetic and apply, for instance, Timedeltas, obtaining the correct date for each ranking in week one, two, and three.
For instance, we could compute the column date.entered for the first record plus a Timedelta of seven days. So, we do an operation involving two separate columns. We start with the date.entered and then we add a Timedelta of a week multiplied by the column week.
Minus one, because date.entered is already the date of the first week. And we assign the result again to a new column called just date. At this point, we may as well drop the column date.entered. This is done with drop. We need to tell Pandas that we're working with the columns, so axis 1. And we do this inplace.
Let's have a look. Finally, we sort both the columns and the rows. Artist, track, time, date, week, and rank. And to sort the rows, we use the method sort_values. We have obtained what we can call is a tidy table, but in doing so we have created some redundancy.
The artist name, the track name, and the track length appear in multiple rows. The correct way to remove the redundancy is to create a separate table with track data and linking it to the rankings using an index. In the context of relational databases, this is called data normalization. So, let's first get a table of tracks. We select the columns that we need and then we drop all the duplicates.
The DataFrame object is called bfinal, let me correct that. The index is unique and we need to carry it explicitly as a column. So, we reset the index and rename it id. I can assign directly to index.name. And as always, let's have a look.
We now perform a database style join operation between the new table of tracks and the Billboard table. We do so by matching by artist and track. So, Pandas merge to obtain a join between tracks and Billboard. I'm going to merge, so to match rows based on track and artist. And look at the beginning. This works.
It added the id column to tracks, but it duplicates the time column. To avoid that, we need to add it to the merge. Then we can drop the columns we don't need. Here's a simplified table. And with this we can quickly compute things such as the highest rank achieved by a song in weeks one, two, and three.
To do that, we select rows based on the week, setting it equal to one, for instance. Then we take the rank and look at the index for which the minimum is achieved. And finally, we can look at what that song was. To get artist and title, we look at the tracksid table.
This is all for this example of cleaning data.
- Installing and setting up Python
- Importing and cleaning data
- Visualizing data
- Describing distributions and categorical variables
- Using basic statistical inference and modeling techniques
- Bayesian inference