Learn how to concatenate and transform data.
- [Instructor] Knowing how to concatenate and transform data is really important in data analysis. Concatenation and data transformation are useful for getting your data into the structure and order you need for analysis. For example, imagine you're mailing out a piece of direct mail advertisement. You have one table with customer ID and name, and you have another table with customer ID, mailing address, and age. Your mailing application requires you to supply it with one table that contains only customer name and address. You generate this table by concatenating your two tables row-wise by customer ID, so that rows with the same ID line up.
Concatenating is simply combining data from separate sources. Transformation, on the other hand, is converting and reformatting data to the format necessary for your purposes. When you transform data, you convert it into the format that's required to facilitate analysis. In this demonstration, you're going to learn how to drop data, add data, and sort data. Going back to our example, transformation would be when you drop the age column in order to get your data into the exact format that the application would need. Let's look at how this works in practice.
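The direct mail example can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the table names, column names, and customer records below are hypothetical, and setting customer_id as the index is one of several ways to line the rows up.

```python
import pandas as pd

# Hypothetical customer tables; all names and values are made up for illustration.
names = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name": ["Ana", "Ben", "Cho"],
})
details = pd.DataFrame({
    "customer_id": [103, 101, 102],
    "address": ["3 Pine Rd", "1 Oak St", "2 Elm Ave"],
    "age": [41, 35, 28],
})

# Use customer_id as the row index so pd.concat can match rows by ID.
combined = pd.concat(
    [names.set_index("customer_id"), details.set_index("customer_id")],
    axis=1,
)

# Transformation step: drop the age column the mailing application does not need.
mailing = combined.drop("age", axis=1)
print(mailing)
```

Because the rows are matched on the index, it does not matter that the two tables list the customers in a different order.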
In this demonstration, you're going to need to use the NumPy and pandas libraries, so you want to import those as usual. And then be sure to import Series and DataFrame from pandas. Before I can show you how to concatenate data frames, we need to have some data frames to concatenate, so let's just build those. The first one we're going to call DF_obj, for data frame object, makes sense, right? And then we'll create a second data frame called DF_obj_2.
So first let's call our data frame constructor, and then we'll pass in the np.arange function and tell it to generate a series of 36 values, and put those into a matrix that's six by six, six rows and six columns. Okay, and then print that out just so you can see what it looks like. And then DF_obj_2, let's just call that Data Frame Two. You're going to repeat the same data frame constructor, and then we'll just pass in np.arange this time with 15 values, and we'll put it in a five by three matrix.
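The setup described so far might look like the following. It is a minimal sketch that uses the names from the demonstration (DF_obj, DF_obj_2) and assumes the usual np and pd import aliases.

```python
import numpy as np
import pandas as pd
from pandas import Series, DataFrame  # Series is used later in the demonstration

# 36 values (0 to 35) arranged as six rows by six columns.
DF_obj = DataFrame(np.arange(36).reshape(6, 6))
print(DF_obj)

# 15 values (0 to 14) arranged as five rows by three columns.
DF_obj_2 = DataFrame(np.arange(15).reshape(5, 3))
print(DF_obj_2)
```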
So we'll call reshape and pass in (5,3). Print that out. Okay great, so we have two data frames. Now let's practice concatenating them. To concatenate data you use the concat function. This function joins data from separate sources and combines them into one data table. If you want to join data tables based on their row index values, all you have to do is call the pd.concat function and then pass in the axis=1 argument.
The axis=1 argument tells pandas to concatenate the data frames by merging them along the row index values, and this results in an output table that is wider than, or in other words has more columns than, the individual data tables it was made from. So let me show you what I mean here. We'll call pd.concat, and then we want to pass in a list of the two data frame objects we just created. So write them in, and then make sure to pass in axis=1.
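A minimal sketch of that call, rebuilding the two data frames so the block runs on its own. The variable name wide is an assumption added for illustration.

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

DF_obj = DataFrame(np.arange(36).reshape(6, 6))
DF_obj_2 = DataFrame(np.arange(15).reshape(5, 3))

# axis=1 places the frames side by side, matching rows by their index values.
wide = pd.concat([DF_obj, DF_obj_2], axis=1)
print(wide)

# DF_obj_2 has no row 5, so its three columns hold NaN in that row.
print(wide.shape)
```

Note that the column labels 0, 1, and 2 now appear twice, once for each source data frame.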
And here we have a combined data table. Now let's look back at our original data frames. See, they're both narrower. What's happened is they've been joined along the row index values, so the output table is wider. The first data frame is on the left, and the second data frame is on the right, as you can see here looking at the column index values. But on the other hand, if you want to merge the tables along the column index values, that's going to result in a table that's longer than, or in other words has more rows than, the tables it's made from.
So to do this you can just call the concat function and not pass in an axis argument. That's because by default pandas is going to concatenate based on column index values. So let me show you what I mean. We'll just call the pd.concat function, and then pass in a list containing DF_obj and DF_obj_2, and notice how we're not passing in an axis argument.
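The default behavior can be sketched as follows, again rebuilding the data frames so the block is self-contained. The variable name tall is an assumption added for illustration.

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

DF_obj = DataFrame(np.arange(36).reshape(6, 6))
DF_obj_2 = DataFrame(np.arange(15).reshape(5, 3))

# With no axis argument, pd.concat uses axis=0 and stacks the frames vertically,
# matching columns by their labels.
tall = pd.concat([DF_obj, DF_obj_2])
print(tall)

# DF_obj_2 has no columns 3, 4, or 5, so its rows hold NaN in those columns.
print(tall.shape)
```

The original row index values are kept, so the labels 0 through 4 appear twice in the result.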
So by default this is what pandas will give you. It will concatenate based on the column index value so you're going to get a longer table. These are from our first data frame, and then our second data frame has been concatenated onto the bottom. After you concatenate data tables you often need to reformat the resulting table. To drop rows from a data frame you can use the .drop method and pass in the index values for the rows you want dropped. Let me show you.
We'll just take our DF_obj and then call the drop method off of it, and pass in the index values 0 and 2 as a list. And look, as you can see here the rows with the index values 0 and 2 have been dropped in the output. Keep in mind that drop returns a new data frame, so DF_obj itself is unchanged. It's just as simple to drop columns. The only thing you need to do differently is to make sure you pass in the axis=1 argument. Check it out. We write the name of our data frame, DF_obj, and then we call the drop method off of it, and pass in the same two index values, 0 and 2, but this time we're going to pass in axis=1, and then that tells pandas to drop the columns with those index values.
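Both drop calls can be sketched like this. The names rows_dropped and cols_dropped are assumptions added for illustration; the demonstration simply displays the results.

```python
import numpy as np
from pandas import DataFrame

DF_obj = DataFrame(np.arange(36).reshape(6, 6))

# Drop the rows labeled 0 and 2 (axis=0 is the default).
rows_dropped = DF_obj.drop([0, 2])
print(rows_dropped)

# Drop the columns labeled 0 and 2 by passing axis=1.
cols_dropped = DF_obj.drop([0, 2], axis=1)
print(cols_dropped)

# drop returns a new data frame; DF_obj still has all six rows and columns.
print(DF_obj.shape)
```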
Let's see how this looks. So as you can see here the columns that have the index values 0 and 2 have been dropped from the data frame. Now I want to show you how to add data. In order to do that, let's create a series object and then we'll add it to a data frame. So we're going to use the np.arange function and pass in a value of 6, so it's going to create a series of six numbers. I'll call it series_obj, and then let's just name it added_variable.
Set the name to the string 'added_variable' and print it out. And a good way to add data to a data set is to join data frames. To do that, all you need to do is call the join method on them. So in this case, let's join the data frame we created to the series object we just created. So we'll write DataFrame.join and then pass in the names of our objects, and we will call this joined data table variable_added.
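A minimal sketch of the series and the join, using the names from the demonstration. It assumes the name is set through the series' name attribute, which is one common way to do it; a series needs a name before it can be joined to a data frame.

```python
import numpy as np
from pandas import Series, DataFrame

DF_obj = DataFrame(np.arange(36).reshape(6, 6))

# A series of six numbers (0 to 5), named so it can become a column.
series_obj = Series(np.arange(6))
series_obj.name = "added_variable"
print(series_obj)

# Join the series to the data frame on the row index.
# DF_obj.join(series_obj) is the equivalent, more common spelling.
variable_added = DataFrame.join(DF_obj, series_obj)
print(variable_added)
```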
Let's print that out. And as you can see here, a column has been added to our original data frame. This is our original DF_obj data frame from zero to five, and then this added variable is the series we just created. Another way to add data to a data set is to use the append method. This method allows you to add rows to the bottom of a table. I'm just going to give you a simple example using the variable_added data frame that we just created.
So what I need to do is just call variable_added, and then write the append method off of it and pass in the same object name, variable_added. And for the first run let's also pass in the ignore_index argument, which tells pandas whether it should reindex the data frame or not. First we'll say False. Okay, let's call this added_datatable and then print it.
Now if you look here, our variable_added table has now been appended to itself. So here's our original variable_added table, and then here's another copy of it extended downward, so you have double length. But the thing I really wanted to point out here is that when you use the append method, you should always think about the ignore_index argument, which controls whether or not pandas reindexes your data frame for you. If you leave it out, it defaults to False.
We passed in False, so in this case it didn't reindex our data table and we have duplicate index values, which isn't useful. So it's a good idea to always reindex your data frame. So, let's just do that by passing in ignore_index=True. And I'll show you what you get with that. Okay, cool. So what's happened here is our original variable_added table has been added to itself and reindexed.
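One caveat for anyone following along on a current installation: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the append call only runs on older versions. The sketch below reproduces the same two runs with pd.concat, which accepts the same ignore_index argument. This is a substitution, not what the demonstration typed, and the name added_datatable_reindexed is an assumption added for illustration.

```python
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Rebuild variable_added as in the demonstration.
DF_obj = DataFrame(np.arange(36).reshape(6, 6))
series_obj = Series(np.arange(6), name="added_variable")
variable_added = DF_obj.join(series_obj)

# First run: keep the original index values, so 0 to 5 appear twice.
added_datatable = pd.concat([variable_added, variable_added], ignore_index=False)
print(added_datatable)

# Second run: let pandas assign a fresh index from 0 to 11.
added_datatable_reindexed = pd.concat(
    [variable_added, variable_added], ignore_index=True
)
print(added_datatable_reindexed)
```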
So you get the same values in the data table but new index values, so you get a unique index for each row. The last thing I want to show you is how to sort your data. For that you use the sort_values method. With this method, you always pass in an argument called by. The by argument tells pandas what column you want the data table to be sorted by. Let me show you. We'll write our DF_obj data frame and then call the sort_values method off of it, and pass in by=5.
Then we'll tell it ascending=False. Basically, we want pandas to sort the data table by the column that's indexed 5, and we want it to sort it in descending order. Let's call the output table DF_sorted and then print the whole thing. Now as you can see here the values in the column marked five are ranked in descending order.
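The sort can be sketched as follows, using the names from the demonstration. With the np.arange data built earlier, column 5 already increases from top to bottom, so sorting it in descending order simply reverses the rows.

```python
import numpy as np
from pandas import DataFrame

DF_obj = DataFrame(np.arange(36).reshape(6, 6))

# Sort rows by the values in the column labeled 5, largest first.
DF_sorted = DF_obj.sort_values(by=5, ascending=False)
print(DF_sorted)

# The row index labels travel with their rows, so they are no longer in order.
print(list(DF_sorted.index))
```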
Look over at the series index values. See how they are not in numerical order? That's because the table has been sorted to place column five's values in descending order. That's it for concatenating and transforming data. Just remember, these methods are super useful when you're working with separate data sources and you need to put them into one combined data table.
- Getting started with Jupyter Notebooks
- Visualizing data: basic charts, time series, and statistical plots
- Preparing for analysis: treating missing values and data transformation
- Data analysis basics: arithmetic, summary statistics, and correlation analysis
- Outlier analysis: univariate, multivariate, and linear projection methods
- Introduction to machine learning
- Basic machine learning methods: linear and logistic regression, Naïve Bayes
- Reducing dataset dimensionality with PCA
- Clustering and classification: k-means, hierarchical, and k-NN
- Simulating a social network with NetworkX
- Creating Plot.ly charts
- Scraping the web with Beautiful Soup