Charles Kelly explains how to create Pandas objects using Python data structures.
- [Instructor] The Object Creation file from your Exercises File folder is pre-populated with import statements for Pandas and NumPy, a series data set and date time index. Pandas uses NumPy's fast and efficient methods for accessing arrays. The purpose of this video is to provide a rapid introduction to Pandas' capabilities and to build your intuition for Pandas. Since this is a rapid introduction, we'll cover many distinct techniques without providing much context for why you might use any particular technique.
We'll reuse the techniques and provide context for them throughout the course. Begin by placing your cursor in the cell that contains the import statements. Press Shift+Enter. This changes the namespace for Pandas and NumPy. Instead of typing numpy as a prefix for all of NumPy's functions, we can simply type np. Similarly with Pandas, we can simply type pd.
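As a minimal sketch of the import cell described above (the exercise file itself is not reproduced here, so the exact cell contents are an assumption), the conventional aliases look like this:

```python
import numpy as np   # NumPy functions are now reachable as np.<name>
import pandas as pd  # Pandas functions are now reachable as pd.<name>

# Both aliases refer to the full modules.
print(np.__name__, pd.__name__)
```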
Series and DataFrames are the primary data types within Pandas. These data types are implemented using NumPy data structures. In this cell, we define a series and initialize it with a Python list. Notice the inclusion of NumPy's not-a-number value, NaN. Pandas provides several ways of managing missing data, including the use of NaN to identify missing data. Place your cursor in this cell. Press Shift+Enter.
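A short sketch of a series built from a Python list with one missing value. The name my_series and the list values are illustrative assumptions, not necessarily the ones in the exercise file:

```python
import numpy as np
import pandas as pd

# Illustrative values (assumed); np.nan marks the missing entry.
my_series = pd.Series([1, 3, 5, np.nan, 6, 8])
print(my_series)

# isna() flags which entries Pandas treats as missing.
print(my_series.isna())
```

Because NaN is a floating point value, the whole series is stored as floats even though the other entries are integers.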
Now you'll see the contents of my series. A DatetimeIndex is a tool for performing date arithmetic on Pandas Series that you choose to treat as time series. In this case, the DatetimeIndex begins on the first of January 2016 and uses the default frequency of D, which indicates daily. You can execute this cell and see the results: January 1st, January 2nd, January 3rd, et cetera.
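One way to build such an index is with pd.date_range, sketched below. The name my_dates_index and periods=6 are assumptions (six dates fit the six-row data set used later); the start date and daily frequency come from the narration:

```python
import pandas as pd

# Start on 1 January 2016; the default frequency is daily.
# periods=6 is an assumption chosen to match a six-row data set.
my_dates_index = pd.date_range("2016-01-01", periods=6)
print(my_dates_index)
```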
Now, we'll chain together NumPy methods to create a sample data set that we will use to illustrate some of Pandas' capabilities. I'm copying and pasting from the final version of the Object Creation file in the Exercises File folder. Once you've pasted this information, or typed it if you prefer, press Shift+Enter. This displays the NumPy array. Note that it contains 24 elements, arranged in six rows and four columns, beginning with zero and ending with 23.
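The array described above can be reproduced by chaining arange and reshape. This is a sketch; the variable name sample_numpy_data is an assumption and the exercise file may chain the calls differently:

```python
import numpy as np

# arange(24) yields 0..23; reshape arranges them as 6 rows by 4 columns.
sample_numpy_data = np.arange(24).reshape(6, 4)
print(sample_numpy_data)
```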
If you're having difficulty remembering NumPy functions, I highly recommend that you watch my course on NumPy in the Lynda.com library. In this cell again, I'll copy and paste from the final version from your Exercises File folder. Here, we're going to create a sample data frame. We're using Pandas' DataFrame function with the sample data that we created using NumPy, the index that we created and finally a named parameter called columns, which contains the letters A, B, C and D. These serve as the column headers for the four data columns of our array, which appear as the second through fifth columns of the display, after the index.
Press Shift+Enter and you'll see how Pandas formats this data from the NumPy array. Note that the data frame is displayed in six rows and four columns, which matches how we reshaped our data using NumPy. NumPy's arange function returned integers, and the data are displayed in the data frame as integers, that is, without a decimal point. Note that the dates that we created are used as an index into the Pandas data structure, one date for each row.
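Putting the pieces together, a data frame like the one described can be sketched as follows. The variable names are assumptions; the index, the columns parameter and the 6 by 4 integer data follow the narration:

```python
import numpy as np
import pandas as pd

sample_numpy_data = np.arange(24).reshape(6, 4)
my_dates_index = pd.date_range("2016-01-01", periods=6)  # one date per row

# The dates label the rows; the letters A-D label the four data columns.
sample_df = pd.DataFrame(
    sample_numpy_data,
    index=my_dates_index,
    columns=list("ABCD"),
)
print(sample_df)
```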
We can also create a data frame using a Python dictionary. Notice the curly braces inside the function call. The dictionary keys are used as the column headers: float, time, series, et cetera. In the Pandas version used here, the columns are sorted into alphabetical order by key; more recent versions keep the keys in the order they appear in the dictionary. To see this, press Shift+Enter. Notice that when a key is associated with a single value, as in the float key-value pair, that single value is displayed in every row of the column.
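A hedged sketch of a data frame built from a dictionary. The keys float, time and series come from the narration; the array key and all of the values are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df_from_dictionary = pd.DataFrame({
    "float": 1.0,                                    # single value, repeated in every row
    "time": pd.Timestamp("2016-08-25"),              # single value, repeated in every row
    "series": pd.Series(1, index=range(4), dtype="float32"),
    "array": np.array([3] * 4, dtype="int32"),       # hypothetical extra column
})
print(df_from_dictionary)
```

The four-element series and array determine the number of rows, and the scalar entries are broadcast to fill them.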
Again, pasting from the final file, we can use the dtypes attribute to find information about the data frame. Notice that the data frame retains the types from the Python dictionary. Once again, pasting. When working with large data sets, it is sometimes convenient to display only a small subset of the data. The data frame's head function displays the first few rows of data. If no argument is given, five rows of data are presented.
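The dtypes attribute and the head function can be tried on a small stand-in data frame. The column names and values below are assumptions chosen so that there are more than five rows to show:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "float": 1.0,                                   # Python float -> float64
    "ints": np.arange(6, dtype="int32"),            # keeps its int32 type
    "small_floats": np.ones(6, dtype="float32"),    # keeps its float32 type
})

print(df.dtypes)   # one type per column
print(df.head())   # no argument: the first five of the six rows
```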
In a similar fashion, we can display information from the end of the data frame. In this case, we're using an argument of two, telling Pandas to display the last two rows of the data frame. In this case, I'm using the notebook's tab completion to pre-populate a cell with information. Here I'm asking for the sample data frame and using its values attribute. Pressing Shift+Enter, we get the underlying NumPy array for the Pandas data frame.
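A sketch of tail and the values attribute on the sample data frame, rebuilt here so the snippet runs on its own (variable names are assumptions):

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=pd.date_range("2016-01-01", periods=6),
    columns=list("ABCD"),
)

last_two = sample_df.tail(2)         # the last two rows
underlying_array = sample_df.values  # the data as a plain NumPy array

print(last_two)
print(underlying_array)
```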
I can obtain the index in a similar fashion. Finally, if I want to obtain the columns, I can do that using the columns attribute. You can obtain a quick statistical summary of the data within a data frame by using the describe function. Notice that although the underlying data are integers, the statistical summaries are presented as floating point numbers. You can control many aspects of Pandas with the set_option function.
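The index and columns attributes and the describe function, sketched on the same rebuilt sample data frame (variable names are assumptions):

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=pd.date_range("2016-01-01", periods=6),
    columns=list("ABCD"),
)

print(sample_df.index)    # the row labels (dates)
print(sample_df.columns)  # the column labels

# Count, mean, std, min, quartiles and max for each column, as floats.
summary = sample_df.describe()
print(summary)
```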
You can see some of these options by navigating to this URL. For example, I can set the display precision option to two decimal places. Now, if we use the describe function once again, we'll see that the information is presented with two places after each decimal point instead of six places after each decimal point. I can transpose the rows and columns within a data frame by using the T attribute. Here we can see that the column headers are now the dates and the previous column headers are now the row indices.
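A minimal sketch of both steps. The option name display.precision is the standard Pandas option for this setting, though the exact call used in the video is an assumption, and it changes only how numbers are printed, not the stored values:

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=pd.date_range("2016-01-01", periods=6),
    columns=list("ABCD"),
)

# Show two digits after the decimal point in printed output.
pd.set_option("display.precision", 2)
print(sample_df.describe())

# T swaps rows and columns: dates become headers, A-D become row labels.
transposed = sample_df.T
print(transposed)
```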
The information is again presented in integer format. You can sort data along a particular axis by using the sort_index function. Note that we use the axis parameter, set equal to one for columns and zero for rows. You can sort a data set's values by using the sort_values function. In this case, we set the by parameter to B, indicating that we want to sort the information by the values in column B.
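Both sorts can be sketched on the rebuilt sample data frame. Using ascending=False is an assumption made here so that the effect is visible, since the sample data are already in ascending order:

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=pd.date_range("2016-01-01", periods=6),
    columns=list("ABCD"),
)

# axis=1 sorts by column labels; axis=0 would sort by row labels.
sorted_by_columns = sample_df.sort_index(axis=1, ascending=False)
print(sorted_by_columns)

# Sort the rows by the values found in column B.
sorted_by_b = sample_df.sort_values(by="B", ascending=False)
print(sorted_by_b)
```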
In this video, we covered diverse techniques including how to create series and data frame objects. We also covered how to perform some elementary operations upon these objects.
Watch this course to gain an overview of Pandas. Charles Kelly helps you get started with time series, data frames, panels, plotting, and visualization. All you need is a copy of the free and interactive Jupyter Notebook app to practice and follow along.
- Using the Markdown language and Jupyter Notebook
- Creating objects
- Selecting objects
- Using operations
- Merging data
- Creating series
- Creating data frames
- Creating panels
- Annotating plots and data frame plots