Learn how to load a CSV in Pandas. Learn how to check the size of the data before loading it into memory (both in pure Python and with shell commands). Learn how to get initial statistics and information on the loaded DataFrame.
- [Instructor] We're going to explore Pandas by processing the log of one of my jogging sessions to see how slow I run. We're going to use the data found in track.csv. You can access the file from the exercise files. Let's start with what I call a first look at the data. Pandas will load the whole CSV into memory. It's a good idea to have a quick look at the data to make sure we're not loading corrupted data, or a file that's too big to fit in memory. We'll talk later about what to do when the data is too big to fit in memory. The first thing I do is take a look at the size of the file.
I'm going to do it both ways: with shell commands, which work only on Mac and Linux, and in pure Python, which works on all operating systems. Let's start a new notebook. Click on New, then Python 3, and let's name this track. If you remember, we said we can easily execute shell commands from Jupyter. Let's assign the name of the file to a variable. In my case, I've extracted the exercise files to my desktop. So I write from os import path.
Then fname equals path.expanduser, and then the path to where your file is. I need to remove the quotes that Jupyter added for me, and execute that. On your machine the path of the file might be different. If you're on Windows, and you copy and paste the file path from Explorer, you might have an issue with backslashes. The backslash has a special meaning in Python strings. For example, backslash n is the new line character. Here is an example. Let's print some file.
Let's say it's in C:\path\to\nowhere.csv. If you look at what is printed, the backslash n is interpreted as a new line, and the backslash t is interpreted as the tab character. If you try to open this file, it will fail. The easiest solution is to add an r in front of the string. Let's do that, and execute this one. These are called raw strings in Python. In a raw string, the backslash has no special meaning. This is a bit of an aside from what we are doing, but for Windows users, it can save a lot of time and frustration.
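To make the backslash issue concrete, here is a minimal sketch comparing a normal string with a raw string. The path C:\temp\new.csv is a made-up example chosen so that both escapes (\t and \n) appear; it is not a file from the exercise files.

```python
# Hypothetical Windows-style path, used only for illustration.
plain = 'C:\temp\new.csv'   # \t becomes a tab, \n becomes a newline
raw = r'C:\temp\new.csv'    # raw string: backslashes are kept as written

print(plain)        # the output is split over two lines and contains a tab
print(raw)          # prints the path exactly as typed
print(repr(plain))  # repr makes the hidden escape characters visible
```

Using forward slashes in the path is another common workaround, since Python on Windows generally accepts them in file paths.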
I'm going to show some command line utilities, which work only on Mac or Linux. If you're on Windows, run only the pure Python commands, and ignore the commands starting with an exclamation point. Let's get back to our file. To view the size, we can use the ls command. So I write !ls -lh, and then the file name in quotes. You see that the file is only 43K, which we can safely load into memory. The same command in pure Python is path.getsize of fname.
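As a rough sketch of the pure Python size check, the block below calls path.getsize and also converts the result to kilobytes by dividing by 1 << 10 (that is, 1024). The temporary file is a hypothetical stand-in so the example runs anywhere; in the notebook, fname would point at track.csv instead.

```python
from os import path
import tempfile

# Hypothetical stand-in file (4 bytes per line, 1024 lines).
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as tmp:
    tmp.write('a,b\n' * 1024)
    fname = tmp.name

size_bytes = path.getsize(fname)   # size in bytes
size_kb = size_bytes / (1 << 10)   # 1 << 10 == 2 ** 10 == 1024

print(size_bytes, 'bytes')
print(size_kb, 'KB')
```

The parentheses around 1 << 10 matter: division binds more tightly than the shift operator, so without them Python would try to shift the result of the division.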
This is in bytes. We'd like to see it in kilobytes. So we can do path.getsize of fname divided by 1 shifted left by 10. The double less-than sign is the left shift operator. It's basically two raised to the power of 10. Now let's look at the start of the file. First with the shell command head, and then in pure Python: with open fname as fp, for line number and line in enumerate fp.
Enumerate is a function that gives us the line number and the line itself. If the line number is more than 10, we'll stop. Otherwise we'll print the line without the new line at the end. It's a good habit to open files with a with statement, to make sure that they are closed once we're done with them. In both cases, we see that we have a CSV file with four columns. CSV is a very common format that can easily be imported to and exported from Excel.
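The preview loop described above might look something like the following sketch. The stand-in file and its column names (time, lat, lng, height) are assumptions made for illustration; the real track.csv may differ.

```python
import tempfile

# Hypothetical stand-in for track.csv: a header plus 20 made-up rows.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as tmp:
    tmp.write('time,lat,lng,height\n')
    for i in range(20):
        tmp.write(f'{i},0.0,0.0,0.0\n')
    fname = tmp.name

preview = []  # kept only so the result can be inspected afterwards
with open(fname) as fp:  # the with statement closes the file for us
    for lineno, line in enumerate(fp):
        if lineno > 10:  # enumerate starts at 0, so this keeps 11 lines
            break
        text = line.rstrip('\n')  # drop the trailing newline
        preview.append(text)
        print(text)
```

Because the loop stops early, only the first few lines are ever read, which keeps the check cheap even for a very large file.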
Because of this, you'll see CSVs everywhere. But it's not the best format for storing your data, since it doesn't store type information and can be difficult to get right with textual data. Let's see how many lines there are in the file. We start with the Unix shell command wc -l, and then the file name. In Python: with open fname as fp, print the sum of one for every line in fp. I highly recommend that you do these steps before you load data blindly into a data frame.
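A minimal sketch of the pure Python line count follows. The stand-in file is hypothetical; it is built with a header and 740 made-up rows so that the count comes out to the 741 lines mentioned in the session.

```python
import tempfile

# Hypothetical stand-in for track.csv: 1 header line + 740 data lines.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as tmp:
    tmp.write('time,lat,lng,height\n')
    for i in range(740):
        tmp.write(f'{i},0.0,0.0,0.0\n')
    fname = tmp.name

with open(fname) as fp:
    # The generator yields 1 per line, so the file is never held in memory at once.
    n_lines = sum(1 for _ in fp)

print(n_lines)
```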
These checks help you understand what the data is and whether it contains errors. Now let's load it into Pandas. First, we import Pandas: import pandas as pd. Then we say df equals pd.read_csv of fname. By convention, we call the variable holding the data frame df. Now let's see what we have here. Let's start by looking at how many rows there are. The data frame has one row fewer than the file: if you remember, when we did the line count before, we got 741.
So where did one row go? The answer is that Pandas uses the first row for the column names, as we can see in df.columns. We can get some general information with df.info, which will tell us how many rows there are and what the columns are. We can also see some content with df.head, which shows us just the beginning of the data frame.
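The loading and first-look steps can be sketched as below. To keep the example self-contained, it reads a tiny in-memory CSV instead of track.csv; the column names and values are made up for illustration. In the notebook you would pass fname to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical stand-in data: 1 header line + 3 data lines.
csv_text = """time,lat,lng,height
2015-08-20 03:48:07,32.51,35.01,130.0
2015-08-20 03:48:24,32.52,35.02,131.0
2015-08-20 03:48:41,32.53,35.03,132.0
"""

df = pd.read_csv(io.StringIO(csv_text))

print(len(df))     # number of rows; the header line is not counted
print(df.columns)  # the first line of the file became the column names
df.info()          # row count, column names, dtypes, memory usage
print(df.head())   # the first rows (up to five by default)
```

Here the text has four lines but the data frame has three rows, which is the same one-row difference seen with the real file.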
- Working with Jupyter notebooks
- Using code cells
- Extensions to the Python language
- Markdown cells
- Editing notebooks
- NumPy basics
- Broadcasting, array operations, and ufuncs
- Folium and Geo
- Machine learning with scikit-learn
- Plotting with matplotlib and bokeh
- Branching into Numba, Cython, deep learning, and NLP