In this video, see how to use read_csv().
- [Narrator] Data input and validation. In this video, we will cover how to read data into a Pandas DataFrame and validate the input. You can read data stored in a wide variety of formats, such as Excel, JSON, or SQL database tables, amongst others. Since our file is a CSV file, we will use read_csv, which will allow us to read a comma-separated file into a DataFrame. read_csv has numerous parameters and is very feature-rich.
You can specify a wide variety of options, including which column headings to use, additional delimiters for files, and ways to parse any data fields. We could easily spend 30 minutes just looking at the different parameter options for read_csv. For now, we will focus on just a couple of the parameters. The one required parameter to read_csv is the path to the CSV file. The skiprows parameter does exactly what its name says: if the first few lines do not hold any relevant data, you can skip them.
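To make those options concrete, here is a minimal sketch of a few read_csv parameters. The sample text, the column names, and the choice to parse one field as a date are all assumptions made for illustration; they are not the course's data.

```python
import io
import pandas as pd

# Hypothetical file contents: semicolon-delimited, with no header row.
raw = "2020-01-05;Ana;90\n2020-01-06;Ben;85\n"

# read_csv accepts a path or a file-like object; StringIO stands in for a file here.
df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                           # use a delimiter other than the comma
    names=["date", "name", "score"],   # supply our own column headings
    parse_dates=["date"],              # parse this column as dates
)
print(df.dtypes)
```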
We will now head over to our Jupyter Notebook to work through this. The first thing we'll want to do is to import Pandas. We now need to specify the name for our DataFrame. I'm going to call it oo. As I start typing oo = pd.read, I can hit Tab, and this will give me the options of all of the different types of formats that I can read in. Since we're using a CSV file, I'll select read_csv.
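If you are not working in a notebook with Tab completion, a rough equivalent is to list the reader functions yourself. This is only a sketch; the exact list depends on the Pandas version you have installed.

```python
import pandas as pd

# Collect every top-level pandas function whose name starts with "read_".
readers = sorted(name for name in dir(pd) if name.startswith("read_"))
print(readers)
```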
And this is a very helpful tip. If my cursor is in between those brackets, I can just hit Shift and Tab, and this will provide me documentation. If I hit Shift and Tab again, I will get further documentation, and then I can hit Shift and Tab three times, and that will provide me all of the options available within the read_csv function or any method within Pandas. I now need to specify the path to that CSV file. Since I know that the name of my file is olympics, I can just type O and hit Tab, and it autocompletes that for me.
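Outside Jupyter, where the Shift and Tab shortcut is not available, the same documentation can be reached from plain Python. A small sketch, using only the standard inspect module and the built-in help function:

```python
import inspect
import pandas as pd

# Inspect the parameters that read_csv accepts.
params = inspect.signature(pd.read_csv).parameters
print(len(params), "parameters")
print("skiprows" in params)

# Uncomment to print the full docstring, similar to the notebook pop-up.
# help(pd.read_csv)
```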
I run the cell and now I want to look at the results stored in that DataFrame. We seem to have a problem here. The first couple of rows don't seem to make sense. This is not what we want, so let's head back to our CSV file to try and understand why this might be the case. If we head back to the CSV file, you can see that we don't want to read the first couple of rows into our DataFrame. The table of information that we want really starts from here, so we want to skip those first four rows and read from row five onwards.
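The problem can be reproduced with a small stand-in for the file. The text below is hypothetical: only the athlete name HAJOS, Alfred comes from the video, and the four note lines, the column names, and the other values are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV: four lines of notes, then the real table.
raw = (
    "Sample medallist export,,\n"
    "Illustrative data only,,\n"
    "Not the course file,,\n"
    "Source: hypothetical,,\n"
    "City,Edition,Athlete\n"
    'Athens,1896,"HAJOS, Alfred"\n'
    'Athens,1896,"EXAMPLE, Athlete"\n'
)

# With no skiprows, the first note line is treated as the header row.
oo = pd.read_csv(io.StringIO(raw))
print(oo.head())
```

The column headings come from the first note line rather than from the table, which is why the first few rows do not make sense.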
So we'll head back to our notebook. We can provide the parameter skiprows=4. We now see that this is what we would expect to see in our DataFrame. It starts at the start of the table in the CSV file, so the first athlete is HAJOS, Alfred. We go back to the CSV file, and in row six we see HAJOS, Alfred, so we know that we are good to go. In the next video, we will look at the shape attribute.
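As a sketch of the fix and the validation step, the snippet below skips four leading lines and checks the first athlete. The sample text is hypothetical: only skiprows=4 and the name HAJOS, Alfred come from the video, and the column names and other values are assumed for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV: four lines of notes, then the real table.
raw = (
    "Sample medallist export,,\n"
    "Illustrative data only,,\n"
    "Not the course file,,\n"
    "Source: hypothetical,,\n"
    "City,Edition,Athlete\n"
    'Athens,1896,"HAJOS, Alfred"\n'
    'Athens,1896,"EXAMPLE, Athlete"\n'
)

# Skip the four note lines so that line five becomes the header row.
oo = pd.read_csv(io.StringIO(raw), skiprows=4)
print(oo.head())

# Validate the input: the first athlete should match row six of the file.
first_athlete = oo.loc[0, "Athlete"]
print(first_athlete)
```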