Learn about reading CSV files in pandas, accessing DataFrame rows and columns, selecting ranges of data, and using pandas datetime objects.
- [Instructor] In this video, we look at importing a DataFrame into Python using the powerful pandas library. We'll also look at basic pandas functionality. For a refresher, you can see my course, Python: Data Analysis, as well as several other courses in the Lynda LinkedIn library. I've opened up the exercise file for this video. The file planets.csv, which we created in the last video, is found in the same directory. So we read it into a pandas DataFrame using the pandas function read_csv.
First, though, we import some libraries. Here, I'm using the auto-completion feature of the Jupyter notebook interface, which is accessed using the tab key. There are many other reader functions in pandas, which can read JSON, HTML, Excel, HDF, Stata, SAS, SQL and more.
You should see the pandas manual for that. So let's display our DataFrame in a notebook. It looks smart and clean. read_csv is a very sophisticated function that can handle subsets of tables, missing data, errors, special instructions for parsing dates, indices and much more. For instance, we could just select a subset of columns to use from the CSV file.
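As a minimal sketch of these two calls, the cell below reads a small table in full and then again with usecols. The exercise file is not reproduced here, so the sketch assumes a hand-typed stand-in for planets.csv held in a string, and the column names (Planet, Mass, Diameter, FirstVisited) and values are illustrative assumptions rather than the file's actual contents.

```python
import io

import pandas as pd

# Stand-in for planets.csv: column names and values are assumptions
# made for this sketch, not the contents of the exercise file.
csv_text = """Planet,Mass,Diameter,FirstVisited
Mercury,0.330,4879,1974-03-29
Venus,4.87,12104,1962-12-14
Earth,5.97,12756,
Moon,0.073,3475,1959-09-12
Mars,0.642,6792,1965-07-15
"""

# Read every column; with a real file you would pass its path instead.
planets = pd.read_csv(io.StringIO(csv_text))

# Read only a subset of the columns with the usecols option.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["Planet", "Mass"])

print(planets)
print(subset)
```

With the actual exercise file, the first argument would simply be the path 'planets.csv'.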
Let me copy the cell from above and add the option usecols. This way, we only get a subset of columns. We could also replace the column names, skip the header and much more. So let's look quickly at our DataFrame object. Basic indexing on the object with brackets selects columns, such as mass.
This yields a simpler pandas object, which is called a Series. We can also use the dot notation to select a column. A very important pandas notion is the index of the rows. Here, it was created for us with a basic numeric range, zero through nine. The end of the range at 10 is exclusive, as is normal in Python.
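Here is a short sketch of column selection and the default row index. It assumes a three-row table built by hand with columns named Planet and Mass (illustrative names), so the default range runs from zero through two rather than zero through nine.

```python
import pandas as pd

# Small hand-built table; the column names are assumptions for this sketch.
planets = pd.DataFrame({
    "Planet": ["Mercury", "Venus", "Earth"],
    "Mass": [0.330, 4.87, 5.97],
})

# Bracket indexing selects a column and returns a Series.
mass = planets["Mass"]

# Dot notation selects the same column.
mass_dot = planets.Mass

print(type(mass).__name__)

# The default row index is a numeric range whose stop value is exclusive.
print(planets.index)
```

Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so brackets are the safer general choice.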
We can use the index value to select a row. This is done with a special indexer loc, L-O-C. For instance, we look at Mercury. But perhaps, instead of numbers, it may be better to use the planet name as the index. We achieve that using the method set_index on the DataFrame object.
The result is the DataFrame indexed by planet name. Note that most operations in pandas result in copies of the DataFrame object and do not modify the original one. If we do want to modify it, we can add the keyword inplace and set it equal to True. We can find out how many rows there are in a DataFrame using the method info, or just by taking the length of the object.
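The steps above can be sketched as follows, again on a small hand-built table whose column names (Planet, Mass) are assumptions for illustration: select a row by its numeric index, set the index without and then with inplace, and count the rows.

```python
import pandas as pd

# Small hand-built table; the column names are assumptions for this sketch.
planets = pd.DataFrame({
    "Planet": ["Mercury", "Venus", "Earth"],
    "Mass": [0.330, 4.87, 5.97],
})

# Select the first row by its numeric index value.
first_row = planets.loc[0]
print(first_row)

# set_index returns a new DataFrame; the original keeps its numeric index.
by_name = planets.set_index("Planet")
print(planets.index)
print(by_name.index)

# With inplace=True the original object itself is modified.
planets.set_index("Planet", inplace=True)

# Two ways to find the number of rows.
planets.info()
n_rows = len(planets)
print(n_rows)
```

Assigning the result back, as in planets = planets.set_index('Planet'), is an equally common alternative to inplace.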
Since we have replaced the index with planet names, now we can get at individual rows using again the loc indexer with a planet name. Loc also accepts ranges which, however, are inclusive. I will write the range as a regular Python slice with a colon.
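A brief sketch of label-based selection with loc, assuming a hand-built table indexed by an illustrative Planet column. Note how the slice includes its end label, unlike a regular Python range.

```python
import pandas as pd

# Small hand-built table; the column names are assumptions for this sketch.
planets = pd.DataFrame({
    "Planet": ["Mercury", "Venus", "Earth", "Moon", "Mars"],
    "Mass": [0.330, 4.87, 5.97, 0.073, 0.642],
}).set_index("Planet")

# A single label returns one row as a Series.
mercury = planets.loc["Mercury"]
print(mercury)

# A label slice returns a DataFrame and includes BOTH endpoints.
venus_to_moon = planets.loc["Venus":"Moon"]
print(venus_to_moon)
```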
In pandas, the column names are also collected in an index object. We get at that using the attribute columns. Let me load the DataFrame again so we have access to all the columns that are present in the CSV file. And let's again set the index to planet names.
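To illustrate, the sketch below prints the columns attribute of a small hand-built table (the column names are assumptions). Once a column becomes the row index, it no longer appears among the columns.

```python
import pandas as pd

# Small hand-built table; the column names are assumptions for this sketch.
planets = pd.DataFrame({
    "Planet": ["Mercury", "Venus", "Earth"],
    "Mass": [0.330, 4.87, 5.97],
    "Diameter": [4879, 12104, 12756],
})

# Before setting the index, all three names are columns.
columns_before = list(planets.columns)

planets = planets.set_index("Planet")

# The column names are themselves stored in a pandas Index object.
print(planets.columns)
columns_after = list(planets.columns)
```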
In pandas, there can be multiple ways to obtain the same result. It is best to find one that makes the most sense to you so that you will remember it easily. To access one specific number, for instance, you can choose first the column, first the row or both at the same time, with a notation that resembles NumPy. Let me show you all three. Planets, column firstvisited, and then Mercury.
Or planets.loc, Mercury, and then variable firstvisited. Or again, planets, using the loc indexer, and then row, Mercury, and column, firstvisited. Speaking of which, the column firstvisited holds strings. Let's look at the Python type of this object.
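The three access patterns, followed by the type check, might look like the sketch below. The table is hand-built, and the column name FirstVisited and the date strings are illustrative assumptions, not values copied from the exercise file.

```python
import pandas as pd

# Small hand-built table; names and dates are assumptions for this sketch.
planets = pd.DataFrame({
    "Planet": ["Mercury", "Venus", "Moon"],
    "FirstVisited": ["1974-03-29", "1962-12-14", "1959-09-12"],
}).set_index("Planet")

# 1. Column first, then row.
by_column = planets["FirstVisited"]["Mercury"]

# 2. Row first (with loc), then column.
by_row = planets.loc["Mercury"]["FirstVisited"]

# 3. Row and column together, in a NumPy-like notation.
both = planets.loc["Mercury", "FirstVisited"]

print(by_column, by_row, both)

# The stored value is a plain Python string at this point.
print(type(both))
```

The third form is usually the one to prefer when assigning a value, since it addresses the cell in a single indexing step.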
Indeed, a string. In fact, pandas has a much smarter datetime object, which we should use whenever we can. We use the pandas function to_datetime to convert strings to datetimes. We can apply the function to an entire column. And we can assign the result to the column itself.
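A minimal sketch of that conversion, assuming the same kind of hand-built table with an illustrative FirstVisited column of date strings:

```python
import pandas as pd

# Small hand-built table; names and dates are assumptions for this sketch.
planets = pd.DataFrame({
    "Planet": ["Mercury", "Venus", "Moon"],
    "FirstVisited": ["1974-03-29", "1962-12-14", "1959-09-12"],
}).set_index("Planet")

# Convert the whole column of strings and assign it back to itself.
planets["FirstVisited"] = pd.to_datetime(planets["FirstVisited"])

# The column now has a datetime dtype, and each value is a pandas Timestamp.
print(planets["FirstVisited"].dtype)
moon_visit = planets.loc["Moon", "FirstVisited"]
print(type(moon_visit))
```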
Then, we can use datetime functions, which are accessed through dt, to do interesting things on the dates. For instance, we may isolate the year. And we may take differences between two years. This tells us, for instance, that the Moon was first visited 59 years ago. Although I've been using pandas for quite a while, I still find it counterintuitive at times.
That probably must be the case because it's such a powerful library. However, the documentation is very clear, there are excellent books, and you can always go to websites like stackoverflow.com, where pretty much any question about pandas has been asked and answered already.
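Returning to the dt accessor mentioned earlier, here is a sketch of isolating the year and taking a difference of years. The table is hand-built with illustrative names and dates, and the reference year 2018 is an assumption chosen so that the Moon's result matches the 59 years quoted above.

```python
import pandas as pd

# Small hand-built table; names and dates are assumptions for this sketch.
planets = pd.DataFrame({
    "Planet": ["Mercury", "Venus", "Moon"],
    "FirstVisited": ["1974-03-29", "1962-12-14", "1959-09-12"],
}).set_index("Planet")
planets["FirstVisited"] = pd.to_datetime(planets["FirstVisited"])

# The dt accessor exposes datetime properties for the whole column.
visit_year = planets["FirstVisited"].dt.year
print(visit_year)

# Difference between an assumed reference year and each visit year.
reference_year = 2018  # assumption, not stated in the narration
years_ago = reference_year - visit_year
print(years_ago)
```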
- Installing and setting up Python
- Importing and cleaning data
- Visualizing data
- Describing distributions and categorical variables
- Using basic statistical inference and modeling techniques
- Bayesian inference