Join Michele Vallisneri for an in-depth discussion in this video, Integrating missing data, part of Python: Data Analysis.
- We have successfully loaded temperature data into a NumPy record array. We can begin to play with it, but we see that there is missing data that needs to be integrated first. We'll do that by first marking it as not a number using NumPy Boolean masks. And then, by replacing it with interpolated values using the numpy.interp function. Let's go to the Python notebook. Let's grab the minimum and maximum temperatures here for "lihue" and let's try to plot one of them.
We notice immediately that there's something strange. It must be the -999.9 value associated with missing observations. Let's change that to something more representative, such as "numpy.nan" for not a number. We can do this by modifying the "getobs" function. We save the data to a variable and then use a NumPy Boolean mask to select only the values equal to -999.9 and then change only those to "nan." Reassign the "lihue_tmax" and "tmin" and plot again.
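The masking step described here can be sketched as follows. This is a minimal illustration on made-up numbers, not the course's "getobs" function; the helper name `mark_missing` and the sample values are assumptions introduced for the example.

```python
import numpy as np

MISSING = -999.9  # sentinel that the transcript says marks missing observations


def mark_missing(values, sentinel=MISSING):
    """Return a float copy of `values` with the sentinel replaced by NaN.

    Hypothetical helper for illustration only.
    """
    data = np.array(values, dtype=float)  # copy, so the input is left untouched
    mask = (data == sentinel)             # Boolean mask: True where data is missing
    data[mask] = np.nan                   # assign NaN only at the masked positions
    return data


# Made-up temperatures with two missing observations.
tmax = mark_missing([27.8, -999.9, 28.3, -999.9, 26.1])
```

Working on a copy keeps the raw observations available in case the sentinel needs to be inspected again later.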
That's better. The plotting ignores the "nan" values. So let's plot "tmax" and "tmin" together. Again, this makes sense.
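The transcript excerpt stops before the interpolation step announced at the start of the video. As a rough sketch of how numpy.interp could fill NaN gaps, assuming evenly spaced observations indexed by position (the helper name `fill_gaps` and the sample values are hypothetical, not taken from the course):

```python
import numpy as np


def fill_gaps(values):
    """Return a copy of `values` with NaN entries linearly interpolated.

    Hypothetical helper; assumes observations are evenly spaced, so the
    array position can serve as the x coordinate.
    """
    data = np.array(values, dtype=float)
    positions = np.arange(len(data))
    good = ~np.isnan(data)  # True where a real observation exists
    # Evaluate the line through the good points at the missing positions.
    data[~good] = np.interp(positions[~good], positions[good], data[good])
    return data


filled = fill_gaps([10.0, np.nan, np.nan, 16.0, 18.0])
```

Note that numpy.interp holds the first or last good value constant for gaps at either end of the array, so leading or trailing gaps are filled with a flat value and not extrapolated.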
Released
11/12/2015
- Writing and running Python in IPython
- Using Python lists and dictionaries
- Creating NumPy arrays
- Indexing and slicing in NumPy
- Downloading and parsing data files into NumPy and Pandas
- Using multilevel series in Pandas
- Aggregating data in Pandas
Skill Level Intermediate
Q: The course shows how to download files from FTP and web servers using Python 3.X. How do I do the same thing with Python 2.7?
A: First import urllib, then use urllib.urlretrieve(URL, filename). For instance, to download the stations.txt file used in the chapter 5 video “Downloading and parsing data files,” you’d do urllib.urlretrieve('ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt', 'stations.txt').
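If the same notebook has to run under both Python 2.7 and Python 3, one possible approach is to import urlretrieve from whichever location exists. The sketch below is not from the course; the function name `download_stations` is an assumption, and the URL is the one quoted in the answer above.

```python
try:
    from urllib.request import urlretrieve  # Python 3 location
except ImportError:
    from urllib import urlretrieve           # Python 2.7 location

STATIONS_URL = 'ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt'


def download_stations(filename='stations.txt', url=STATIONS_URL):
    """Download the station list to `filename` and return the local path.

    Hypothetical wrapper for illustration only.
    """
    local_path, _headers = urlretrieve(url, filename)
    return local_path


# Not called here because it needs network access:
# download_stations()
```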
Q. What are the issues with DataFrame.sort()?
A: Since Pandas version 0.18, the DataFrame method sort() was removed in favor of sort_values(). Unlike sort(), the new method does not sort records in place unless it is given the option "inplace=True". The following lines of code in the video need changing:
- In Chapter 6: Introduction to Pandas/DataFrames in Pandas
  - twoyears = twoyears.sort('2015', ascending=False) -> twoyears = twoyears.sort_values('2015', ascending=False)
- In Chapter 7: Baby names with Pandas/A yearly top ten
  - allyears_indexed.loc['M',:,2008].sort_values('number', ascending=False).head()
  - pop2008 = allyears_indexed.loc['M',:,2008].sort_values('number', ascending=False).head()
  - def topten(sex,year):
  - simple = allyears_indexed.loc[sex,:,year].sort_values('number', ascending=False).reset_index()
- In Chapter 7: Baby names with Pandas/Name Fads
  - [in addition to lines above, which are used to initialize the "name fads" computation]
  - spiky_common = spiky_common.sort_values(ascending=False)
  - spiky_common = spiky_common.sort_values(ascending=False); spiky_common.head(10)
- In Chapter 7: Baby names with Pandas/Solution
  - [in addition to lines above, which are used to initialize the "name fads" computation]
  - totals_both = totals_both.sort_values(ascending=False)
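The difference between reassigning the result and sorting in place can be seen on a small made-up frame. The data below is invented for illustration; only the variable name `twoyears` and the column name '2015' echo the course code.

```python
import pandas as pd

# Made-up data standing in for the course's two-year table.
twoyears = pd.DataFrame(
    {'2014': [3, 1, 2], '2015': [30, 10, 20]},
    index=['a', 'b', 'c'],
)

# sort_values returns a new, sorted DataFrame ...
sorted_copy = twoyears.sort_values('2015', ascending=False)

# ... and leaves the original row order untouched.
order_before = list(twoyears.index)

# Passing inplace=True modifies the original instead (and returns None).
twoyears.sort_values('2015', ascending=False, inplace=True)
order_after = list(twoyears.index)
```

Reassigning, as in the corrected lines above, tends to be easier to follow than inplace=True because each statement shows which object holds the sorted data.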
Q. What are the issues with Pandas categorical data?
A. Since version 0.6, seaborn.load_dataset converts certain columns to Pandas categorical data (see http://pandas.pydata.org/
Q. What are the issues with matplotlib.pyplot.stackplot?
A. In recent versions of matplotlib, the function matplotlib.pyplot.stackplot now throws an error if given the keyword argument "label". This problem occurs in the "Baby names with Pandas/Name popularity" exercise file, and it can be ignored. In the video, matplotlib does not complain, but nevertheless shows no legend for the plot. The tutorial moves on to show how to make a legend using matplotlib.pyplot.text.
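As an alternative to the text-based legend used in the tutorial, stackplot accepts a plural "labels" keyword that works with a regular legend. The sketch below uses invented counts and placeholder names, not the course's baby-name data, and selects the non-interactive Agg backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Invented yearly counts for two placeholder names.
years = np.arange(2000, 2005)
counts_a = np.array([5, 6, 8, 7, 4])
counts_b = np.array([2, 3, 3, 5, 6])

fig, ax = plt.subplots()
# Note the plural keyword: one label per stacked series.
polys = ax.stackplot(years, counts_a, counts_b, labels=["name A", "name B"])
legend = ax.legend(loc="upper left")

legend_labels = [text.get_text() for text in legend.get_texts()]
```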