Join Michele Vallisneri for an in-depth discussion in this video, Downloading and parsing data files, part of Introduction to Data Analysis with Python.
- We'll start by loading data about the weather stations. I will show you how to download a file over FTP, using Python. And I will show you how to parse a space-separated text file into a Python dict that we can use later. Let's navigate to the location in the filesystem where you can find our exercise files. All the work that we do in this chapter will flow from one video to the next, so I really encourage you to watch them in sequence. We select the exercise file for this first video, 05_02_stations_begin, which starts as an empty notebook.
We begin by loading a few basic packages: NumPy, of course, which we will nickname np, and matplotlib, since we wish to plot data. We also import seaborn, an extension to matplotlib that improves its default plot formatting and also implements additional plot types. Finally, we instruct the Python notebook to keep all the plots inline.
The weather data that we'll be using are found at the website of the NOAA National Centers for Environmental Information. The Climate Data Online service provides access to an archive of global historical weather and climate data. Specifically, we will use data from the GCOS Surface Network, a global reference network of observation stations. We start by downloading a text file that contains an annotated list of stations in the network. I've included copies of all the files that we need in your exercise files.
However, for this first one, I will also show you how to download it yourself. We could do it with an FTP client or a web browser, but let's use Python. Specifically, we can use the Python standard module, urllib, which we need to import. The function that we need is urllib.request.urlretrieve. We then need the full address of the file that we want to download. The second argument for this function is the local name of the file.
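As a rough sketch of this download step: the wrapper name download is my own and not something from the course, and the address is the one quoted in the course FAQ, which may well have moved since the recording.

```python
import urllib.request

# Address as quoted in the course FAQ; the server may no longer be available.
STATIONS_URL = "ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt"


def download(url, filename):
    """Fetch url and save it under filename (hypothetical helper).

    urlretrieve returns a (local_path, headers) pair; we keep only the path.
    """
    local_path, _headers = urllib.request.urlretrieve(url, filename)
    return local_path
```

Calling download(STATIONS_URL, 'stations.txt') would then save the station list next to the notebook under the local name stations.txt.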
Good, in this case, no news is good news. The file has been downloaded. This function works in Python 3. If you're using Python 2.7, please see the FAQ for this course. Let's have a look at the first lines of the file. We'll open it for reading, use readlines, and use slicing to select the first few items in the list. What we see is a station code in each line, followed by what may be a geographical location, probably latitude and longitude, by what could be a height above sea level in meters, and by a name.
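Peeking at the top of the file could look something like this. The function name head and the default of ten lines are my own choices for illustration, not something the course specifies.

```python
def head(filename, n=10):
    """Return the first n lines of a text file as a list of strings."""
    with open(filename, "r") as f:
        # readlines gives a list of all lines; slicing keeps the first n.
        return f.readlines()[:n]
```

In the notebook, head('stations.txt') would show the first ten raw lines, newline characters included.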
Some stations are tagged as GSN. That's the GCOS Surface Network. We'll concentrate on those. So let's gather some data from this text file. We go through and read all the lines. We skip those that do not have the GSN keyword. And we just collect the station names in a dictionary indexed by the station code. Remember that in Python, it's possible to iterate through an open file, which just returns the lines, one by one. We check if the string GSN is included in the line.
And if it is, we take the line, split it, which by default splits on whitespace, and assign the resulting fields to a Python list. We will then use the first item in the list as the key for our dictionary, and the fifth and all following items for the name of the station. Another use for the slicing operator. And let's join those strings, this time using a space between them.
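Put together, the parsing loop might be sketched as follows. The function name parse_gsn_stations is hypothetical, and the two sample lines are made up to mimic the layout described above; they are not real records from the NOAA file.

```python
def parse_gsn_stations(lines):
    """Build a {station_code: name} dict from lines that mention GSN."""
    stations = {}
    for line in lines:
        if "GSN" in line:
            fields = line.split()  # split on any run of whitespace
            # First field is the code; fifth field onward forms the name.
            stations[fields[0]] = " ".join(fields[4:])
    return stations


# Made-up sample lines in the same general layout (not real data).
sample = [
    "AA000000001  10.0000  20.0000  100.0    EXAMPLE TOWN   GSN  12345\n",
    "AA000000002  11.0000  21.0000  200.0    OTHER PLACE\n",
]
stations = parse_gsn_stations(sample)
print(stations)  # {'AA000000001': 'EXAMPLE TOWN GSN 12345'}
```

Because an open file can be iterated line by line, the same function also accepts a file object, for instance parse_gsn_stations(open('stations.txt')). Note that everything after the fourth field ends up in the name, including the GSN tag itself.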
How many stations? 994. We should really concentrate on just a few. So let's write a function that lets us look for interesting patterns in the station name. We'll call it find station. And let's build a dictionary using a comprehension of the station codes and names where the pattern that we're interested in is found within the name. Let's just print it.
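A search helper along these lines could be written as below. This is only a sketch: the exact spelling find_station and the choice to pass the dictionary in as an argument, rather than reading a notebook-level variable, are my assumptions.

```python
def find_station(pattern, stations):
    """Print and return the stations whose name contains pattern."""
    found = {code: name for code, name in stations.items() if pattern in name}
    print(found)
    return found
```

The match is a plain, case-sensitive substring test, so the pattern has to be written with the same capitalization that the file uses for station names.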
For instance, I really like Hawaii. Let's see if we can find a station there. Lihue is a town in Kauai. Indeed, it's there. Let's look for San Diego. A cold place, maybe Minneapolis. And an even colder place. Let's go to Siberia. Throughout the rest of this chapter, we'll look at data from these four stations. I will collect our codes in a Python list. Sometimes, copy and paste, rather than writing code, is the quickest way forward.
This concludes our work for this first video.
- Writing and running Python in IPython
- Using Python lists and dictionaries
- Creating NumPy arrays
- Indexing and slicing in NumPy
- Downloading and parsing data files into NumPy and Pandas
- Using multilevel series in Pandas
- Aggregating data in Pandas
Skill Level Intermediate
Q: The course shows how to download files from FTP and web servers using Python 3.X. How do I do the same thing with Python 2.7?
A: First <span style="font-family: Courier;">import urllib</span>, then use <span style="font-family: Courier;">urllib.urlretrieve(URL,filename)</span>. For instance, to download the stations.txt file used in the chapter 5 video “Downloading and parsing data files,” you’d do <span style="font-family: Courier;">urllib.urlretrieve('ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt','stations.txt')</span>.
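If the same notebook has to run under both versions, one option, offered here as an assumption rather than something the course prescribes, is to try the Python 3 import first and fall back to the Python 2.7 location.

```python
try:
    # Python 3: the function lives in the urllib.request submodule.
    from urllib.request import urlretrieve
except ImportError:
    # Python 2.7: the function lives directly in urllib.
    from urllib import urlretrieve
```

After this, urlretrieve(URL, filename) can be called the same way in either version.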