Data is frequently presented in raw data files called CSV files. There's also a lot of data you don't need. Learn how these files can be opened using Haskell.
- [Instructor] Welcome to descriptive statistics. Descriptive statistics are used to summarize a collection of values into one or two values. We begin with learning about the Haskell Text.CSV library. In future videos we will cover, in increasing difficulty, the range, mean, median, and the mode, and you've probably heard of some of these descriptive statistics before. They're quite common. So in this video we're going to cover the basics of the CSV library and how to work with CSV files.
So in this video we're going to take a closer look at the structure of a CSV file, how to install the Text.CSV Haskell library, and how to retrieve data from a CSV file from within Haskell. Now to begin, we need a CSV file. So I'm going to Tab over to my Haskell environment. Which is just a Debian Linux virtual machine running on my computer. And I'm going to go to the website retrosheet.org. Retrosheet.org is a website for baseball statistics and I'm going to use them to demonstrate the CSV library.
Now if we find the link for data downloads and click Game logs, and scroll down just a little bit, we have game logs for every single season going all the way back to 1871. For now I would like to stick with the most recent complete season. Which is 2015. So go ahead and click that link. We will have the ability to download a Zip file. So go ahead and click OK. And I'm going to Tab over to my terminal.
Let's go into the downloads folder, and if I run ls I see that there's our Zip file. Let's unzip that file and see what we have. Let's open up GL2015.TXT. This is a CSV file. A CSV file is a file of comma separated values. So you'll see that each line in this file is a record, each record represents a single game of baseball in the 2015 season, and every single record is a listing of values separated by commas.
So the very first game in this dataset is a game between the Saint Louis Cardinals, that's the SLN, and the Chicago Cubs, that's the CHN, and this game took place on April fifth 2015. The final score of this game was three to zero. And every line in this file is a different game. So I'm going to scroll down and just demonstrate that we have many many games represented in this file, and lots of information that we could look at.
So I'm back up at the top of the file. Now CSV isn't a standard, but there are a few properties of a CSV file which I consider to be safe. Consider these my suggestions. Now a CSV file should keep one record per line. The first line should be a description of each column. In a future video I'm going to tell you that we need to remove this header line, and you'll see that this particular file doesn't have this header line. I still like to see the description line for each column of values. If a field in a record includes a comma, then that field should be surrounded by quote marks.
Now we don't see an example of this, at least on this first line, but we do see examples of many values having quote marks surrounding the fields, such as the very first value, the date. In a CSV file, surrounding a field with quote marks is optional unless the field has a comma inside it. While we're here I would like to make a note of the 10th column in this file, which contains the number three on this particular row.
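As a rough sketch of the quoting rule just described, the snippet below parses a small hand-written CSV string with parseCSV from Text.CSV. The sample text and the names sampleText and sampleRows are made up for illustration; they are not taken from the game log.

```haskell
import Text.CSV (CSV, parseCSV)

-- Hypothetical sample: a header line, then one record whose second
-- field contains a comma and is therefore wrapped in quote marks.
sampleText :: String
sampleText = "date,city,runs\n\"20150405\",\"Chicago, IL\",3"

-- parseCSV takes a source name (used only in error messages) and the text.
-- A parse failure is collapsed to an empty list to keep the sketch short.
sampleRows :: CSV
sampleRows = either (const []) id (parseCSV "sample" sampleText)
```

If the sketch behaves as intended, sampleRows holds two records, and the middle field of the second record is the single string Chicago, IL with its comma intact and its quote marks removed.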
That represents the away team's score in every single record of this file. Make a note that our first value in the 10th column is a three. We're going to come back to that later on. Let me get out of my file. Our next task is installing the Text.CSV library, and we do that using the Cabal tool, which connects with the Hackage repository and downloads the Text.CSV library. The command for that is cabal install csv.
It takes a moment to download the file, but it's going to download and install the Text.CSV library in our home folder. Now let me describe what I have currently in my home folder. I like to create a directory for my code called code, and inside here I have a directory called Haskell data analysis, and inside Haskell data analysis I have two directories called analysis and data. In the analysis folder I would like to store my notebooks.
In the data folder I would like to store my datasets. That way I can keep a distinction between analysis files and data files. That means I need to move the data file that I just downloaded into my data folder. So copy GL2015.TXT from our downloads folder into our data folder. If I do an ls on my data folder, I'll see that I've got my file. I'm going to go into my analysis folder.
Which apparently contains nothing, and I'm going to start the jupyter notebook. Now jupyter is spelt J-U-P-Y-T-E-R and what the jupyter notebook does is it starts a web server on your computer and it uses your web browser in order to interact with Haskell. And the address for the jupyter notebook is localhost at port 8888. Now I'm going to create a new Haskell notebook. So I click over here on the New command on the right side of the screen and I find Haskell.
Let's begin by renaming our notebook baseball, 'cause we're going to be looking at baseball statistics. I need to import that Text.CSV library that we just installed. In order to submit an expression to the jupyter environment we need to hit Shift Enter on the keyboard. Now if you're just in a box and you hit Enter you're just making that text field larger. You hit Shift Enter in order to submit expressions. So now that we've imported Text.CSV, let's create our baseball dataset and parse the dataset.
The command for that is parseCSVFromFile and then we pass in the location of our text file. Great, and if you didn't get a file not found error at this point, that means you have successfully parsed the data from the CSV file. Let's explore the type of the baseball data. So to do that we enter :type baseball, which is what we just created, and we see that we have either a parsing error or a CSV.
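What gets entered into the notebook at this point might look roughly like the following. The relative path ../data/GL2015.TXT is an assumption based on the folder layout described earlier, and gameLogPath, loadGameLog and describeResult are hypothetical names. In the notebook itself you would more likely bind the result directly with baseball <- parseCSVFromFile followed by the path, and then ask for :type baseball.

```haskell
import Text.CSV (CSV, parseCSVFromFile)

-- Assumed location of the game log relative to the analysis folder.
gameLogPath :: FilePath
gameLogPath = "../data/GL2015.TXT"

-- Reads and parses the file. The result is an IO action producing
-- Either a parse error (Left) or the CSV data (Right); the signature
-- is left to inference so no parsec import is needed in this sketch.
loadGameLog = parseCSVFromFile gameLogPath

-- Hypothetical helper: summarise the result without printing every row.
describeResult :: Show e => Either e CSV -> String
describeResult (Left err)   = "parse error: " ++ show err
describeResult (Right rows) = "parsed " ++ show (length rows) ++ " rows"
```

Running loadGameLog and passing its result to describeResult is one way to confirm the file parsed without dumping the whole dataset to the screen.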
Now I've already done this so I know that there aren't any parsing errors in our CSV file, but if there were, they would be represented by a ParseError. So I can promise you if you've gotten this far, I know that we have a working CSV file. Now I'll be honest, I don't know why the CSV library does this, but the last element of the parsed CSV data is a record holding only a single empty field, and I call this the empty row. What I would like to do is to create a quick function called no empty rows.
That removes any row of data that doesn't have at least two pieces of information in it. So we use either: if we have a parsing error we're just going to return an empty list, or if we actually have data, we're going to filter out any row that does not have at least two pieces of information in that row. Now let's apply our no empty rows to our baseball dataset. I'm going to call this baseball list.
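A minimal sketch of the function just described could look like this. The spelling noEmptyRows is an assumed rendering of the spoken name, and the error type is left generic (e) so that the sketch does not need to import the parser's error type.

```haskell
import Text.CSV (CSV)

-- Drop the trailing "empty row" (and any other record with fewer than
-- two fields). A parse error on the Left becomes an empty list.
noEmptyRows :: Either e CSV -> CSV
noEmptyRows = either (const []) (filter (\row -> length row >= 2))
```

In the notebook, applying it would then be something like baseballList = noEmptyRows baseball, where baseballList is again an assumed spelling.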
Now we can do a quick check to see the length of the baseball list. We should have 2,429 rows representing the 2,429 games played in the 2015 season. Now let's look at the type of baseball list. And we see that we have a list of fields. Now you may be asking yourself, what's a field? Now we can explore a field using :info. And it's going to bring up a window from the bottom of the screen.
I'm going to scroll this up a little bit and it says type Field is equal to String and it's defined in this Text.CSV library. So just remember that a field is just a string. I'm going to close that. Now because every value is a field, which is a string, if I do math on strings that's going to produce an error message. So what I need to do is parse that information from a string to something which I can use, such as an Int or a Double. And I do that with the read function.
So if I say read "1", I can parse that as an integer. Or I can say read "1.5" and I can parse that as a double. So armed with this knowledge of parsing data from strings, we can parse a column of data. So I would like to create a function and we're going to call this function read index, and now I'm going to just say that each value is a cell in our case. So for each cell in our dataset we're going to pass in our original baseball dataset.
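The two read calls mentioned here can be sketched as follows. The names one, onePointFive and total are made up for illustration.

```haskell
-- read needs to know the target type, supplied here by the signatures.
one :: Int
one = read "1"

onePointFive :: Double
onePointFive = read "1.5"

-- Once parsed, ordinary arithmetic works; this evaluates to 2.5.
total :: Double
total = fromIntegral one + onePointFive
```

Note that read raises a runtime error when the string cannot be parsed as the requested type. readMaybe from Text.Read returns a Maybe instead, which may be a safer choice for messier files.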
That is an either, and we're going to say that we need an int index position into our list and we're going to return a list of cells, and this requires two arguments: the CSV and the index position that we need. And we are going to map over each record and we are going to read whatever exists at the specified index position. We also need the no empty rows that we discussed earlier.
Good. Now if you recall earlier, I said that the away team scores in our CSV file exist in column 10, and because Haskell lists are zero-indexed, that means we need to pass in index nine to our read index function. So read index baseball nine. And we're going to parse this list that's returned as a list of integers, and there we have a listing of every single away team score in major league baseball, and the very first element in our list is a three, because that is the first record of the file.
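One way the function described above might be written is sketched below. The spellings readIndex and noEmptyRows are assumed renderings of the spoken names, the error type is left generic, and list indexing with (!!) is just one possible choice for picking out a column; the video's actual code is not shown in this transcript.

```haskell
import Text.CSV (CSV)

-- Repeated here so the sketch is self-contained.
noEmptyRows :: Either e CSV -> CSV
noEmptyRows = either (const []) (filter (\row -> length row >= 2))

-- Read one column of every record, converting each field with read.
-- (!!) is zero-based, so column 10 of the file is index 9.
readIndex :: Read cell => Either e CSV -> Int -> [cell]
readIndex csv index =
  map (\record -> read (record !! index)) (noEmptyRows csv)
```

In the notebook this would be called as readIndex baseball 9 :: [Int], with the annotation telling read to produce integers. The sketch fails at runtime if a record is shorter than the requested index or if a field is not a valid number, so it relies on the file being as regular as this game log.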
So in this video you learned about the structure of a CSV file, you learned how to install the Text.CSV library, and you learned how to pull a little bit of information out of that CSV file using the CSV library. So in our next video we're going to discuss how to create our own module for descriptive statistics and how to write a function for the range of a dataset.
Note: This course was created by Packt Publishing. We are pleased to host this training in our library.
- Data ranges, means, and medians
- Standard deviation
- SQLite3 command line
- Slices of data
- Regular expressions
- Atoms and modifiers
- Character classes
- Line plots of a single variable
- Plotting a moving average
- Feature scaling
- Scatter plots
- Normal distribution
- Kernel density estimation (KDE)