Data is frequently presented in raw data files called CSV files. As you saw in our previous section, CSV files can be large and troublesome to work with. A much better option is a SQL-based format, which allows us to query our data in sections rather than working with entire data dumps.
- [Narrator] SQLite3 is a file format for storing data in the same spirit as a relational database, such as Oracle, MySQL, MariaDB, or Postgres. While I suppose that you could use SQLite3 as a traditional database engine, it's really not made for that. SQLite3 allows us to open up a file, work with the data in that file using SQL statements, and then close that file. One file contains one database, and a database can store multiple tables of information.
Contrast this with the CSV file format, which we used in the last section: a CSV file, at best, stores a single table of information. The advantages of SQLite3 are that it doesn't require any server-side programs running in the background, there are no configurations to manage, and the executable for working with SQLite3 is a single file. So, in my opinion, working with data in the SQLite3 file format is more versatile than working with data in the CSV file format.
In this section, we'll focus on how to get our data from CSV into SQLite3. Then, we'll train ourselves to think of our dataset in terms of a database, using SQL statements to fetch data. This requires an understanding of SELECT queries. We'll also need to grab some code for translating the output of our SELECT queries into Haskell, and we'll demonstrate how to use our data summary functions from section one with our database. So this video is going to be a primer on SQLite3.
There's actually not going to be any Haskell code in this video; we're going to take a moment and translate a CSV file into SQLite3. In this video, we're going to introduce SQLite3, create a table in an SQLite3 database, and add a CSV file to that table. So from here, I would like to move over to my virtual machine for our Haskell environment.
I already have my web browser open. So I've got the Google website opened up, and what I would like to do from Google is to search for usgs earthquake feed csv. USGS is the United States Geological Survey, and they keep a database of every earthquake that takes place on the planet. And they offer this data as a CSV file. So we're going to click that very first link; you should see spreadsheet format at the top of the first link.
And I'm going to scroll down to the heading over here that says past 7 days, and I want to find all earthquakes. And I'll have the option to download the all_week feed, for all earthquakes that have happened in the past seven days, so let's save that file. Let's go over to our terminal, and I'm going to do ls, and I see that we have, in our Downloads folder, an all_week.csv file. That's the file that I just downloaded, and what I would like to do is to convert this file to an SQLite3 database.
Now let's take a look at the file. You can see that this particular file has a header line, and from that header we can read the names of all of our columns; the first five pieces of information on every record will be the time, latitude, longitude, depth, and magnitude. There's a lot of other information here that we could go through, but for our purposes, I just want those first five columns. What we can do in order to narrow this dataset down to just those first five columns is to use a command called cut.
Cut is a very useful command for working with CSV files. We have to pass it a delimiter, and the delimiter is going to be a comma, because this is a comma-separated value file, and we need to tell it which fields we would like to keep from our file. I said that we want the first five, so we type in 1,2,3,4,5, separated by commas. Then I pass in my input file name and redirect the output to a new file, which I'm going to name earthquakes.csv. Now, that file, as you recall, had a header line.
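The cut invocation described above can be sketched like this; the sample rows are invented for illustration, but the column layout follows the feed's header:

```shell
# Hypothetical sample in the USGS feed layout: a header line plus one
# record, with the trailing columns truncated for brevity.
printf 'time,latitude,longitude,depth,mag,magType\n' > all_week.csv
printf '2023-01-01T00:00:00Z,38.80,-122.82,2.1,1.3,md\n' >> all_week.csv

# -d ',' sets the delimiter; -f 1,2,3,4,5 keeps the first five fields
# (time, latitude, longitude, depth, magnitude).
cut -d ',' -f 1,2,3,4,5 all_week.csv > earthquakes.csv

cat earthquakes.csv
```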
And we need to chop that header line out of the file. There is another tool in Linux for this, called tail. I type in tail -n +2 earthquakes.csv and redirect the output to a temporary file; that e.tmp file will be our original earthquakes.csv, but without the header line. Now I can move my temp file over my earthquakes file, and this is the primed dataset that I would like to use, so let's take a look at earthquakes.csv.
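The header-stripping step might look like this; the three sample lines are placeholders:

```shell
# A stand-in earthquakes.csv: one header line plus two invented records.
printf 'time,latitude,longitude,depth,mag\n' > earthquakes.csv
printf '2023-01-01T00:00:00Z,38.8,-122.8,2.1,1.3\n' >> earthquakes.csv
printf '2023-01-01T01:00:00Z,19.4,-155.3,1.0,2.2\n' >> earthquakes.csv

# tail -n +2 prints the file starting at line 2, i.e. without the header.
tail -n +2 earthquakes.csv > e.tmp

# Overwrite the original with the headerless version.
mv e.tmp earthquakes.csv

wc -l < earthquakes.csv
```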
This is all of the earthquake data over the past seven days; I can see that there are 1,585 earthquakes represented in this dataset. And we have the first five columns of information: time, latitude, longitude, depth, and magnitude. Next, I need to install SQLite3 and the SQLite3 development library. On Debian, the command for this is sudo apt-get install sqlite3 libsqlite3-dev.
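To get a record count like the one mentioned above, wc works well once the header line is gone; the sample file here is fabricated:

```shell
# Three placeholder records, one per line, no header.
printf 'r1\nr2\nr3\n' > earthquakes_sample.csv

# With no header line, the line count equals the record count.
wc -l < earthquakes_sample.csv
```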
Great, now that SQLite3 is installed on my system, I want to create my first SQLite3 database. Typically, SQLite3 database files end with the extension .sqlite3, and since this is a database from the US Geological Survey, I'm going to call my database usgs.sqlite3. So I run sqlite3 usgs.sqlite3, which creates the database file on our hard drive and drops us into the SQLite3 shell.
There's nothing in that file yet; there are no tables, so what we need to do next is create that first table. I already have the command in my clipboard for creating that first table, and I want to describe what this command does. We're going to create a table called earthquakes, and I'm going to give it five columns matching our file, named time, latitude, longitude, depth, and magnitude. The first field is going to be of type TEXT, so I put the word TEXT after time, but the remaining four fields are all floating point numbers, so I use the FLOAT type for those four columns.
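The statement described above would read roughly as follows; the column names and types come straight from the narration, run here non-interactively so the whole step fits in one snippet:

```shell
# Create the earthquakes table: one TEXT column and four FLOAT columns.
sqlite3 usgs.sqlite3 <<'SQL'
CREATE TABLE earthquakes (
    time      TEXT,
    latitude  FLOAT,
    longitude FLOAT,
    depth     FLOAT,
    magnitude FLOAT
);
SQL

# .tables lists the tables now present in the database.
sqlite3 usgs.sqlite3 '.tables'
```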
This has now created the table in our usgs.sqlite3 database file, but it doesn't put any of the information in that table. What we would like to do is import our data from CSV into SQLite3. We need to tell SQLite3 to read CSV, and the command for that is .mode csv. Next up, we need to perform the actual import. The command is .import, then I type in earthquakes.csv, because that was my earthquakes CSV file, and then we need to tell it the name of the table, and I named my table earthquakes.
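Put together, the import steps can be scripted like this; the single data row is invented for illustration:

```shell
# One made-up record in the five-column layout.
printf '2023-01-01T00:00:00Z,38.8,-122.8,2.1,1.3\n' > earthquakes.csv

# Create the table if needed, switch to CSV mode, and import the file.
sqlite3 usgs.sqlite3 <<'SQL'
CREATE TABLE IF NOT EXISTS earthquakes (
    time TEXT, latitude FLOAT, longitude FLOAT,
    depth FLOAT, magnitude FLOAT
);
.mode csv
.import earthquakes.csv earthquakes
SQL

# Confirm the rows arrived.
sqlite3 usgs.sqlite3 'SELECT COUNT(*) FROM earthquakes;'
```

Because the table already exists and our file has no header line, every line of the CSV is imported as a data row.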
Great, if you didn't get any error messages, good job. This means that you have created and imported all of the data from our past-seven-days USGS earthquake feed into an SQLite3 database. What I can do now is run a SELECT query on it. So let's do a really simple SELECT query: SELECT * FROM earthquakes LIMIT 5; This means: give me five records from our database, I really don't care which records they are, and give me all columns from those records.
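The same query can be run non-interactively against a throwaway database; the two rows here are invented:

```shell
# Build a small demo database with two invented rows.
rm -f demo.sqlite3
sqlite3 demo.sqlite3 <<'SQL'
CREATE TABLE earthquakes (time TEXT, latitude FLOAT, longitude FLOAT,
                          depth FLOAT, magnitude FLOAT);
INSERT INTO earthquakes VALUES ('2023-01-01T00:00:00Z', 38.8, -122.8, 2.1, 1.3);
INSERT INTO earthquakes VALUES ('2023-01-01T01:00:00Z', 19.4, -155.3, 1.0, 2.2);
SQL

# All columns, at most five rows, in no guaranteed order.
sqlite3 demo.sqlite3 'SELECT * FROM earthquakes LIMIT 5;'
```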
So those are the typical results I see. Now, in order to get out of this interface, I hit CTRL+D on the keyboard. So great, if you've gotten this far, you should see a new file in your folder called usgs.sqlite3, and it will have a little bit of information in it. If it's not the same exact size as mine, don't worry; there will be a different number of earthquakes happening around the Earth depending on when you view this video, so the size of your file will more than likely be different from mine.
But you should have well over a thousand earthquakes in that database. So let's pretend that we're working with multiple CSV files. We can repeat our steps from this video, creating a table in our SQLite3 database for each CSV file, and then importing each CSV file into that database. This allows us to package many tables of information into a single file that can be queried through a clean, consistent interface. So SQLite3 has lots of advantages, but it's not enough; we need a language like Haskell for serious number crunching.
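The repeat-per-file idea can be sketched as a loop. This is a hypothetical sketch, assuming every CSV file shares the same five-column layout; the file names and contents are made up:

```shell
mkdir -p feeds && cd feeds

# Two invented feeds with the same five-column layout.
printf '2023-01-01T00:00:00Z,38.8,-122.8,2.1,1.3\n' > week1.csv
printf '2023-01-08T00:00:00Z,19.4,-155.3,1.0,2.2\n' > week2.csv

# One table per CSV file, named after the file, all in one database.
for f in *.csv; do
    table="${f%.csv}"
    sqlite3 usgs.sqlite3 <<EOF
CREATE TABLE IF NOT EXISTS "$table" (time TEXT, latitude FLOAT,
    longitude FLOAT, depth FLOAT, magnitude FLOAT);
.mode csv
.import $f $table
EOF
done

sqlite3 usgs.sqlite3 '.tables'
```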
In our next video, we will be discussing how to work with SQLite3 and Haskell.
Note: This course was created by Packt Publishing. We are pleased to host this training in our library.
- Data ranges, means, and medians
- Standard deviation
- SQLite3 command line
- Slices of data
- Regular expressions
- Atoms and modifiers
- Character classes
- Line plots of a single variable
- Plotting a moving average
- Feature scaling
- Scatter plots
- Normal distribution
- Kernel density estimation (KDE)