Join Joseph Lowery for an in-depth discussion in this video Loading data, part of Up and Running with Google Cloud Platform (2013).
In this lesson, we'll create a dataset to be used by BigQuery. I've compiled a list of the most popular baby names, as recorded by the US government from 2002 to 2012, for our demo. Let's start by looking at the CSV file in Excel. You can find the file in the Chapter 3 > 0302 > Assets folder of the exercise files. It's called babynames.csv. As you can see, it's a pretty simple file. There are four columns: name, gender, frequency, and year.
If you look down at the bottom of the screen, you can see that there are 366,322 rows; maybe not massive, but pretty darn big. Okay, let's switch to the Google Cloud Console and click on our project name. Now, click on BigQuery. On the BigQuery page, click on the down arrow next to the project name on the left, and choose Create new dataset. First, we'll give the dataset an ID. Use only numbers, letters, and underscores for this.
I'm going to enter baby_names_lynda, and then click OK. You'll see the dataset appear under the project name. Hover over the name until you see a plus symbol and a down triangle. Click the plus to bring up the Create and Import dialog box. Now, we'll add a table ID. Again, only letters, numbers, and underscores. I'll enter bn_2002_2012, for baby names from 2002 to 2012.
When you have that, click Next. Next up, we choose the source of the data. Ours is a CSV file, which happens to be the default. We also have the option of importing a JSON file or using an App Engine datastore backup. Let me click Choose File, and I'll navigate to my exercise files. I'll go into Chapter 3 > 0302 > Assets, and then choose babynames.csv. If you want to look at the entire ZIP of baby names, they're organized by year and go back to 1880. Those are also included for your own use and play. Be warned, though: you can waste many hours of precious time investigating your own birth name and those of others.
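Before uploading a file like this, it can be worth a quick local sanity check that the header and column count match what you're about to tell BigQuery. Here's a minimal sketch using Python's standard csv module; the two data rows are hypothetical stand-ins for illustration, not values from the real file:

```python
import csv
import io

# Hypothetical two-row stand-in for babynames.csv (the real file has
# 366,322 rows; these frequency values are made up for illustration).
sample = """name,gender,frequency,year
Emma,F,20799,2012
Jacob,M,18899,2012
"""

reader = csv.reader(io.StringIO(sample))
header = next(reader)          # the single header row BigQuery will skip
rows = list(reader)

# Confirm every data row has the same number of fields as the header.
assert all(len(row) == len(header) for row in rows)
print(header)                  # ['name', 'gender', 'frequency', 'year']
```

To check the real file, swap `io.StringIO(sample)` for `open('babynames.csv', newline='')`.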
All right, I have my file ready to upload. Now, I'll click Next. And now, it's time to define the schema. BigQuery needs to know the name of each column and its data type. The syntax is column name, colon, data type, comma, and then the next column name and data type. So, I'll enter the name of the first column, which is name, a colon, and then the data type, which is a string. Enter a comma, and we can enter the next name/type pair.
The next column name is gender, colon; this is also a string. Follow that with a comma, and then frequency, the number of times that baby name appeared in that year, colon; this is an integer, comma. And then our final column, year, colon. You could set this up either as an integer or a string; I'm going to put it in as an integer. Once you have that, click Next. The final panel allows us to define the field delimiter, which in our case is a comma, the default option.
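Typed out in full, the schema we just entered follows BigQuery's column:type,column:type syntax. A quick Python sketch showing the complete string, and how it breaks back down into (column, type) pairs:

```python
# The schema exactly as entered in the BigQuery UI: column:type pairs
# separated by commas.
schema = "name:string,gender:string,frequency:integer,year:integer"

# Split it back into (column, type) pairs to see the structure.
fields = [tuple(pair.split(":")) for pair in schema.split(",")]
for column, dtype in fields:
    print(f"{column} -> {dtype}")  # e.g. "name -> string"
```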
We can also include the number of header rows to skip. For us, that's going to be 1. I don't want to allow any errors, so I'll leave the number of errors allowed at 0. I'm ready to hit Submit. You'll see the table name appear under the dataset, and in a few moments, it will show a loading message. When it's finished loading, if all went well, you'll see a success message. Depending on the size of the file and your internet connection, it could take a moment or two. If the file upload completes and you don't see a success message, click on Job History; that will verify that the load has completed. If there was a problem, you'd see an error message here. Okay.
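For reference, the same load can be done from the command line with the bq tool in the Google Cloud SDK. This is a sketch, not something you need for the lesson: it assumes bq is installed and authenticated, that baby_names_lynda already exists, and that you run it from the folder containing the CSV. The flags mirror the UI options we just set (skip one header row, allow zero bad records):

```shell
# Load babynames.csv into the bn_2002_2012 table, supplying the same
# schema, header-skip, and error settings chosen in the web UI.
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  --max_bad_records=0 \
  baby_names_lynda.bn_2002_2012 \
  babynames.csv \
  name:string,gender:string,frequency:integer,year:integer
```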
It looks like we're in. Our dataset is established, and the table is loaded with ten years of baby names. We're ready to analyze our data, which we'll do in the next lesson.
- Why Google Cloud Platform?
- Deploying an app with Google App Engine
- Activating and working with Google Cloud Storage
- Loading, querying, and exporting data with BigQuery
- Working with Cloud Storage buckets
- Managing cloud-based private networks
- Importing and exporting data
- Scheduling backups
- Working with Google Datastore