Learn how to use the BigQuery machine learning SQL language extensions to build a logistic regression model.
- [Instructor] To round out our exploration of GCP machine learning APIs, we're gonna look at something brand new as of this recording. This is an extension to Google's BigQuery data warehousing and SQL query service, and it's called BigQuery machine learning, so it's an extension to BigQuery for machine learning. As of this recording, there are two types of machine learning models that you can create using SQL-like syntax.
These are linear regression, where a model can be used for predicting a numerical value, so how many of something we're likely to sell in the future based on our past sales history, and binary logistic regression, where models can be used for predicting one of two classes or categories, such as identifying whether an email is spam or not. Now, we could use the SQL extension inside of the BigQuery console and set up this machine learning model there, but I thought it might be fun to revisit working with the notebook paradigm.
So in the demo, we're going to work with a notebook and the BigQuery ML syntax. Here we are in the Google Cloud Platform, in our practice project, learning, and what I have done is upload a notebook into Colab. You may remember Colab as the environment where we can execute different types of queries in a Jupyter notebook style format. So what I did is I said File, and Upload notebook.
This notebook has a couple of different sections. The first thing is we have to set our project ID, and to get our project ID we're just going to copy it from right here, and we'll run this cell. Next, we have to authenticate this notebook as a client to our particular account. There are a couple of steps to do that. First, we're going to import the auth module from google.colab, and then we're going to call the authenticate_user method on that auth module.
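For reference, those first two notebook cells look roughly like this (a sketch, not the exact notebook contents; the project_id value is a placeholder for your own project):

    # Cell 1: set the project ID copied from the GCP console
    project_id = 'your-project-id'  # placeholder -- use your own project ID

    # Cell 2: authenticate this notebook against your Google account
    from google.colab import auth
    auth.authenticate_user()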
Notice that I'm signed in to Colab with my Google account. What I need to do now is open a browser link, sign in, click Allow, copy the token, and paste it into the verification prompt. The next step is to create a dataset in BigQuery.
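In the demo I create the dataset through the BigQuery console UI, but if you would rather stay in the notebook, something like this should also work with the BigQuery client library (a sketch; the dataset name bqml_tutorial is an assumption on my part):

    # Optional alternative to the console clicks: create the dataset from Python
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    client.create_dataset('bqml_tutorial', exists_ok=True)  # dataset name is an assumption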
So I'll just do that here. You wanna use this name, and click Create Data Set. And there it is, and of course it's gonna be empty right now. Now, this code section here has a number because I've executed it a couple of times, and it's red because I had some errors; notice that unexecuted cells are gray. The first line uses the %%bigquery cell magic syntax, and I've just put in my project ID there. Then we're creating a model in line two, and that model is of type logistic regression, which is set in line three.
We're using the BigQuery public data from Google Analytics in line 11, and we have a filter for roughly one year of sessions in line 13, from 2016 to 2017. The logic in lines five through nine is there to get rid of nulls; the full query is sketched out below. This takes a couple of minutes, so I'm gonna go ahead and run it, and I'm gonna pause and come back once the model is created. Now, it's interesting, this is a really new service, and although I did get this Python error, I really think it has to do with optimization, because after 229 seconds it says it's timing out here, but if I look in BigQuery, the model was actually created just fine.
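For reference, the cell I'm describing looks roughly like this. It follows the public BigQuery ML sample on the Google Analytics data, so the exact dataset name, model name, and feature columns here are my reconstruction rather than a guaranteed match to what's on screen:

    %%bigquery --project your-project-id
    CREATE MODEL `bqml_tutorial.sample_model`
    OPTIONS(model_type='logistic_reg') AS
    SELECT
      IF(totals.transactions IS NULL, 0, 1) AS label,
      IFNULL(device.operatingSystem, '') AS os,
      device.isMobile AS is_mobile,
      IFNULL(geoNetwork.country, '') AS country,
      IFNULL(totals.pageviews, 0) AS pageviews
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'

The label is 1 when a session had a transaction and 0 when it didn't, which is what makes this a binary logistic regression problem.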
So I expect this is just a bug and it'll be fixed. Let me go ahead and show you what that looks like. If I go over to BigQuery and open up the model, you can see that I have my model details, I have the training options, it's a logistic regression model, and then I had 10 iterations, where we can see the training data loss, the evaluation data loss, and the learning rate over time, how long each iteration took, and here is the model schema.
The next step is to look at the model statistics. I did that in the UI, but I can also get that information from here in the notebook, so let me go ahead and run that, and I basically get the same training information back. Now I'm gonna evaluate the model, looking at the quality of the model, so let me do that; the queries for both of these steps are sketched below. You can see that the precision is 0.47, the recall is 0.10, the accuracy is pretty good at 0.98, and the area under the curve is also 0.98.
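Those two cells look roughly like this (again a sketch based on the public BigQuery ML sample; the model and dataset names are assumptions):

    %%bigquery --project your-project-id
    -- Training statistics, the same per-iteration information the BigQuery UI shows
    SELECT * FROM ML.TRAINING_INFO(MODEL `bqml_tutorial.sample_model`)

    %%bigquery --project your-project-id
    -- Evaluate the model against a later slice of the Google Analytics sample data
    SELECT *
    FROM ML.EVALUATE(MODEL `bqml_tutorial.sample_model`, (
      SELECT
        IF(totals.transactions IS NULL, 0, 1) AS label,
        IFNULL(device.operatingSystem, '') AS os,
        device.isMobile AS is_mobile,
        IFNULL(geoNetwork.country, '') AS country,
        IFNULL(totals.pageviews, 0) AS pageviews
      FROM
        `bigquery-public-data.google_analytics_sample.ga_sessions_*`
      WHERE
        _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))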
So now we wanna actually use our model to make predictions. We wanna predict the count of purchases for each country, and then we wanna do it by visitor ID. What's great about this is that you just put a predict call in the middle of a SQL statement. It really is innovative in terms of letting you use core types of machine learning on data that you already have stored in BigQuery.
When you're running this, you're selecting the country and then the sum of the predicted label as the total predicted purchases from the model, you're dealing with the nulls again, you're looking at the public data, grouping by country, ordering by predicted purchases, and putting a limit of 10 on it so it returns quickly; the query is sketched below. And you can see there it is, the US has the most predicted purchases in this small subset. Then, if we wanted to do the same thing by visitor ID, we run basically the same query, and we can see the results by visitor ID.
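The per-country prediction cell looks roughly like this (same caveat: this is a sketch modeled on the public BigQuery ML sample, with the model and dataset names assumed):

    %%bigquery --project your-project-id
    -- Sum the predicted label per country to get predicted purchase counts
    SELECT
      country,
      SUM(predicted_label) AS total_predicted_purchases
    FROM ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
      SELECT
        IFNULL(device.operatingSystem, '') AS os,
        device.isMobile AS is_mobile,
        IFNULL(totals.pageviews, 0) AS pageviews,
        IFNULL(geoNetwork.country, '') AS country
      FROM
        `bigquery-public-data.google_analytics_sample.ga_sessions_*`
      WHERE
        _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
    GROUP BY country
    ORDER BY total_predicted_purchases DESC
    LIMIT 10

For the per-visitor version, swap country for fullVisitorId in the outer SELECT, the inner SELECT, and the GROUP BY, and the rest of the query stays the same.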
This is a really interesting machine learning extension to the capabilities of BigQuery, a service that so many of us use so frequently.