See how data is prepared for machine learning using Cloud Dataflow and Cloud Dataprep.
- [Instructor] Now although in this course we're really conducting a survey of the available GCP machine learning services, there is another important step in getting business value out of any of these services, and that is putting whatever service or model, whether you use Google's machine learning model or create your own, into a software development lifecycle. This is no different from any other type of working code, and it's important to think about as you move from studying your models to using them to provide value in your products for your customers.
So there are three areas to think about here. First you need example data, and I've been talking about this with my photos example in earlier movies, where we have photos of animals and we need our model to classify which animal is in each photo. You can see that there are three steps to this: fetching, cleaning, and preparing. Cleaning is getting rid of bad data and missing data, and maybe translating data values that are invalid. Preparing is very different in the case of machine learning, because the majority of the machine learning that I work on is supervised.
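To make the cleaning step concrete, here is a minimal sketch in pandas. The file name and column names (image_path, label, width_px) are hypothetical stand-ins, not the course data.

```python
import pandas as pd

# A minimal sketch of the cleaning step; the file and column
# names here are hypothetical placeholders.
df = pd.read_csv("animal_photos_metadata.csv")

# Drop rows where the image path is missing entirely.
df = df.dropna(subset=["image_path"])

# Translate invalid values: treat an empty-string label as missing.
df["label"] = df["label"].replace("", pd.NA)

# Fill a missing numeric field with a sensible default.
df["width_px"] = df["width_px"].fillna(df["width_px"].median())
```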
So we have this need to label the data, and this is something that I find is just glossed over in most machine learning courses. So I'm going to show you in this movie some tools that are part of the GCP suite that I use to help me automate some of the preparation steps. I actually use unsupervised machine learning, along with some other tools, to clean, prepare, and sort data, and sometimes I use it to try to automatically label data.
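As one hedged sketch of that idea, you could cluster unlabeled examples and let the clusters propose labels. The random feature matrix below is just a placeholder for real image embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

# A sketch of using unsupervised learning to propose labels;
# `features` stands in for image embeddings you have extracted.
features = np.random.rand(500, 128)  # placeholder feature vectors

kmeans = KMeans(n_clusters=5, random_state=0, n_init=10)
cluster_ids = kmeans.fit_predict(features)

# A human then reviews a few examples per cluster and names each
# cluster ("cat", "dog", ...), turning cluster IDs into labels.
print(np.bincount(cluster_ids))  # how many examples landed in each cluster
```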
Now there's usually a human aspect to labeling in addition to an automated aspect, and again that could be the subject of a whole course, but I wanted to preview some of these tools because, subsequent to this movie, we're going to start with very clean, pre-labeled data sets, and that just does not reflect the real world. So I wanted to cover that aspect. Once you have your data, you're going to select your model, which can be provided by Google in the case of an API such as Vision, or you select the model type, for example logistic regression if you want to create a classifier.
You're going to then provide your model with training data, train your model, and importantly tune the parameters, which are often called hyperparameters. These would be values like how many images, in this case, go in a batch, or what probability threshold matters for your business case. Does a prediction need a probability of 70% for a category, or does it need 95%? How much error can you tolerate? This goes to the evaluation metrics around your model.
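Here's a minimal sketch of those select, train, and tune steps using a logistic regression classifier. It trains on a stand-in dataset (iris) rather than the photo data, and the 0.70 and 0.95 thresholds just echo the business-tolerance examples above; they are not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data; in the photo scenario this would be image features.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)  # max_iter is one tunable hyperparameter
clf.fit(X_train, y_train)

# Tune the decision threshold to match your tolerance for error.
top_prob = clf.predict_proba(X_test).max(axis=1)
for threshold in (0.70, 0.95):
    accepted = top_prob >= threshold
    print(f"threshold={threshold}: accept {accepted.mean():.0%} of predictions")
```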
Once you meet your metrics, and of course a tricky part of this is defining them, particularly if machine learning is new, and, again, notes from the real world, I'm often working on that with my customers, then you deploy to production. You need to have some sort of hosting at scale, and this again can be very challenging. For example, I'm working with a bioinformatics use case now, where we built a model that ran successfully in a Docker container, but the challenging part is figuring out how to size for the workloads, which are massive given the size of the genome, in production situations. So we're in the deploy-the-model phase, and we've been there for a little bit of time. And then you want to monitor and collect data.
So to that end, some of the tools that I use in the GCP suite, in this model lifecycle, have the word data in their title: Dataprep, Dataflow, and Dataproc. Again, these are very powerful tools, and there's some great information about them on the Google site, but I just want to show you. I've run some examples to get you interested in these tools, because preparing the data is an important part of successful machine learning. So Dataprep is managed extract, transform, and load, or ETL (or the other way around, extract, load, and transform), container clusters with a GUI interface.
Dataflow is managed ETL container clusters, and Dataproc is managed Hadoop and Spark clusters. So back in our sample project, if I click on the menu here and scroll down, you can see that these tools are grouped together under the big data category: Dataproc, Dataflow, and Dataprep. I like to start with Dataprep. When you look at it, Dataprep looks kind of like a reporting tool, but I have used it very successfully as a data preparation, or cleansing, tool.
So what I did is I preloaded some data and ran a sample, and it's really most easily and intuitively understood by looking at it. Underneath the hood, if you're making transformations, it will generate Dataflow jobs, and so I've also prerun a Dataflow conversion job, and we'll look at both of those now. So here are some results from a sample in Cloud Dataprep, and you can see inside of here that you have some great information about the validity of your data. We have 97% valid, zero mismatched, 3% missing, how many columns, how many rows, and we had run a job on the CSV file that took about 23 minutes.
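If you wanted a rough, do-it-yourself equivalent of that validity profile, you could compute it in pandas. This is only an analogy to what Dataprep shows, and "election.csv" is a hypothetical stand-in file.

```python
import pandas as pd

# A rough analogue of Dataprep's profile: rows, columns, and the
# share of missing cells (this ignores type mismatches).
df = pd.read_csv("election.csv")
rows, cols = df.shape
missing = df.isna().mean().mean()  # overall fraction of missing cells
print(f"{rows} rows, {cols} columns, "
      f"{missing:.0%} missing, {1 - missing:.0%} non-missing")
```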
We can look at the data sources, and this is called a flow. Inside of here you can see, on the various columns, great information that can help you prepare data for machine learning. In this case we've got some IDs, candidate names, party affiliations, total contributions (this is political data), and average contributions, and I can drill in and look at any of this data in more detail. It's a fantastic interface where you can manipulate the data, you can remove data, and you can work side by side with your business subject matter experts, because this interface is designed to be presented to people who are focusing on the data and not the technology.
Now as I said, when you go through and manipulate these results and say, okay, I want to get rid of one of these values or something in this interface, that will generate a job, and the job is run through Dataflow. Let me show you what a job looks like. Inside of here, I basically used a template, so if you want to explore this, you can pick these templates, and they will help you move data between the different stores; there's data transformation as well, so it's a great way to explore. Inside of here, I ran the Dataprep getting-started example, which is on the election data, and then I just manipulated some of the data and created a new job.
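If you'd rather launch one of those provided Dataflow templates from code instead of the console, a sketch might look like the following. This assumes the Google API Python client and default credentials; the project ID, job name, and output bucket are hypothetical placeholders, and Word_Count is one of Google's published templates.

```python
from googleapiclient.discovery import build

# Launch a provided Dataflow template; values below are placeholders.
dataflow = build("dataflow", "v1b3")
response = dataflow.projects().templates().launch(
    projectId="my-project",
    gcsPath="gs://dataflow-templates/latest/Word_Count",
    body={
        "jobName": "wordcount-from-template",
        "parameters": {
            "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
            "output": "gs://my-bucket/wordcount/output",
        },
    },
).execute()
print(response["job"]["id"])  # ID of the job that was spun up
```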
So what happened here is a distributed job that ran successfully, and you can see it automatically spun up the resources that were needed; they're called pools. If I scroll down, you can see it's quite a complex job. It's encapsulating, for example, this step, which was a join and took 11 minutes, and it's encapsulating distributed mappers and a group by, which took 10 minutes. Inside the group by we had unioning, flattening, and so on and so forth, and I can drill into the individual steps and see how much time each of them took.
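Those join and group-by steps map onto transforms you could also write yourself in Apache Beam, which is the SDK that Dataflow pipelines are built with. Here's a minimal sketch that joins two keyed collections with CoGroupByKey, run locally with the default DirectRunner; the candidate IDs, names, and totals below are made-up placeholders, not the election data.

```python
import apache_beam as beam

# A minimal join of the kind a generated job encapsulates:
# CoGroupByKey groups two keyed PCollections by their shared key.
with beam.Pipeline() as pipeline:
    names = pipeline | "Names" >> beam.Create([("C001", "Ada"), ("C002", "Grace")])
    totals = pipeline | "Totals" >> beam.Create([("C001", 1200), ("C002", 450)])
    (
        {"name": names, "total": totals}
        | "Join" >> beam.CoGroupByKey()   # ('C001', {'name': ['Ada'], 'total': [1200]})
        | "Print" >> beam.Map(print)
    )
```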
Again, Dataprep is so powerful I could do an entire course on it. It's a tool that I use for data preparation, for regular business reporting, and also for machine learning, so I wanted to at least make you aware of it. Again, it's on the menu down here, in the big data section: Dataproc, Dataflow, and Dataprep, three super useful services for getting your data ready for machine learning on the Google Cloud.