Learn about engineering new features from existing data.
- [Instructor] When using supervised learning to solve a problem, we feed features into a machine learning algorithm, and the algorithm learns how to predict the correct output based on those features. This is a simple idea, but in real-life applications it can take a lot of trial and error to figure out which features are most useful for modeling the problem. Feature engineering is using our knowledge of the problem to choose features or create new features that allow machine learning algorithms to work more accurately. Feature engineering will consume the majority of your time when you are building supervised learning systems.
Doing a good job of feature engineering will make a large difference in the quality of your model. To get the best result possible when training a machine learning algorithm, we want to make the problem as simple as possible for the algorithm to model. That means we want to feed in features that correlate strongly with the output value. In fact, including useless features can harm the accuracy of our system. Let's look at an example. Here's a table listing the size of each house, and the number of potted plants in those houses. It's likely that the size of the house helps determine the value of the house. We know that based on our domain knowledge of the housing market.
So that would be a good feature to include in our model. But it's also likely that the number of potted plants doesn't impact the final value of the house. Potted plants are just decoration: they're inexpensive, and they're easy to add and remove. Very few people are going to base their home purchase decision on the number of potted plants in the house. In a case like this, where a feature seems to introduce random noise into the model without telling us anything about the value we're trying to predict, we should just remove the feature from our model. This is an example of feature engineering.
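In code, removing a noisy feature usually just means dropping the column before training. Here's a minimal sketch using pandas; the data and column names are hypothetical, not from the course files.

```python
import pandas as pd

# Hypothetical housing data (illustrative values, not real listings).
houses = pd.DataFrame({
    "size_sqft": [1200, 1850, 2400],
    "num_potted_plants": [3, 0, 7],
    "sale_price": [150000, 210000, 295000],
})

# The potted-plant count is just noise, so drop it from the feature set.
X = houses.drop(columns=["num_potted_plants", "sale_price"])
y = houses["sale_price"]
```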
Let's look at some different strategies for feature engineering. So far, we've talked about adding and dropping features from our model as a way to improve accuracy. Deciding which features to include or exclude is the simplest form of feature engineering. We can also combine multiple features into a single feature. The goal is to represent the data in the simplest way possible. Let's look at an example. Here's a table showing the height of each house in feet and inches. Height could be a very useful measurement in determining the value of a house, but feeding in separate numbers for feet and inches makes the model more complicated than necessary.
If we feed in this single measurement as two separate features, the algorithm has to figure out that those two numbers are related and part of a single measurement. It's much better to pre-process our data and replace each height measurement with a single measurement in a single unit. We can just convert the measurements to inches.
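As a quick sketch, assuming the height is stored in two hypothetical columns, the conversion might look like this:

```python
import pandas as pd

# Hypothetical data: each height is split across a feet column and an inches column.
houses = pd.DataFrame({
    "height_feet": [24, 31, 18],
    "height_inches": [6, 2, 9],
})

# Replace the two related columns with a single measurement in a single unit.
houses["height_total_inches"] = houses["height_feet"] * 12 + houses["height_inches"]
houses = houses.drop(columns=["height_feet", "height_inches"])
```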
Another feature engineering strategy is binning. Binning is where you take a numerical measurement and convert it into a category. Let's look at an example. Here we have a table that lists houses and the lengths of their swimming pools. It could be that the exact size of the pool doesn't matter nearly as much as whether or not the house has a pool. Some buyers want houses with pools and some don't. So in this case, we can pre-process the data and replace the numeric pool size feature with a true/false feature. This simplifies the model.
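Here's a minimal sketch of that binning step, assuming a hypothetical pool_length_ft column where zero means the house has no pool:

```python
import pandas as pd

# Hypothetical data: pool length in feet, with 0 meaning the house has no pool.
houses = pd.DataFrame({"pool_length_ft": [0, 30, 0, 45]})

# Bin the numeric measurement into a simple true/false category.
houses["has_pool"] = houses["pool_length_ft"] > 0
houses = houses.drop(columns=["pool_length_ft"])
```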
Our final strategy for feature engineering is called one-hot encoding. When we have categorical data in our model, we have to process it before we can use it. One-hot encoding is a way to represent categorical data in a form that the machine learning model can understand. Let's look at an example. In our housing data set, one example of categorical data is the neighborhood name. We can't feed a neighborhood name directly into our model because it's a string of text and not a number. Instead, we need a way to represent each neighborhood as a number. The simplest solution would be to assign a number to each neighborhood, but this doesn't work very well with some machine learning algorithms. The problem is that the machine learning algorithm will assume the order of those numbers is significant, treating bigger numbers as more important than smaller ones. But Neighborhood ID 2 isn't twice as important as Neighborhood ID 1. The order of those numbers is actually meaningless. The solution is to use a different representation called one-hot encoding.
Let's look at the original list of houses again. Across all three houses, there are two different neighborhoods: Normaltown and Skid Row. In one-hot encoding, we create a new feature in our data set for each unique category in the categorical data. Here we've created one feature for Is_Normaltown and one feature for Is_Skidrow. Then we set each of those to one or zero, depending on whether the house is in that neighborhood. This is called one-hot encoding because exactly one of those values will be one, or hot, for each house. One-hot encoding is useful for replacing categorical data with simple numeric data that the machine learning model can easily understand.
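In pandas, the get_dummies function handles this transformation. Here's a minimal sketch using the neighborhoods from the example; the column name and prefix are just illustrative:

```python
import pandas as pd

# Hypothetical data with the neighborhoods from the example above.
houses = pd.DataFrame({
    "neighborhood": ["Normaltown", "Normaltown", "Skid Row"],
})

# get_dummies creates one indicator column per unique category;
# astype(int) makes the values plain ones and zeros.
one_hot = pd.get_dummies(houses, columns=["neighborhood"], prefix="Is").astype(int)

# The result has columns Is_Normaltown and Is_Skid Row,
# and exactly one of them is 1 (hot) in each row.
```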