Join Barton Poulson for an in-depth discussion in this video, Data for data reduction, part of Data Science Foundations: Data Mining.
- [Narrator] One of the big decisions you have to make with data reduction is which algorithms you're going to use for the procedure. There are two general categories of choices. The first is linear methods, which draw straight lines through the data and use relatively simple linear equations. Then there are nonlinear methods, which are generally for high-dimensional manifolds and use more complex equations. I'll explain that in just a moment. But let's start with the linear methods. Let's say we've got a little scatter plot of data here on two dimensions, x and y.
Now, the most common linear method is principal component analysis, or PCA. PCA is designed to reduce the number of variables, the number of dimensions. It does this by trying to maximize the variability of the data in the lower-dimensional space into which it's projected. Now, if you look at the scatter plot, one thing we can do is run a regression line through it. Then we can find the distance of each point from the regression line as a perpendicular segment, not the vertical segments that you use for regression residuals, but perpendicular.
And the idea here is that we've got a lot of variability on both x and y; each goes from zero to 100, pretty much. If we then rotate the data onto the regression line, flattening it out but maintaining those distances, we still have a lot of variability on x, the first dimension going left to right, but a much smaller amount of variability on y. Essentially, we're collapsing one dimension by combining the diagonal effect of the other two. Now, there are a couple of considerations about principal component analysis. Number one is whether you're going to do what's called rotation.
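The projection just described can be sketched in a few lines of Python. This is a minimal illustration using scikit-learn and simulated data (the video doesn't name a library, so both are my choice here, not the author's):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Simulate correlated data: x and y each span roughly 0 to 100.
x = rng.uniform(0, 100, 200)
y = 0.8 * x + rng.normal(0, 10, 200)
data = np.column_stack([x, y])

# Fit PCA and project the points onto the principal axes --
# the "rotate onto the regression line" step described above.
pca = PCA(n_components=2)
scores = pca.fit_transform(data)

# Nearly all of the variability lands on the first component;
# the second, perpendicular direction carries very little.
print(pca.explained_variance_ratio_)
```

Because the two simulated variables are strongly correlated, the first component absorbs most of the variance, which is exactly why the second dimension can be collapsed with little loss.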
The idea here is that you get multiple components from your solution. It actually gives you as many components as there are dimensions, but some of them are going to be more important, or account for more of the variability, than the others. You can take these solutions and actually rotate them. There's a geometric analogy here: you twist the axes a little bit to make the components easier to interpret. Now, there's also the related procedure of factor analysis. It's closely related, and in fact the two get confused in some programs, like SPSS.
There, the factor analysis command uses principal components as its default method. But factor analysis is based on a different theory. The idea is that the underlying factors come first and the observed variables are manifestations of them, whereas principal components deals only with the observed variables and their observed variance. Now, the third issue, and probably the most important here, is interpretability. For human use, that is, when a person is going to look at the results and use them to make decisions, the ability to interpret the results of the dimensionality reduction, or data reduction, is critical.
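The contrast just drawn between the two procedures can be sketched as follows. This is an illustrative comparison using scikit-learn's implementations (my choice of library; the simulated one-factor data is also hypothetical), where a single latent factor generates the observed variables, matching the factor-analysis view:

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(0)
# Factor-analysis view: one latent factor comes first, and the four
# observed variables are noisy manifestations of it.
factor = rng.normal(size=(300, 1))
loadings = np.array([[0.9, 0.8, 0.7, 0.6]])
observed = factor @ loadings + rng.normal(scale=0.3, size=(300, 4))

# PCA summarizes the observed variance directly; FactorAnalysis
# models the latent factor (and supports rotation="varimax").
pca = PCA(n_components=1).fit(observed)
fa = FactorAnalysis(n_components=1).fit(observed)

print("PCA component:", pca.components_)
print("FA loadings:  ", fa.components_)
```

On data like this the two give similar-looking loadings, which is part of why the procedures get confused in practice, even though the underlying theories differ.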
That's less so for machine learning, which gets us into our next topic: nonlinear methods for data reduction, specifically nonlinear methods for dimensionality reduction. These are really useful when you have what's called a nonlinear manifold. I used that term earlier; let me give you an idea of how this works. Take a capital I. Essentially, it's supposed to be a one-dimensional straight line that has length but no width and no thickness. It's got thickness here so you can see it, but it's one-dimensional, and that is a very simple thing to deal with.
A capital S, on the other hand, is still a line, but it's curved. So it's fundamentally a one-dimensional drawing, but now it has both height and width. You now have a one-dimensional shape that's been placed in a higher-dimensional space. That makes the S a manifold, because it's embedded in a higher-dimensional space. Nonlinear methods are very common in topics like computer vision and a number of other machine learning areas.
The trick is, they're pretty sophisticated to do, and the results are difficult to interpret. You can get the relative importance or contribution of variables, but it's hard to get the actual weights or interpret how each one contributes. In terms of nonlinear dimensionality reduction, you have a few choices. For instance, there's a variation of principal component analysis called kernel PCA that uses the so-called kernel trick as a way of analyzing high-dimensional data.
And there are several related methods for choosing the kernel. There's Isomap, there's something called locally linear embedding, and there are a lot of other variations. A more interesting one, with a sort of semi-flexible kernel, is maximum variance unfolding. There are probably 20 or 30 different choices. They work in different circumstances, but the trick is they're all pretty complicated, and it's hard to interpret the results. If you're using a straight black-box method, that doesn't matter, but if a human's involved, you're going to want to emphasize interpretability.
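Two of the methods named above can be sketched on exactly the kind of S-shaped manifold the video describes. This is a hedged illustration using scikit-learn's built-in S-curve generator and its kernel PCA and Isomap implementations (the library and all parameter values here are my assumptions, not the author's):

```python
from sklearn.datasets import make_s_curve
from sklearn.decomposition import KernelPCA
from sklearn.manifold import Isomap

# 3-D points lying on a 2-D S-shaped surface -- the "capital S"
# embedded in a higher-dimensional space.
points, position_on_curve = make_s_curve(n_samples=500, random_state=0)

# Kernel PCA uses the kernel trick to capture the nonlinear structure.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.5)
flat_kpca = kpca.fit_transform(points)

# Isomap instead preserves geodesic distances measured along the
# manifold itself rather than straight-line distances through space.
iso = Isomap(n_components=2, n_neighbors=10)
flat_iso = iso.fit_transform(points)

print(flat_kpca.shape, flat_iso.shape)
```

Both methods flatten the curved surface into two dimensions, but notice that the outputs are abstract embedding coordinates: as the narration warns, there are no loadings or weights to read off, which is the interpretability cost of nonlinear methods.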
When you're doing data reduction, or dimensionality reduction, there are many different algorithms available, ranging from linear to nonlinear, with many versions of each. Principal component analysis is probably the most common overall, and it's relatively easy to interpret. And you'll find that interpretability, and simply the need for interpretability, is going to have a strong influence on your choice among methods for data or dimensionality reduction.
Barton Poulson covers data sources and types, the languages and software used in data mining (including R and Python), and specific task-based lessons that help you practice the most common data-mining techniques: text mining, data clustering, association analysis, and more. This course is an absolute necessity for those interested in joining the data science workforce, and for those who need to obtain more experience in data mining.
- Prerequisites for data mining
- Data mining using R, Python, Orange, and RapidMiner
- Data reduction
- Data clustering
- Anomaly detection
- Association analysis
- Regression analysis
- Sequence mining
- Text mining