- The algorithm you choose, the method you use for measuring the association between variables, can affect the meaning and interpretation of the results substantially. And there are two general classes. One is classical methods, or algorithms, for regression. These are methods that are based on means, or averages, and squared deviations from predicted values. There is also a very broad category of what you can call modern methods. These are alternative methods for calculating distance and for choosing between predictors that may be correlated with each other.
In terms of classical methods, there is simultaneous entry, where you simply take a whole bunch of variables, throw them in all at once, and see how they work together as an ensemble. You could also do blocked entry, where you choose a group of variables, put them in, then add a second group, and then a third group. Or there's stepwise entry. This is an automated procedure whereby the computer chooses the one variable that has the highest correlation with the outcome and puts that in. Then, using what are called partial correlations, it chooses the remaining variable with the highest unique association, puts that in, and so forth.
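As a rough illustration (not from the course itself), here's what simultaneous entry might look like in Python with scikit-learn; the data file and column names are hypothetical.

```python
# A minimal sketch of simultaneous entry: all predictors go in at once.
# The data file and column names here are hypothetical.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("sales.csv")                  # hypothetical dataset
X = df[["ad_spend", "price", "store_size"]]    # all predictors, entered together
y = df["revenue"]                              # the outcome

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)           # the fitted weights of the ensemble
```

Blocked entry would be the same idea applied in stages: fit one group of columns, then refit with the second group added, and so on.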
It sounds like a nice way to do things, kind of handing it over to the data. But stepwise entry, in many situations, is massively prone to overfitting: you get models that fit only those exact data, capitalizing on chance. That's a problem, and so most people don't recommend stepwise entry. In fact, they recommend against it rather strongly. There are also non-linear methods. So if you have a curvilinear relationship, even within classical methods, there are ways of dealing with that, usually by transforming a variable or taking a power of the variable.
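To make that concrete, here is a minimal sketch, with made-up data, of fitting a curvilinear relationship by adding a squared term; the numbers and names are assumptions for illustration.

```python
# A sketch of handling a curvilinear relationship within classical regression:
# take a power of the predictor (a squared term) and fit ordinary least squares.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))                            # made-up predictor
y = 3 + 2 * x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(size=200)  # curved outcome

X_curved = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_curved, y)    # still a classical linear model
print(model.coef_)                             # weights on x and x squared
```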
In the class of modern methods, there's LASSO regression, which stands for Least Absolute Shrinkage and Selection Operator. It's a nice way of doing something similar to stepwise regression, but with much less risk of overfitting and the breakdown in generalization. There's also Least Angle Regression, or LARS, which is related in some ways. And there's RFE, which is Recursive Feature Elimination, sort of like a stepwise procedure, but it's actually in a class of embedded methods, and it's often used with support vector machines for machine learning.
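Here is a hedged sketch of LASSO in scikit-learn, on synthetic data where only a handful of the candidate predictors actually matter:

```python
# A sketch of LASSO: the penalty can shrink some coefficients exactly to zero,
# selecting variables without stepwise entry's overfitting problem.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic data: 20 candidate predictors, only 5 truly informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = LassoCV(cv=5).fit(X, y)        # cross-validation picks the penalty strength
kept = np.flatnonzero(lasso.coef_)     # predictors that survived selection
print("selected predictors:", kept)
```

Recursive Feature Elimination is available along the same lines as sklearn.feature_selection.RFE, which you wrap around an estimator such as a support vector machine.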
And on the same topic of machine learning, very similar to what's called a support vector machine, or an SVM, there is the support vector regressor, or SVR. It uses very advanced, high-dimensional calculations based on what's called the kernel trick to find a hyperplane, sort of a flat surface, that can fit the data and predict values very cleanly. On the other hand, RFE and especially support vector regression can be very hard to interpret. In fact, when you're looking at these various methods, there are a few things you want to think about.
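For a sense of what SVR looks like in code, here is a minimal sketch on made-up data, using the common RBF kernel:

```python
# A sketch of support vector regression with an RBF kernel: the kernel trick
# lets the model capture a nonlinear pattern without constructing the
# high-dimensional feature space explicitly.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(150, 1))                  # made-up predictor
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=150)  # noisy nonlinear outcome

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print(svr.predict([[0.5]]))                            # should land near sin(0.5)
```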
Number one is: how well can this method explain the current data? How well can it model the association between the predictors I have in front of me and the outcome I have in front of me? Some are better at that than others. However, they might do it by overfitting, which is a real problem. And that gets us to the next one: how well does each method generalize to new data? What you find is that the modern methods are usually much better suited to generalization problems. They often have cross-validation built in as a way of checking how well the model holds up beyond the original data.
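As a sketch of that idea, cross-validation scores a model only on data it never saw during fitting, so you can compare generalization directly; the data here are synthetic:

```python
# A sketch of using cross-validation to compare generalization, not just fit:
# each model is scored on held-out folds it never saw during fitting.
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=150, n_features=30, n_informative=5,
                       noise=15.0, random_state=0)

for name, model in [("OLS", LinearRegression()), ("LASSO", LassoCV(cv=5))]:
    scores = cross_val_score(model, X, y, cv=5)   # R-squared on held-out folds
    print(name, round(scores.mean(), 3))
```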
Now there's the issue of ease of calculation, in that a lot of the classical methods were actually built to be calculated by hand. That does make them easier to explain and easier to demonstrate, but given that nobody does this stuff by hand anymore, everything's done by computers, and computers get faster and faster over time, that's essentially a non-issue. On the other hand, there is the issue of ease of interpretation. Can you explain what it all means? That might be really important. And then, perhaps ultimately, there's the ease of application.
Can you take the results you get from your model and do something useful with them? For many people, nothing else matters as much: does it apply to new data, and can I use it to generate new insights? There's a wide selection of both classical and modern algorithms, each with different strengths. The problem is that some of these methods, especially the classical ones, are prone to overfitting and have problems with generalization. Then again, what's probably even more important is the ability to both interpret and apply the results in a useful situation, to get you some extra insight into what's going on with your data.