- [Voiceover] Let's take a look at some of the common problems in modeling, that is, when you're trying to build statistical models. The most significant is that data science can get complicated. For instance, you might have non-normality in your data. Most approaches assume bell curves, and if you have something else, you've got a problem. You might have nonlinearity in the association between variables. Straight lines are easier to deal with; curves and other shapes are more difficult. You might have multicollinearity, which I've mentioned previously. This is when your predictors are associated with each other, so there's overlap in their associations with the outcome variable.
And you might have missing data, which introduces a host of its own issues. Let's look at each one separately. Non-normality is a problem because most approaches expect bell curves. Here's a nice, normal bell curve distribution. But you might have a strongly skewed distribution. Or you might have mixed distributions that are bimodal or have other shapes. Also, you could have outliers. And what you find is that non-normality distorts the measures and the models that you use.
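As a quick illustration of that distortion, here's a minimal Python sketch, with invented numbers, showing how a single outlier drags the mean around while the median barely moves:

```python
import numpy as np

values = np.array([4.8, 4.9, 5.0, 5.1, 5.2])  # roughly symmetric data
print(values.mean(), np.median(values))        # both are 5.0

with_outlier = np.append(values, 50.0)         # add one extreme case
print(with_outlier.mean(), np.median(with_outlier))
# the mean jumps to 12.5; the median only moves to 5.05
```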
Now, there are a few things that you can do. You can try transforming the data, where you take an asymmetrical distribution and try to push it toward symmetry. You can also try to separate mixed distributions, and then hopefully you have two normal distributions to work with.
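Here's a hedged sketch of both ideas in Python: a log transform to pull in a right-skewed variable, and scikit-learn's GaussianMixture to split a bimodal sample into two components. The simulated data stands in for whatever your real measurements are.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# 1. Transform: a log often tames right-skewed data
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
transformed = np.log(skewed)          # now roughly bell-shaped

# 2. Separate: fit a two-component mixture to bimodal data
bimodal = np.concatenate([rng.normal(0, 1, 500),
                          rng.normal(6, 1, 500)])
gm = GaussianMixture(n_components=2).fit(bimodal.reshape(-1, 1))
labels = gm.predict(bimodal.reshape(-1, 1))   # component label per case
print(gm.means_.ravel())                      # roughly 0 and 6
```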
Nonlinearity is when you have a curved association. Most approaches, like regression, want to draw straight lines, but the data may show a very strong, even perfectly curved, shape. Linearity is a common assumption of these methods. One thing you can do is transform a variable, either one of them or both, and see if that straightens things out. Also, if you have a growth curve, you can add a polynomial term, taking the square, or the cube, or some other power that helps straighten out that particular curve.
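As a sketch of the polynomial idea, assuming a roughly quadratic growth curve, you can compare a straight-line fit against a fit with a squared term using NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2 + 0.5 * x**2 + rng.normal(0, 2, size=x.size)  # curved data + noise

line = np.polyfit(x, y, deg=1)   # straight line: poor fit to a curve
quad = np.polyfit(x, y, deg=2)   # adds the squared term

for coefs in (line, quad):
    resid = y - np.polyval(coefs, x)
    print(coefs.round(2), (resid**2).mean().round(2))
# the degree-2 fit recovers the curve and shrinks the residuals
```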
As for multicollinearity, which I've addressed previously, that's the problem of having correlated predictors that overlap each other in trying to explain the outcome variable. The problem is that this can distort the coefficients. Some procedures are less affected by this than others, but one interesting finding is that sometimes, just by using fewer variables, you can get a model that is still very good at predicting and, more significantly, is stable and interpretable. Also, not every decision has to be based on the data; you can use theory to choose between the possible collections of variables.
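A hedged way to see that overlap, using made-up predictors, is to look at the correlations among them; when two predictors are nearly redundant, keeping just one often costs little predictive power but buys a lot of stability:

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)                  # an independent predictor

X = np.column_stack([x1, x2, x3])
print(np.corrcoef(X, rowvar=False).round(2))
# x1 and x2 correlate near 1.0, so one of them is a candidate to drop
```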
Leaning on theory that way is one of the reasons that data science combines computer programming, math and statistics, and substantive expertise: the theory helps guide your analyses. Two other problems come up. One, which I've mentioned before, is the combinatorial explosion. This is when the number of combinations of variables or categories grows too fast. For instance, if you have four variables with two categories each, there are only 16 combinations. You can explore all of those; that's not a big deal.
But if you have 20 variables with five categories each, which is very easy to do, you actually have about 95 trillion combinations. Obviously, you can't cycle through all of them, so you have to find a way to prune things down. Theory, again based on substantive expertise, can be a great way of guiding you here. Also, what are called Markov chain Monte Carlo methods, or MCMC, are a common approach.
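The arithmetic behind those figures is just exponentiation, which you can check in a couple of lines of Python:

```python
print(2 ** 4)    # 4 binary variables -> 16 combinations
print(5 ** 20)   # 20 five-category variables -> 95,367,431,640,625
# roughly 95 trillion cells: far too many to examine one by one
```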
Now, it's a whole other discussion to talk about how MCMC methods work specifically, but they can be a solution to intractable problems where you can't go through every possible combination. There's also something called the curse of dimensionality. This is when a phenomenon you're looking at only occurs in combinations of many dimensions; it's a high-dimensional phenomenon. That makes it difficult to explain and difficult to predict, so you may want to try to reduce the dimensionality, and, as with the combinatorial explosion, Markov chain Monte Carlo methods can be helpful here.
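To give just a flavor of MCMC, here's a minimal sketch of the Metropolis algorithm, the simplest MCMC method, which samples from a distribution by wandering through it rather than enumerating every point. The simple one-dimensional target here is a stand-in for the high-dimensional problems you'd actually face:

```python
import numpy as np

rng = np.random.default_rng(3)

def target(x):
    """Unnormalized density to sample from (a standard normal here)."""
    return np.exp(-0.5 * x**2)

samples, x = [], 0.0
for _ in range(10_000):
    proposal = x + rng.normal(scale=1.0)          # random local step
    if rng.random() < target(proposal) / target(x):
        x = proposal                               # accept the move
    samples.append(x)                              # else keep current x

print(np.mean(samples), np.std(samples))           # near 0 and 1
```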
Missing data can distort analyses and can actually create bias in your results. One of the things you need to do is check for patterns in missingness. Is there a particular group of cases that are missing values on a variable? If so, you need to try to understand why values might be missing for those cases but not for others. Now, if there's no real pattern, you can simply delete the missing cases, but that situation is unusual. Another approach is to impute the missing values, and there are several methods for this, some of which work much better than others.
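In pandas, a hedged first pass might look like this; the DataFrame and its columns are invented for illustration:

```python
import numpy as np
import pandas as pd

# Invented example data: a few cases are missing values
df = pd.DataFrame({"age":    [34, 29, np.nan, 41, np.nan],
                   "income": [52_000, np.nan, 48_000, 61_000, 58_000]})

# Check for patterns in missingness before doing anything else
print(df.isna().sum())              # how many missing per variable
print(df[df.isna().any(axis=1)])    # which cases are affected

# One simple option: impute with the median (fancier methods exist)
df_imputed = df.fillna(df.median())
print(df_imputed)
```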
From this brief discussion, we can reach a couple of tentative conclusions. Number one, not surprisingly, data can introduce complications when you're trying to do your modeling. Those can include things like ambiguity in the responses, ambiguity in the models, bias due to missing data, and violated assumptions due to non-normality. And you can use analytical methods, but you can also use the substantive theory of your domain, to resolve some of these complications.