- [Voiceover] Once you've done exploratory graphics, the next step is exploratory statistics, or numerical exploration of your data. The general principle, again, is this: first graphics, then numbers. With exploratory statistics, it's important to remember that you're still exploring; you're not modeling the data. Also, you tend to use empirical estimates, that is, estimates of the population based on your sample data, as opposed to theoretical or analytical estimates. You can also check the effects of manipulating the data, and you can see how the data react to variation.
Now, there are three general categories of methods used in exploratory statistics: robust statistics, resampling the data (or resampling statistics), and transforming the data (or transformations). Let's look at each of these in turn. First, robust statistics. I use a mountain as the image here because they're not easily moved. They're robust. Robust statistics are stable in the presence of anomalies; they're less affected by outliers, skewness, kurtosis, and so on.
There are a lot of choices, which can include, for instance, the trimmed mean, the median, the winsorized mean, the interquartile range, and the median absolute deviation. I'm gonna add they're not always easy to do; most statistical programs are not automatically set up for these, but there are packages in, for instance, R or Python that make them easier. Here's an example with some skewed data. What I have here is a data set where most of the values are really low and I've got outliers going way up high. And then what I'm going to do is take the trimmed mean and something called the winsorized mean.
With a trimmed mean, you take a certain percentage of the data on the top and the bottom and you just throw it away. With the winsorized mean, you take that and replace it with the closest non-outlying value. And you can see how when we take the zero percent, which is really no adjustment, the overall mean is 1.24. At five percent, you see they start to go down, and the trimmed mean goes down a little more. Then we throw away 10% on each end, and then 25%. And ultimately, at 50%, it's the same thing as the median; it's the middle score, and in this case it's 1.01. You can make judgment calls about which version you think would be most informative in your particular analysis.
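If you want to try these robust statistics yourself, here's a minimal sketch in Python using NumPy and SciPy. The small data set is made up for illustration, so the exact values won't match the ones in the video.

```python
# Robust statistics on a small, right-skewed data set (illustrative values only).
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

x = np.array([1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 1.1, 6.5, 9.0, 12.0])

print(np.mean(x))                                   # ordinary mean, pulled up by the outliers
print(stats.trim_mean(x, 0.10))                     # 10% trimmed mean: drop the top and bottom 10%
print(np.mean(winsorize(x, limits=(0.10, 0.10))))   # 10% winsorized mean: replace extremes instead
print(np.median(x))                                 # median, i.e., the 50% trimmed mean
print(stats.iqr(x))                                 # interquartile range
print(stats.median_abs_deviation(x))                # median absolute deviation
```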
And you may choose to skip them entirely, but it's something to be aware of as an option. Next is resampling, or "bootstrap samples." These are empirical estimates of sampling variability: instead of relying on a theoretical standard error, you draw repeated samples from your own data and use the spread of the results as an estimate of the variability. Some common examples include the jackknife, where you use subsets of the data, sampling without replacement. You can use the bootstrap, where you draw samples with replacement.
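As a concrete illustration of the bootstrap, here's a minimal Python sketch with NumPy; the data, the statistic (the median), and the number of resamples are all placeholders you'd choose for your own analysis.

```python
# Bootstrap estimate of sampling variability (illustrative values only).
import numpy as np

rng = np.random.default_rng(42)
x = np.array([1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 1.1, 6.5, 9.0, 12.0])

# Draw many bootstrap samples (sampling with replacement, same size as the data)
# and compute the statistic of interest, here the median, on each one.
boot_medians = np.array([
    np.median(rng.choice(x, size=x.size, replace=True))
    for _ in range(5000)
])

# The spread of the bootstrap distribution is an empirical estimate
# of how much the median would vary from sample to sample.
print(boot_medians.std())                        # bootstrap standard error
print(np.percentile(boot_medians, [2.5, 97.5]))  # rough 95% percentile interval
```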
You can do permutation, where you shuffle cases across different groups. And also, the machine learning process of cross-validation is at least conceptually related to resampling as a way of checking the consistency of results. Next is transforming the data. Now, what you're doing here is looking for smooth functions, meaning functions without big jumps in them, that preserve the order of your data and allow you to use the full data set. A lot of times, transformations are used to fix skewed data or to fix a curved line in a scatter plot.
One common method is something called Tukey's ladder of powers, named for the statistician John Tukey. On the ladder, x is the original value; you can move up the ladder by squaring or cubing it, or move down by taking the square root, the logarithm, and so on. This is what it looks like when you apply it to data. The third distribution from the right is the original data set, which is mostly symmetrical. You can see how squaring it reshapes the distribution and gives you some outliers on the top end, and cubing it does so even more. At the far end, taking the reciprocal of the square root produces a very different effect.
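Here's a minimal Python sketch of the ladder using NumPy and SciPy. The right-skewed data are made up for illustration, and the negative reciprocals are the conventional way to keep the transformed values in the same order as the originals.

```python
# Tukey's ladder of powers applied to made-up, right-skewed data.
import numpy as np
from scipy.stats import skew

x = np.array([1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 4.0, 6.0, 10.0, 20.0])

ladder = {
    "cube":       x**3,             # up the ladder: stretches the high end
    "square":     x**2,
    "original":   x,                # the untransformed values
    "sqrt":       np.sqrt(x),       # down the ladder: pulls in the high end
    "log":        np.log(x),
    "-1/sqrt(x)": -1 / np.sqrt(x),  # negative sign preserves the order of the data
    "-1/x":       -1 / x,
}

# For right-skewed data, skewness drops as you move down the ladder,
# and can even flip negative if you go too far.
for name, values in ladder.items():
    print(f"{name:>10}: skewness = {skew(values): .2f}")
```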
Now, you don't wanna do this to create outliers, but if you have outliers, you can use these to push it back into a more symmetrical and more normal shape that allows you to do an analysis with the complete data set. And so here are our conclusions about exploratory statistics. First, it's good to get multiple perspectives on the data. It's also good to check for stability under different circumstances. And finally, exploratory statistics sets the stage for modeling.