Join Barton Poulson for an in-depth discussion in this video Exploratory graphs, part of Data Science Foundations: Fundamentals.
- [Voiceover] Once you have your data, the first thing you need to do is see what's there. The best way to do that is with exploratory graphs. Specifically, you can get a feel for the data, you can check the assumptions for your planned analysis, you can check for anomalies, and you can see if the data suggests something new and unexpected to you. All of this together lets you know whether your planned analysis is appropriate, and whether you are going to be able to justifiably reach the kind of conclusions that you want.
Now, one question is: why start with graphics? It's because graphics, because they are visual, are information dense. They communicate so much more, so much faster, in a way that humans are very good at, because humans are visual. Graphs are also the quickest way to check for the shape of the distribution, gaps, and outliers, all of which can have a significant impact on your analysis. Now in terms of exploration, there are a few things you want to do. You want to start with single distributions or univariate distributions, one variable at a time.
Then you look at joint distributions, or the associations between variables. You want to look for unusual cases, or exceptional values, as well as errors in the data. Another big one is missing data, where a value is simply not there. Now sometimes it doesn't mean anything. It can be called missing completely at random, where you can basically ignore it. There's also something called missing at random, where you can account for the missingness with the observed variables. And there's missing not at random, or what's called non-ignorable non-response.
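As a rough illustration of that first look at missing data, here is a minimal Python sketch that counts missing values per variable. The records, the variable names, and the `missing_counts` helper are all made up for the example, and `None` is assumed to mark a missing value. Counting only shows how much is missing; it does not tell you which of the three kinds of missingness you have.

```python
# Hypothetical records for illustration; None is assumed to mark a missing value.
rows = [
    {"age": 34, "income": 52000, "city": "Provo"},
    {"age": None, "income": 48000, "city": "Orem"},
    {"age": 29, "income": None, "city": None},
    {"age": 41, "income": None, "city": "Lehi"},
]


def missing_counts(records):
    """Count the missing (None) values for each variable across all records."""
    counts = {}
    for record in records:
        for name, value in record.items():
            # (value is None) is True/False, which adds as 1/0.
            counts[name] = counts.get(name, 0) + (value is None)
    return counts


missing = missing_counts(rows)
for name, n in missing.items():
    print(f"{name}: {n} of {len(rows)} missing")
```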
And the person who invented exploratory data analysis, John Tukey, he simply did it all by hand. And it's something that can still be done as a way of getting a very personal feel for the data. Bar charts are a great first step. They're for categories, categorical variables. They're easy to read, especially if you put them in descending order. So the most common group is here off to the left, next to the axis, and it goes down. Also, you can group them to look for associations between variables. Or another way to put it, differences between groups. Box plots, or box and whisker plots, are also great for quantitative variables, so a measured variable.
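The descending ordering described for bar charts can be sketched in a few lines of Python. This is only a text-based stand-in for a real chart, using the standard library; the `responses` data and the `sorted_counts` helper are invented for the example.

```python
from collections import Counter

# Hypothetical categorical variable for illustration.
responses = ["blue", "red", "blue", "green", "blue", "red"]


def sorted_counts(values):
    """Return (category, count) pairs, most common category first."""
    return Counter(values).most_common()


bars = sorted_counts(responses)
for category, count in bars:
    # One '#' per observation gives a rough text bar for each category.
    print(f"{category:<6} {'#' * count}")
```

In a plotting tool you would pass the categories and counts in this same order so the tallest bar sits next to the axis.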
These show the quartile values, so the median, the first and third quartiles, the minimum, the maximum, and outliers. That's one of their major purposes. Also, they can be grouped and you can even show several variables at once, as long as they're on the same or very similar scales. Next are histograms. These are also used, like box plots, for quantitative variables. They show the shape of the distribution. What's nice about them is you can also overlay graphics, like I've done here, to compare the shape to other possible forms.
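To make the numbers behind a box plot concrete, here is a small Python sketch that computes the quartiles and flags outliers. It assumes the common 1.5 × IQR convention for outliers, which the video does not specify, and the `box_summary` helper and the data are hypothetical. Different tools use slightly different quartile definitions, so the exact values can vary a little.

```python
import statistics

# Hypothetical measurements for illustration; 30 is deliberately extreme.
values = [2, 4, 4, 5, 6, 7, 8, 9, 30]


def box_summary(data, fence=1.5):
    """Compute the values a box plot typically draws.

    Assumes the common 1.5 * IQR rule for flagging outliers; other
    conventions exist.
    """
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - fence * iqr, q3 + fence * iqr
    inside = [x for x in data if low <= x <= high]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),   # smallest value that is not an outlier
        "whisker_high": max(inside),  # largest value that is not an outlier
        "outliers": [x for x in data if x < low or x > high],
    }


summary = box_summary(values)
for name, value in summary.items():
    print(f"{name}: {value}")
```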
Next, in terms of associations, one of the best things you can do is a scatter plot. Or in this case a scatter plot matrix, or matrices. This shows the association between several quantitative variables. This one also includes histograms for each of the three variables included. And I can tell you a matrix of scatter plots is much easier to read than a 3D scatter plot chart. So, as you're going through these various charts, you want to answer these questions. Do you have what you need to answer your questions? Are there clumps or gaps in the data? Are there exceptional cases? And are there errors in the data? The exploratory graphics are going to help you find good answers to each of these and prepare you for a more insightful analysis.
We can reach a few conclusions from this. First, exploration is always a critical first step in any good data analysis. Also, you want to use a method that is quick and easy. Use a tool that is well suited to it. Something you're comfortable with and that you can explore quickly. Finally, graphical exploration is a precursor to numerical exploration, which we'll talk about next.
- The demand for data science
- Roles and careers
- Ethical issues in data science
- Sourcing data
- Exploring data through graphs and statistics
- Programming with R, Python, and SQL
- Data science in math and statistics
- Data science and machine learning
- Communicating with data