There are many ways to examine data. Using Anscombe's Quartet, data visualization expert Matt Francis explains that sometimes using simple statistics can hide interesting features in the data that can only be seen by visualizing the data.
- Everybody's used to summary statistics, things like what's the average height of a class, or what's the average sales per customer. These numbers can be a useful summary of a data set, but they can also hide detail within the data. There is a danger that relying on summaries can lead to misleading or even incorrect answers. One of the best examples of that danger is this: Anscombe's quartet. It's a group of four data sets, I, II, III and IV. Each one consists of eleven pairs of values, x and y.
Now what's important about these numbers is that when we apply summary statistics to them, we get very similar values. The average x value for each data set is nine. The average y is about 7.5. The variance for x is 11; for y, it's about 4.12. The correlation between x and y for each data set is about 0.816, and if we applied a linear regression, that is, a line of best fit, to each of the four sets, it would follow the equation y equals 0.5 x plus three. So we could conclude from that that these data sets are pretty similar.
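To check those numbers yourself, here is a minimal sketch in Python using only the standard library. The transcript does not list the data, so the values below are the commonly published ones for Anscombe's quartet; the names `QUARTET`, `summarize` and `SUMMARIES` are made up for this illustration.

```python
from statistics import mean, variance

# Commonly published values of Anscombe's quartet (assumed; not in the transcript).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # sets I, II and III share their x values
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return means, sample variances, correlation and the least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": variance(xs),  # sample variance (divides by n - 1)
        "var_y": variance(ys),
        "corr": sxy / (sxx * syy) ** 0.5,  # Pearson correlation
        "slope": slope,
        "intercept": my - slope * mx,
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

if __name__ == "__main__":
    for name, stats in SUMMARIES.items():
        print(name, {key: round(value, 2) for key, value in stats.items()})
```

Printed to two decimal places, the four rows come out almost indistinguishable, which is exactly the trap being described.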
However, what happens if we visualize this data? First, let's plot x and y from data set I. We can see what appears to be a rough linear relationship, with a little variation around the line of best fit. Now this might be what we expected to see based on the summary statistics.
So what about the second set? Wow, that's quite different. Even though the summary statistics were virtually identical, the data points create a very neat curve that doesn't fit that linear relationship. We wouldn't have expected that just from seeing those raw numbers or, in fact, the summary data.
Let's take a look at set three. Now, this has got a very tight linear relationship, and the angle of that straight line is pretty much bang on the regression line, with this one exception: we have this very large outlier. Again, we wouldn't have seen that without visualizing the data. Finally, let's look at set four. It looks like, in this data, x remains constant except for this one exception up here.
All the others share the same x value, so that one point is what sets the regression line. So there you have it: four sets of data which, according to their summary statistics, are the same. But when we visualize them, they are completely different. We wouldn't have got that from just looking at the table of data, nor from performing summary statistics on those data sets; they gave the same result for all four. It's only by visualizing this data that we're able to see its true shape.
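To see those shapes without any charting software, a rough text scatter plot is enough. This is only a sketch: `ascii_scatter` is a hypothetical helper written for this example, and the values for sets II and IV are the commonly published ones, since the transcript does not list them.

```python
def ascii_scatter(xs, ys, width=40, height=12):
    """Render (x, y) points as a rough text scatter plot, with y increasing upwards."""
    x_min, y_min = min(xs), min(ys)
    x_span = (max(xs) - x_min) or 1  # avoid dividing by zero if every x is equal
    y_span = (max(ys) - y_min) or 1
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / x_span * (width - 1))
        row = round((y - y_min) / y_span * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return "\n".join("".join(line).rstrip() for line in grid)


# Commonly published values for sets II and IV (assumed; not in the transcript).
X_II = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y_II = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
X_IV = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
Y_IV = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

PLOT_II = ascii_scatter(X_II, Y_II)
PLOT_IV = ascii_scatter(X_IV, Y_IV)

if __name__ == "__main__":
    print("Set II:\n" + PLOT_II + "\n")
    print("Set IV:\n" + PLOT_IV)
```

Even at this resolution, set II traces an arc and set IV shows a single point far to the right of a vertical stack, neither of which the summary statistics hint at.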
Now the same can be true for any data set. Relying on summaries, tables, or numbers alone can hide the information within the data. We need to visualize it to fully explore, understand, and explain our data, and that is the power of data visualization.
The start of any data visualization depends on the data set and what you want to know. Broadly speaking, there are five main questions you can ask. How does one thing compare to another? How is this data related to that data? How is this data distributed? How is this data made up? And how does this data look on a map? So the first question to ask is: what type of answer do I want to find with my data? And then, depending on the type of data you have, determine what's the best data visualization for that combination.
Comparisons are where you want to see how one bit of data compares against another. So, for example, how do sales compare across regions? Or what are the top ten countries for life expectancy? The types of chart that are great for comparisons include bar charts for category data and line charts for time series data, when you have a date component. Highlight tables show both details and summary in a single chart and let you compare multiple dimensions across a measure. Now what all of these have in common is that it's easy to compare one value against another, but you can also compare it to the overall data set.
Relationships in data typically involve two or more measures, and then you examine how the dimensions affect that relationship. For example, you might be comparing height and weight, and then seeing how that relationship changes across countries, age groups, or maybe gender. If you're looking for patterns and outliers in that data, a great way to do this is a scatter chart, where the position of each data point relates to the measure values for the dimensions in the view.
Summary statistics don't always tell the whole story, and in those cases, you might need to look at the distribution of the data. For that, you're looking to see what kind of shape the data has. Is it clustered around a median, is it bimodal, does it skew towards the higher or lower values? The most common way of looking at a distribution is a histogram, which is a great way of looking at the summary of the distribution. Another way is a box and whisker plot, which allows for comparisons of distributions within a dimension. For example, you can look at the average sales across different product regions.
Part-to-whole relationships, or compositions, are where you want to see just how much of the data belongs to a particular dimension. For example, what percentage of our sales were due to the customer section? Now the most common of these is the much-derided pie chart, but there are also stacked area charts, bar charts, and tree maps. These are all great alternative ways of seeing the makeup of the data.
It can be useful to map geographic data, as long as there is an actual geographic aspect to that data. Now what I mean by this is you don't map it just because you can. If you're comparing states on, say, average wages, then that might be best shown in a bar chart. However, if when you map that out you find that one region of the country is much higher or lower than the others, that's interesting: there's a spatial element to that data. So you always want to experiment with the map to make sure it is the best choice. Sometimes it is, sometimes it's not.
The final visualization that you build will be determined by a number of factors, but a good starting point is always to ask yourself: what kind of question do I want to ask? And then, depending on the data, you can make the right choice.
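As a rough way to keep the five questions and their usual chart types together, here is a small lookup sketch in Python. The names `CHARTS_BY_QUESTION` and `suggest_charts`, and the `has_dates` flag, are assumptions made for this example, and the output is a list of starting points to try rather than a rule.

```python
# Candidate chart types for each of the five questions, as named in the walkthrough.
CHARTS_BY_QUESTION = {
    "comparison": ["bar chart", "line chart", "highlight table"],
    "relationship": ["scatter chart"],
    "distribution": ["histogram", "box and whisker plot"],
    "composition": ["pie chart", "stacked area chart", "bar chart", "tree map"],
    "geographic": ["map", "bar chart"],  # a map is not always the best choice
}


def suggest_charts(question, has_dates=False):
    """Return candidate chart types for a question type (hypothetical helper)."""
    if question not in CHARTS_BY_QUESTION:
        raise ValueError(f"unknown question type: {question!r}")
    charts = list(CHARTS_BY_QUESTION[question])
    if question == "comparison":
        if has_dates:
            # Stable sort: the line chart moves to the front, the rest keep their order.
            charts.sort(key=lambda chart: chart != "line chart")
        else:
            # Line charts only make sense here when there is a date component.
            charts.remove("line chart")
    return charts


if __name__ == "__main__":
    for question in CHARTS_BY_QUESTION:
        print(question, "->", suggest_charts(question))
```

For example, `suggest_charts("comparison", has_dates=True)` puts the line chart first, while the same call without dates leaves it out.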
- Visualizing comparisons
- Building bar and line charts
- Creating tree maps for long-tail data
- Optimizing your dashboard layout with small multiples
- Visualizing data distributions
- Visualizing data composition
- Visualizing geographic data