- [Instructor] So, it's obvious that you have to understand your data to some degree in order to visualize it. So in this video, we're going to cover some basic statistics concepts and two key things to know about your data. Now let me start by saying you don't have to be a statistician or a total math weenie to do this work, but you do need to understand some very basic concepts to reduce mistakes, to increase accuracy, and to create more compelling content. So there are really three key concepts that I want to teach you today.
The first one, maybe you remember this from grade school or high school, is mean versus median. So, let's talk about these. So, here I have some statistics from hockey players. This is the total points scored by hockey players in a season on a particular team. And you'll notice they go up in ascending order, six, seven, 13, 17, dadadadadadada. I go all the way to the right-hand side, and you will see one kid scored 517 goals. Now, these stats are completely made up except for that last one, that's an actual real number.
The number of goals, or rather the number of points, Wayne Gretzky scored one year when he was in youth hockey, unbelievable. So this is a great example of when you might use median instead of mean. So, what are these two things? The mean is when you add up all of the values in the list, and you divide by the total number of items in the list. It's the truest statistical average. It's going to bring you the average of all those numbers added together. The median on the other hand is literally just taking the number in the middle of the sorted list. When you have very consistent values, the mean is great because you're picking from a lot of numbers that are alike.
So you can find the real center or the real middle of those values using division, the actual average. But when you have outliers, like in this case, on Wayne Gretzky's youth hockey team, where he scored 10 times as many points as the next highest point scorer, the median is better because it really gets you the more common middle, right? What's really the average for this particular group without the outliers? So here are the numbers. And this will help illustrate it. The mean, the actual statistical average, is 57.
But look at the list. No one got 57 points or more except Wayne Gretzky. Everyone else was 50 or below. The median on the other hand is 25, and if you look at this list, and if you were to say the sort of average hockey player on this team got 25 points, it seems much more reasonable, because Wayne Gretzky isn't skewing it. So that's mean versus median. So the next thing to talk about is when to use actual numbers versus a ranked index versus percentiles.
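To make the mean-versus-median point concrete, here is a minimal Python sketch. The point totals are hypothetical stand-ins, not the figures shown in the video; only the 517 outlier is taken from the transcript. The idea is simply that one extreme value pulls the mean far from the typical player, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical season point totals (NOT the list from the video).
# Nine ordinary players plus the 517-point outlier mentioned in the transcript.
points = [6, 7, 13, 17, 22, 28, 31, 40, 50, 517]

# Mean: add up every value and divide by how many values there are.
avg = mean(points)

# Median: the middle of the sorted list (with an even count,
# the average of the two middle values, here 22 and 28).
mid = median(points)

# Mean of the same team with the outlier left out, for comparison.
avg_without_outlier = mean(points[:-1])

print(f"mean: {avg:.1f}")
print(f"median: {mid}")
print(f"mean without outlier: {avg_without_outlier:.1f}")
```

With these made-up numbers, the mean lands above every player except the outlier, while the median stays close to what a typical player scored.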
These aren't your only choices, but they're pretty common ones to think about. So let's look at some real numbers. Here we're looking at GDP, the gross domestic product, the entire size of an economy. So this is the GDP for the United States on top at almost $16 trillion per year. And then at the bottom is Eritrea, which is about $3 billion per year. So what does this mean? How do we make sense of it? The numbers themselves aren't that helpful to me.
They're just big numbers. It's hard for me to wrap my head around them. But the ranking is helpful, right? United States is number one. It's the largest economy in the world. Eritrea is ranked 161st. Now, yet again, I'm still not entirely sure what that means. Are there 8,000 countries? Are there 200 countries? Are there 161 countries? What do these ranks actually mean? And so the percentiles take it a step further than that. The United States is in the 100th percentile which means that 100% of countries are below the United States.
They have a lower GDP. Whereas Eritrea is in the 20th percentile, which means that 80% of countries have a higher GDP, and 20% of countries have a lower GDP than this country. It's a very helpful way of looking at these numbers. The next thing to think about is when to show change in numbers versus the actual numbers themselves. So let's look at GDP again. Here we have two numbers that are much closer to each other. We have the GDPs of Sudan on the top and Ethiopia below.
And so, these two countries are next to each other on the map. They have similar GDPs. But the question is, which country should I invest in? Where should I sell my goods and services? Based on these two actual numbers, I'm going to say Sudan. It's got a slightly larger economy, so maybe that's a better place to be. But if you look at the rate of change in GDP, Ethiopia is clearly a better place to put our dollars. So in this case, you want to look at the numbers. You want to look at the change, and it'll become obvious in many cases which one to focus on.
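The last two ideas, rank and percentile versus raw numbers, and change versus the numbers themselves, can be sketched in a few lines of Python. Everything here is an assumption for illustration: the GDP figures are invented, the function names are hypothetical, and the percentile uses an "at or below" definition so that the largest value lands in the 100th percentile, as in the description above. Other percentile conventions exist and give slightly different answers.

```python
def rank_descending(value, values):
    """1-based rank, where 1 is the largest value in the list."""
    return 1 + sum(1 for v in values if v > value)


def percentile_rank(value, values):
    """Percent of entries that are less than or equal to `value`."""
    at_or_below = sum(1 for v in values if v <= value)
    return 100.0 * at_or_below / len(values)


def percent_change(old, new):
    """Change from `old` to `new`, as a percent of `old`."""
    return 100.0 * (new - old) / old


# Invented GDP figures in billions of dollars, for five made-up economies.
gdp = [900, 450, 120, 60, 30]

for value in gdp:
    print(value, rank_descending(value, gdp), percentile_rank(value, gdp))

# Two invented economies: the first is larger, the second is growing faster.
larger_growth = percent_change(60, 63)    # 5.0
smaller_growth = percent_change(50, 55)   # 10.0
print(larger_growth, smaller_growth)
```

The raw numbers say the first economy is bigger; the rate of change says the second one is growing twice as fast. Which one to show depends on the question you are trying to answer.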
There are two more things to think about when you're thinking about how to understand your data again, in the most basic sense. The first one is sample size and methodology. So if someone is collecting data, and let's say they're doing a survey, it's really important to understand how many people did they ask these questions of, the sample size. What was the methodology? How did they phrase the questions? How were these questions asked? How was the data collected? It's all about the quality and reliability of the data. Now if you're not a data analyst, if you're not actually collecting the data, you're not necessarily responsible for all this stuff, but it may influence, A, the work that you do for sure, B, how you label your work, how you source your work and what you put in legends and in footnotes.
And finally, there's correlation versus causation. Now this one stumps a lot of people still to this day. There's a difference between the two. You can say two things are correlated, meaning that sometimes when one thing goes up, the other thing goes up, or they seem to sort of move in the same direction at the same time. But that doesn't necessarily mean that one causes the other. This is a really important concept to understand and to make sure, once again, that you're labeling your data properly, and that you're not claiming something that isn't real and that isn't really there in the data.
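Here is a small Python sketch of correlation without causation. The data is entirely made up: two series are both built from a third variable (temperature), so they move together perfectly even though neither one causes the other. The `pearson_r` helper is a hand-written version of the standard Pearson correlation coefficient, included so the example has no dependencies.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Made-up monthly temperatures: the hidden driver of both series below.
temperature = [10, 15, 20, 25, 30]

# Both series are constructed from temperature, not from each other.
ice_cream_sales = [2 * t + 5 for t in temperature]
sunburn_cases = [t - 8 for t in temperature]

r = pearson_r(ice_cream_sales, sunburn_cases)
print(f"correlation: {r:.3f}")
```

A chart labeled "ice cream causes sunburn" would be claiming something that isn't in the data. All the numbers support is that the two series rise and fall together.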
So you need to know your dataset, specifically. Now, again, you may not be the data expert on your project. You may be able to lean on others for deep expertise to help you with these things, but you do still need to know enough about your data to work with it. So first of all, you can think about what's the headline? What's the headline of this data? How do I sum up the thesis in one or two main ideas? And so, if you figure that out first, then that's a great start. Now, often there's more than one headline and more than one way of presenting the headline.
So I was creating a visualization, looking at partisanship in Congress, and the first thing I did was to think about the headlines. But one important thing I always do is I think about the headlines without the answers filled in because I don't want to introduce bias into my concepts. So for instance, I might start off by saying, the blank party is more or less partisan than the blank party. I know that that's sort of the headline that I want to get to, but I'm not going to fill in the answers because I don't want to assume anything. I want to look at the data for the actual answer.
This is the curiosity that started the investigation. I also wanted to know blank is the most partisan member of Congress. And I wanted to know, the average Congressperson votes with his or her party blank percent of the time, or that the State of blank is the most partisan in the country. As I looked at the data, I added more headlines to drive functionality in the experience that I was creating. Now this is a great lesson for this course. You really need to know your data well enough to spot mistakes.
No one else is going to do it for you, oftentimes. So, one example, I was looking through my data, and I found that people voted with their party only 20% of the time, 10 to 20% of the time. And that's wrong. Like, I know Congress well enough to know that that's silly. Of course, people vote with their party far more than that. So, I dug through the raw data. I dug through the code, and I found just one piece of backward logic. And of course, as soon as I switched that code, I got to the real numbers, that people vote with their parties 80 to 90% of the time.
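A sanity check like the one described above can be sketched in Python. The transcript doesn't say what the backward logic actually was, so the inverted comparison below is an assumed stand-in, and the vote data, function names, and the 50% plausibility threshold are all hypothetical.

```python
def party_loyalty_rate(member_votes, party_majority):
    """Share of bills on which the member voted the same way as the party."""
    matches = sum(
        1 for bill, vote in member_votes.items()
        if vote == party_majority[bill]
    )
    return matches / len(member_votes)


def buggy_loyalty_rate(member_votes, party_majority):
    """Same calculation with the comparison flipped (assumed example bug)."""
    mismatches = sum(
        1 for bill, vote in member_votes.items()
        if vote != party_majority[bill]
    )
    return mismatches / len(member_votes)


def looks_plausible(rate, minimum=0.5):
    """Domain check: the assumed threshold says members mostly vote with their party."""
    return rate >= minimum


# Made-up data: ten bills, and the member breaks with the party once.
party_majority = {f"bill_{i}": "yea" for i in range(10)}
member_votes = dict(party_majority, bill_3="nay")

good = party_loyalty_rate(member_votes, party_majority)
bad = buggy_loyalty_rate(member_votes, party_majority)

print(good, looks_plausible(good))
print(bad, looks_plausible(bad))
```

The check doesn't find the bug for you. It only flags a result that contradicts what you already know about the domain, which is the cue to go digging through the raw data and the code.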
The mean, the average, is 92%. Now if I hadn't known my data, if I hadn't known how Congress tends to behave, and I couldn't dig into the data enough to confirm that, I might have missed this very, very costly error. You also have to know your data enough to make sure that you're avoiding bias. You have to understand it and predict the bias and be able to dig through the data to find it so that you can avoid it. So for instance, if I knew that I was looking for something specific, like, I knew which party I expected to be more partisan, then I might be tempted to look at the data with that bias in mind, and odds are I'd find what I was looking for.
You also have to be open to the data being different from what you expect. You can't help having some bias, but you can actively look for different stories in your data. Try to find holes in your theories. Try to disprove your hypothesis to avoid bias. Check your work, but be ready to accept what the data is telling you. One example, I had expected that the partisanship rates would have varied more between the parties. I hadn't decided which I thought would end up being more partisan, but I thought there would be a lot more variability between the two.
But as you can see in the bar charts on the left, they are nearly identical. Percentage-wise, they're pretty much the same. I had also thought that certain states would be more partisan than others, but again, my bias was proven wrong. The parties are nearly identical. The states were different. The list of states that were more partisan was different from the list that I would have guessed. So I know my data enough to go in and dig around and make sure that my hypothesis that had been disproven in both cases was in fact incorrect.
So again, know your data. Know some very basic math skills. Know your data enough to spot the mistakes and avoid bias, and you'll stay on track.
- Describe the process by which individuals’ interests are incorporated into data visualizations.
- Differentiate the use of the Ws in data visualization.
- Explain techniques involved in defining your narrative when visualizing data.
- Identify the factors that make data visualizations relatable to an audience’s interests and needs.
- Review the appropriate use of charts in data visualizations.
- Define the process involved in applying interactivity to data visualizations.