Join Bill Shander for an in-depth discussion in this video Convert your data (the inevitable step you should always anticipate), part of Learning Data Visualization.
I think we can all agree that the world is not perfect. Your prom may not have lived up to your expectations. Your boss may not praise you frequency or enthusiastically enough, and sometimes maybe you get a flat tire even when it's raining out. But, I think overall, life works out just the way you needed to, most of the time, right? But, data visualization is going to be a huge disappointment to you in some ways, and that's in that the data almost never comes in a form that you need it in.
So even when you get data from your clients and they've already parsed it and filtered it, and deduped it and cleaned it, the fact of the matter is, it almost never comes in the form you need it in. So the fact is that it's part of your job when you're doing data visualization, it's a big part of the job in fact, to clean your data and to convert your data into forms you can use to do the job you need to do. In this movie, we're going to talk through a few of the ways that I find myself doing most commonly, you know, adjusting and converting and cleaning my data.
One of them is calculating indexes and ratios, which is a very core concept to understand. Another one is percentiles, aggregating data, and also regrouping data from the way you get it. And finally, converting from one form to another. For instance, from Excel or comma-delimited files into JSON or XML or MySQL, which are very common forms used for web programming. First, we're going to talk about calculating indexes and ratios. Most of the time, we find ourselves comparing apples and oranges data.
Calculating indexes and ratios are one way of taking data and converting it in a way that makes it easy to look at data side-by-side that isn't necessarily easy to do in the original form that it comes in. So what I have here is a bunch of data from the Bureau of Labor and Statistics. And starting off with the minimum wage, and so I have the minimum wage here from 1980 through 2013. And I also have some other data here, right, like the price of gas. How much do a gallon of gas cost in 1980 versus 2013? Or a dozen eggs or a loaf of bread or electricity.
So I'm just going to take these real quick and just generate a couple of line charts. So if I take the minimum wage, and I select the entire row and I generate a line chart. And you'll see more about how to generate line charts and other types of charts in the next movie. I can see that minimum wage has just pretty much gone up and up and up over the years, right? Sort of goes up, Congress doesn't change it for a while, goes up again, doesn't change et cetera. So there is an interesting looking graph already that is has the stepped form. Now if I do the same thing, let say with the price of gas, and generate a line chart.
I can see price going up and up and up. So, I can already get the sense especially if I start doing more of these that okay, from 1980 to now, the numbers have just gone up. And so, I might ask myself, well, what does that really mean? Does that help me at all understand minimum wage, and how it's changed, and what impact it has on real people's lives? Not so much. What I really need to do, is start to look at comparisons and ratios of that data. What I've created here, is a ratio of wages to gas. So I've taken the value of minimum wage, and divided it by the price of gas, this is for 1980, and I get the number of gallons of gasoline I could buy on one hour's work of minimum wage in 1980.
Done the same thing for all the other data points here, and again for every year. And so if I chart that, instead of the numbers for the price of gas or minimum wage or these other things going up and up and up, I can see how the go up and down. So for instance, the price of gas, I could buy for a lot of years more and more and more gas over those years. And I'm guessing this is probably maybe the 1970s where, you know, my ability to buy gas on minimum wage maybe started to come down and down and down. And then fiscal crisis 2008 went back up again, and then down, and down, and down, and down.
Price of bread, I could buy sort of generally over all less, and less, and less bread over the years with minimum wage. So in addition to basic ratios, there's also indexes, and indexes are really interesting. First I'll explain how you create an index, and then I'll explain why you might want to do that. If you take the value of the ratio, so here I have how much gas I can buy on minimum wage. Now I take that number B14, and then I divide it by the maximum value of the entire range. So I want to find highest value, and by the way Excel has a nice function for that max, so max value of this entire row of data, divide this by this, and I get an index.
So I'm sort of dividing the numbers by themselves, and I can find for instance, with the price of gas, the strongest year, the year that I could buy the most gas on minimum wage, is always going to be indicated by the one at the max value, was in 1999. Sure enough I look at the chart here in 1999, is where the best year was. The worst year, of course was right around here or maybe here. Then sure enough, if I look at the index values, I see much lower numbers down here than I see the beginning. So why would I do this? If you think about it, there is a couple of reasons but the biggest reason is that, I'm comparing things that have different values.
So if I look at electricity at 58 versus gas at a ratio of 2.79, I can't compare them next to each other because those numbers have no relation to each other. If I do an index, they're all related to the numbers, somewhere between zero and one. I'll show why. If I do a chart, of all of these ratios, and I put it on a single line chart, I'm going to get a chart that has all these lines, most of which are bunched here down at the bottom because they're all closer in value, right. The twos, and the six and three and the zeroes, they're kind of near each other here at the bottom.
But then I have electricity which is way out here. It's just on a different scale. So I can't make any sense of this. These are apples to oranges. But if I look at my indexes, and do the same thing, generate a line chart, now they're all in scale within each other. And now I can see them right next to each other. That the best year for bread was over here, and then it comes down and down and down. Electricity is here in purple. It went down and then up, and then down and then up, and then back down a little bit. Much easier to see and compare to each other because they're based on indices.
So the next task we're going to do is, calculating percentiles, which is something I do quite a lot when I'm converting data. And the main reason is that, it makes it a lot easier to compare numbers to each other as part of a whole. And I'll show you in the second example for why you might want to do that. So here we're looking at GDP data from every country in the world. So GDP means gross domestic product, and it essentially is the entire size of the economy. And the United States is of course the largest economy in the world at 16 point something trillion dollars per year.
And these are in rank order. And so you can see that I've also pulled out Germany here, which is 3.4 something trillion dollars a year. And they're ranked number four, right? You can see it here in the list. And Sudan at 58 something billion dollars per year, which is ranked down here at number 68. And so, I can look at these two numbers and I can compare them to each other and say, okay, 3.4 trillion is a lot bigger than 58 billion, but I don't know exactly what that means in relationship to all the other countries in the world, which is probably how it's going to be most interesting to me.
So, what I might want to do is look at them in rank order, and so again, four compared to number 68. That tells me something interesting, but again humans have a much easier time parsing numbers not just sort of random numbers, how does four compare to 68? But more importantly, how does four and 68, how do they compare to the entire list? So the entire list is 181 countries in total. So I want to know really where they are in relationship to the entire list. And that's where percentiles come in. So I'm going to calculate the percentile, which is going to tell me what percentage of countries fall above or below each value.
And I'll show you how to do it. So I'm going to take a formula and a formula is this. I take the number one, and then subtract the rank order, which is in this column here, divided by the total number of items. And so I can see that the United States is in the 99 point something percentile, meaning that 99% of the rest of the countries are below the United States in this rank list. Now I'm going to take this formula and I'm going to copy and paste it all the way down the list, but before I do that, I need to add a little dollar sign over here in front of the D and the one.
That's just telling Excel, don't change from this value, always compare to this value. Because what it's going to do is as I click and drag this down real quick I'll show you, it's going to always move the blue number down. So it's taking the right value for each row, but it's always going to be comparing to this total number. And so if I click and drag and essentially copy and paste this formula all the way down to the bottom of the list. All the way down to the poorest country on the list. Eventually, I should get down to 0%, meaning that 0% of countries are lower on this particular list than Tuvalu.
So here we see the numbers in decimal form. So the other thing I'm going to do is I'm actually going to format this. I'm going to just going to change this so that rather than showing me just general numbers, I'm just going to change it to a percentage with zero decimal places. And now I can very easily see the percentages. 99% of the countries are below the United States, 98% are below Germany, and if I again look at Sudan down here on my list, Sudan is in the 62nd percentile, meaning that 38% of countries have lower GDPs than Sudan.
So calculating percentiles is a really helpful task in converting your data to help you look at the data in terms of a portion of a whole. So now we're going to talk about aggregating data. The best data you can get in most cases is going to be raw data. Where you have the most granular data possible. Cause the more data points that you have, there is really no limit to the things you can do with it. You can do any aggregating and sorting and filtering of the data, for the project you want to do, either on the fly or ahead of time, before you create a stack visualization.
So I was going to be doing a visualization about partisanship in Congress, and the goal was to show which congress people were the most partisan and which ones were the least partisan. And so the data that I had available to me was, every single vote taken by every single person on every single bill in Congress for an entire year. There was hundreds of thousands of records. So each row here I have a person, each person ID means one person, each bill means one bill, so as I scroll down you can see how many rows of data there are.
And this is just 10,000 rows out of hundreds of thousands of rows, but this is a very large data set. It would've been very taxing on the server to be calculating this on the fly, when really all I needed to know was for the 500 and something people in congress who voted with or against their party the most or the least? So what I did is I created a script to go through each row of this raw data and, I'm just going to scroll down to where we have actual votes. And here we have yea's and nay votes, okay? And so the script goes through and says for this particular bill, this particular person voted yea.
And then the script would look at all of the votes for this particular bill, and say, okay, there are this many yeas, that were Democratic yeas, and this many yeas that were Republican yeas, and then based on that total, and based on how this person voted, it figured out whether they voted for or against their party on this particular bill. And it summed it all up essentially in one row for each person. So rather than having one row for every single vote, here I have, for these particular congresspeople, how many times they voted with their party, how many times they voted against their party, and this is actually for a partial year.
So it took hundreds of thousands of rows, and gave me, essentially, 540 rows of data. Much easier to work with, much easier for the server to handle when trying to create this visualization. Another common issue you'll come across when you get data from people, is that it comes in a format, in a grouping, a categorization that's really helpful and natural for them, and there purposes with the data, but it isn't necessarily that helpful for you. So, here we have the minimum wage data that we were looking at before. And this is actually in the form that is a very common that you'll see data, where you have one row of data for each item.
Kind of like the voting data we just looked at, right, for this category and this year, here's the amount. And then the other category are lower down so every row is its own thing so I can't really look at it side by side, and I can't necessarily do things with these numbers to compare them to each other as easily. So I might want to sort of flip them on their side and look at em in a different way. Now of course I can just copy and paste them and create extra rows and columns, and move things around manually or, I can use a great tool that's built into Excel call the Pivot Table. There's actually a great course here on lynda.com called Excel 2013 Pivot Tables in Depth, and I strongly recommend that you take a look at this.
It's a really in depth view of what pivot tables are, how they work and how you can create them. But I'm going to give you a quick introduction to it. So, for this data I'm just going to select all the data and take that data, and I'm going to go up here to the data menu item, and I'm going to say Pivot Table. And it'll automatically ask you know, it now knows what data I want to use for that pivot table. I'm going to tell it to put it into a new worksheet, I'm going to say okay. And then I get this thing here, which essentially this empty pivot table and this is part of that fine art and science at creating pivot tables; which you'll learn a lot more from in that other course.
But, I'm going to take category and just drag and drop it into my column labels. And so now you can see my column labels match the categories of data, which were rows before. And then I'm going to take my year and drag that into my row labels, and so now I have a row for every single year, and now I just want to fill in the data here. And by just dragging this amount into the values area, that's going to say bring my mount and put that into the values. And so it's a little hard to grock in this quick introduction to it.
But this is a very powerful way of being able to essentially as I said, take your data and sort of flip it on the side. Sort it, filter it, look at it from different directions, and get it into an order which makes it easier for you to work in to create your visualization. And a final task that you'll find yourself doing quite a bit in converting your data, is converting from one form to another. So very frequently, you'll get data in Excel, right, or a common delineated file, that's text that has commas in between all the fields. And you'll need to convert it into formats like JSON, XML, or MySQL which are very common formats for web programming.
And they are surrounded by curly brackets and sometimes straight brackets. We don't need to get into what this all represents right now, but this is that GDP data we were looking at earlier in JSON format. Now one of the things you can do is, if you need to convert from JSON to another format, the quickest and easiest thing to do if you don't have the tools available, or you're not quite sure what to do is just Google, convert JSON to XML. And you're going to find lots of tools and websites, including a lot of free resources for converting from one format to another. You might find yourself doing this more than you think, especially converting some of the stranger formats out there into other alternate formats.
You'll also be happy to know that a lot of modern programming languages, like PHP for instance, have functions built into them. So, for instance, PHP does not natively work with JSON, but you can, with one line of code, convert from JSON into a PHP array, which is a very common data format for working in PHP. You should expect to have to massage your data and in fact, if you can, I would learn to embrace it. Get to know the tools and the languages that are out there that'll make your life easier.
It's never a perfect and pretty process. But it actually can be fun and interesting. It can be a very insightful part of that data exploration process, that we're going to talk about in the next movie, because as you can convert to your data, you'll see and find new and interesting things. It is inevitable, so try to embrace it.
- Channeling your audience
- Understanding your data
- Determining the information hierarchy
- Sketching and wireframing your ideas
- Defining your narrative
- Using typography, color, contrast, and shape to convey meaning
- Making your visualization interactive