From the course: Ethics and Law in Data Analytics

Explore the COMPAS data set: Part 1


- Congratulations on getting this far in module one. Before you take your final assessment for this first module, we just have one more important thing to do, and that is we're going to get our lab started. Each module is going to have a lab at the end of it, and we're going to use the same data set, or a very similar data set; there are a couple of versions of it. So if you look at your instructions, they tell you that you should download a set of documents from GitHub, and you're going to have to download them as a ZIP folder. I've already done that. As you can see, this is what you should be looking at when you go to GitHub. You're going to have to create an account; it's free. There's a bunch of files associated with this project, so what you're going to do is go to Clone or download, download the ZIP, and then of course unzip it. And I would like you to pull up the CSV file that says compas-scores-two-years-violent, and let's get to work on that. Okay, so regarding this project: you're going to have several specific ethical and legal questions to answer about it in the next few modules. But right now, we just want to figure out what this data is and what it means. Another way to think about that is that we're getting the context. This is something that you do no matter what when you're a data scientist: you always have to get the context first, and then answer the specific questions. For our concerns today, we're also getting the ethical context. But let's start with the most general stuff, just exploring the data. This data set is about recidivism, which is how likely it is that someone is going to reoffend. Once someone's been in jail, we wonder: are they going to go to jail again? Are they going to commit another crime? This is a very fundamental problem of law enforcement, because they have to make decisions about how long to keep somebody. Should we keep them in jail for a long time? Do they not deserve to go to jail? So this data set is about that general problem, and we've got to figure out what's going on with it.
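(If you'd rather poke at the file in code instead of Excel, here is a minimal sketch in Python with pandas. The filename and location are assumptions: point the path at wherever you unzipped the GitHub download.)

```python
import pandas as pd

# A minimal sketch: load the unzipped CSV into a DataFrame.
# The filename/path is an assumption -- adjust it to wherever you saved the download.
df = pd.read_csv("compas-scores-two-years-violent.csv")

# A quick first look at the data before answering any specific questions.
print(df.head())
```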
So you can see there are three steps here, and each step is going to have a few questions attached to it. Let's just start with some really, really basic things about exploring the data. Here's a question: how many observations are in this table? In data science, an observation basically corresponds to a row. In this data set, and most data sets that you'll ever encounter, each time a new observation is made, a new row is created. So here we observed a new person, we made some observations about them, and that's a new row. The question is, how many observations are in this data set? Here's one way you can do it. The easiest way in Excel is just clicking on a column, and then down here it'll tell you: 4,744, which is extremely small. This doesn't really count as big data, but that's okay; we're just practicing here. Now, one thing: there might be some empty boxes, so you might want to go to a column that is the least likely to have empty boxes. So the ID column: it seems very unlikely that someone would have been counted as an observation without having an ID attached to them. So I'm going to click on that. Yep, still 4,744. Maybe just click on one other. Yep. But as you can see, some of the columns, like this one, I can already tell have some missing data. If I click on this, I'm going to get 4,476. Again, that's inconsequential here. But if you're dealing with a huge data set, and you start looking at a column that says you have four million data points when actually you have ten million data points, that can really throw things off for you. So that's the answer to that first question: number of observations, we've got 4,700 and something. How many variables? A variable corresponds here to a column, and we can count those easily in Excel, because of the A, B, C, D lettering. We've got to Z, then we've got to Z again, and then we've got two extra ones. So 26 plus 26 is 52, then 53, 54: 54 variables. Some data scientists call them attributes. There's a small technical difference, but it doesn't matter too much. And if you want to be precise, you could say that, well, this is the person.
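(The same two counts can be checked in code. This sketch assumes the DataFrame df loaded in the earlier snippet; the actual numbers come from your copy of the file, not from the snippet itself.)

```python
# Observations are rows, variables are columns.
n_rows, n_cols = df.shape
print(f"{n_rows} observations, {n_cols} variables")  # expect roughly 4,744 rows and 54 columns

# Non-null counts per column expose the "empty boxes":
# any column with missing data reports fewer than n_rows.
print(df.count().sort_values())
```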
Attributes are like characteristics of the thing that you're examining, the thing that you're observing. So you could say, well, this is the person, and this is their name broken into pieces. So really it's like we have 51 variables regarding this person. All right, which variables appear to be from collected data? Here's what I mean. There's a bunch of variables here, and as we're getting the context, I just want you to think about which of these pieces of data were simply collected. I would think name is collected, obviously. The age, that's going to be just collected data; they just get that from somebody. Gender too. So this is really basic stuff, but if you're getting frustrated with how basic this is, remember that getting context is extremely important: it makes it much more likely that when it comes to the detailed questions later, you're going to be successful. Now, which variables appear to be assigned by the user? These aren't collected from the world; these are things where the user just says, okay, now we're going to stick you with this. The ID appears to be assigned by the user. I'm looking through here; are there any other ones? There might be a couple, but that's what I mean by assigned by the user. Which variables appear to be generated by algorithms? So the system fed an algorithm data, it spit out a result, and that created a new variable. This data set is predicting recidivism, and I'm looking through here: this column says risk of recidivism. It appears that it says that for everybody, and then it gives each person a score. And then it breaks that score out by, it looks like, low, medium, high. So I would say that these two columns would be generated by an algorithm. Again, this is all just helping us understand what we have in front of us. Can any variables be safely eliminated? Now, you want to be careful here, because sometimes a variable seems completely unnecessary, and then later on it turns out to be helpful. So you don't want to just go through and start deleting stuff.
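(To make the same judgment calls in code, it helps to list every column name and peek at the score columns. The names decile_score and score_text are the ones the COMPAS files typically use for the algorithm-generated risk score and its low/medium/high band; treat them as assumptions and check them against your own download.)

```python
# List every variable so you can sort them into collected (name, age, sex),
# assigned (id), and algorithm-generated categories.
print(list(df.columns))

# The algorithm-generated pair: a numeric risk score and its low/medium/high band.
# Column names are assumed -- verify them against your own file.
print(df[["decile_score", "score_text"]].head())
print(df["score_text"].value_counts())
```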
But as a matter of fact, I already mentioned that we've observed there are fewer than 5,000 instances here, so it doesn't matter too much in this case. But if you have a data set that has millions of observations, and one of the variables is really text heavy, like a description of something, that's going to really slow down the processing. So for practice, let's see if there's anything that can be safely eliminated. Now, I'm a little bit suspicious of what's going on in BA and BB, because if you look, the column name is the same: two_year_recid. In itself, that doesn't tell us everything we need to know, because there are lots of entry errors; people could have meant to type three year and it autocorrected or something. But I am suspicious, especially as you look down here: it appears to be basically the same numbers. So what we're going to do is run a correlation. We're going to use the Excel formula CORREL to see how closely these columns are correlated. A one will tell us that they're perfectly correlated, that is to say they're the same; negative one means they're perfectly negatively correlated; and zero means they have nothing to do with each other. So if we get a number close to one, I'm just going to get rid of one of them to save space and processing speed. Let's go to Formulas, and we're going to need a statistical formula: CORREL, the first six letters of correlation. It asks us which arrays we want to correlate, so we select BA and check the correlation it has with BB. Let's see what we have. The correlation is one, which means they're perfectly correlated. So for whatever reason, some data scientist accidentally copied and pasted it; I don't know what happened exactly, but I do know that column BB tells us exactly nothing that column BA doesn't. So let's go ahead and get rid of that one.
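(The CORREL check and the column removal translate directly to pandas. When a header name is duplicated, pandas usually reads the second occurrence in as two_year_recid.1; that name is an assumption here, so confirm what your own file calls the two columns before dropping anything.)

```python
# Correlate the two suspiciously identical columns (Excel's BA and BB).
# The ".1" suffix is how pandas typically renames a duplicated header -- an assumption to verify.
corr = df["two_year_recid"].corr(df["two_year_recid.1"])
print(f"Correlation: {corr}")  # 1.0 means one column adds no information the other lacks

# If they are perfectly correlated, drop the redundant copy.
if corr == 1.0:
    df = df.drop(columns=["two_year_recid.1"])
```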
