From the course: Ethics and Law in Data Analytics

Explore the COMPAS data set: Part 2

- [Instructor] Okay, moving on. What is the likely meaning of the various columns? Now, when you get a dataset, what's supposed to happen is that you get a data dictionary with it, because if you look at these, these are not user-friendly names, right? A name like score_text isn't meant for the consumption of the general public. And here it gets even worse: there's a column called v_type_of_assessment, right? So this is not user friendly, and there should be a separate document, called a data dictionary, that tells you what each abbreviation stands for, describes in some depth what it means, and gives you the range of possible values, say, from a low of one all the way up to 10. But that document doesn't appear to be here, so we're going to have to do a little bit of sleuthing to figure out the answers if we need them.

Okay, so one way to do this is to look at the values, and that might help us figure out what the column means. For example, c_charge_desc. What does that mean? Well, when I look through here, I see aggravated assault with a firearm, felony battery. And so now I can go up and say, oh, this probably means the description of the charge, okay? And you can do that for a lot of columns; there's a lot of information you can get that way.

But you also have to be careful. If you look at, for instance, column W, all the data are M's and F's, which makes me think of males and females. So maybe that's a male/female category. But then the column title, c_charge_degree, doesn't really seem right if we're just talking about males and females. So I'm going to go ahead and run a filter on this column to see if there are any other values, because remember, we're only looking at a very small percentage of the rows right now. Let's see if there's anything besides M and F. I have put a filter on it here, and the filter displays all the values, and it looks like there are only M's and F's. Huh. Okay, I have a new guess, actually: for charge degree, I bet M and F stand for misdemeanor and felony. That's my guess now, and it seems like a pretty good one. I can't be 100% sure without a little more information, but that's what I'm going to go with. This is what you have to do when you get the context: you have to sleuth around and solve some mysteries.

Okay, another way to do this is by getting some information on the web. Let me do something that you probably don't actually need me to do. Let's say that you were at a loss for what the column DOB meant. Now, you can probably already guess what DOB means, so this is just an example, but here's how this could be useful. Let's go to Bing, our trusty search engine, type in DOB, and ask what kinds of abbreviations there are for DOB. Department of Buildings? Doesn't seem likely to make sense. Ooh, date of birth. Ooh, I like that one, that's good. It might also stand for (speaks German), but probably not in this context. I wish it stood for "do our best," which would be really great for the Cub Scouts of Canada, but I think it's probably date of birth. And you can check yourself by looking at the values: oh, okay, yeah, date of birth, that makes a lot of sense. It's probably not the Cub Scout thing.

Okay, another thing: you might actually know what a column literally stands for, but you might not understand enough about what it means.
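If you prefer to do this sleuthing in code rather than in Excel, the same filter trick takes two lines of pandas. This is a minimal sketch; the filename "compas-scores.csv" is an assumption, so substitute whatever copy of the dataset you are working with.

```python
import pandas as pd

# Load the dataset; the filename is an assumption -- use your own copy.
df = pd.read_csv("compas-scores.csv")

# The code equivalent of putting an Excel filter on the column:
# list every distinct value that appears in c_charge_degree.
print(df["c_charge_degree"].unique())

# value_counts() also shows how many rows carry each value.
print(df["c_charge_degree"].value_counts())
```

If only M and F come back, that supports the misdemeanor/felony reading over male/female.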
So, this column here I thought was pretty important, because the algorithm is trying its best to predict (I shouldn't say guess) the risk of recidivism. And recidivism can mean a lot of things, so I want to find out what it means in this context specifically. What actual marker are they using for "are you a recidivist?" If you get a parking ticket 20 years later, are you guilty of recidivism? What are they going with here? So, going back to our trusty search engine Bing: I know that this algorithm is called COMPAS and it's made by a company called Northpointe, so I'm going to see if they put out some information. Okay, look what Bing did for us: there's a PDF of frequently asked questions. Let's see what we have here. I'm just going to Control+F and look for the word recidivism. Okay, I've already got some things here. "COMPAS is scalable." Okay, this looks like it tells me what the recidivism thing is: the general recidivism risk scale was developed, blah blah blah, within two years of the intake assessment. So they limit it to two years, and it looks like if you get a parking ticket, you're not counted as a recidivist. It has to be something a bit more serious, like felonies or person offenses, and I'm guessing parking tickets aren't person offenses. Okay. So that answers the next question: is there any information available from the web that might help you?

Okay, so now let's turn to a second question. The things we just went over, getting the context, you should be doing no matter who you are: if you're a data scientist, you have to understand the data that you're working with. But in this course, of course, we're concerned especially with the law and ethics, so we're going to have a separate step here of getting the ethical context. Let's see if there are any variables that seem ethically important. Now, if you remember from early in module one, we know a little bit about ethics now. When we talk about whether something is ethical, whether a person is ethical, we're really talking about whether they're concerned with one of the five values: non-suffering, autonomy, equality, trust, and character virtue. Those are the things that really enter into the discussion when we're talking about ethics, and we want to know if any of the variables are ethically sensitive.

So, let's go back to our dataset. One of our ethical values is autonomy: the capacity that people have to make their own decisions, to live their own lives, to set their own goals, and to carry out those goals in the way that they see fit. And what the ethicists say is that this means we have a special obligation to honor people's autonomy, to not interfere in their lives. We have to respect that they're autonomous. So, I'm not saying it's bad that they're doing this, but when we talk about predicting how likely it is that somebody is going to make a bad choice, commit a crime again, and go back to jail, these questions are sensitive to someone's autonomy. We're making a prediction about what they're going to do before they've done it. I'm not saying we shouldn't care about that, or that we shouldn't try to use algorithms to predict. I'm just saying that these things need to be on our ethical radar screen; we need to be thinking that this is important from the perspective of autonomy.
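The two-year window can be made concrete in code. Here is a minimal pandas sketch of that definition; the column names compas_screening_date and r_offense_date are assumptions based on the public COMPAS release, and 730 days is used as an approximation of two years.

```python
import pandas as pd

# Column and file names are assumptions based on the public COMPAS release.
df = pd.read_csv("compas-scores.csv",
                 parse_dates=["compas_screening_date", "r_offense_date"])

# Days from the intake assessment to the new offense, where one exists.
days_to_reoffense = (df["r_offense_date"] - df["compas_screening_date"]).dt.days

# Under the two-year definition, only a new offense within roughly
# 730 days of intake counts as recidivism.
recid_two_years = days_to_reoffense.notna() & (days_to_reoffense <= 730)
print(recid_two_years.mean())  # share of defendants flagged under this rule
```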
Another thing that sticks out ethically, as I look through this, is a column I noticed that lists race, and there's another one that lists gender. Whenever we're talking about race and gender, protected classes, I start to wonder whether there's going to be some kind of inequality in these algorithms. Is the algorithm going to be harsher on women, or harsher on African Americans, or something like that? So, this could actually be pretty sensitive in terms of equality. Alright, that's most of the ethical context here. There's a second question you should probably ask yourself: was there any data collected that might give you some insight into ethics? Here, I don't have anything offhand, so we'll just move on to step three.

So now we've done the work of step one, getting the general context and understanding the data themselves, and step two, understanding what is ethically relevant about the data, and now we're moving on to step three: thinking about the relationships between the data. There are no rules for how to do this; you just have to play around and use your intuition, which hopefully is a little more informed now that we've done steps one and two. You want to start to notice how these data are related to each other, and because we're using Microsoft Excel and not Microsoft SQL, we'll use a pivot table. Pivot tables are very user friendly and easy to use. You can probably pick this up just from watching me, but there are other courses in this series that will help you, too.

So, for the pivot table, what I want to do is go to Insert, PivotTable, and it asks whether I want to include all the data at first, which I do, and I'll open it in a new worksheet. Okay, there's my pivot table, and if you look at the right side of the screen, there are options for whether you want to put things in rows or columns, whether you want to make things filters, and all the variables are listed there, okay? So, I'm interested in this question; when we have the multiple choice quiz there will be some other questions like this, but I'll just go through two with you here. I'm wondering: is there a significant discrepancy between the recidivism predicted for white defendants and for other races? That would be pretty interesting if we could figure it out quickly. We'll still have more work to do, but it will be interesting if we can tell that that's an issue. So, let's take Race and make that a row, and you see Race is populated as different row labels. And then, if you look back here, predicted recidivism is listed as decile_score. That appears to be how risky they think you are to commit crimes again, and if you look at the next column, score_text, they have it broken out by low, medium, and high. So it looks like one is the lowest you can go and 10 is the highest: probably a one-to-ten scale.
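For those following along in code instead of Excel, the same comparison can be sketched with a pandas pivot table. As before, the filename is an assumption; race and decile_score are the column names discussed above.

```python
import pandas as pd

df = pd.read_csv("compas-scores.csv")  # filename is an assumption

# Sanity-check the score range first: we expect deciles from 1 to 10.
print(df["decile_score"].min(), df["decile_score"].max())

# The code analogue of the Excel pivot table: average predicted-risk
# decile per race, with a count of defendants in each group.
pivot = df.pivot_table(index="race", values="decile_score",
                       aggfunc=["mean", "count"])
print(pivot)
```

A large gap between the group means here would be exactly the kind of discrepancy worth digging into further, though it would not by itself establish unfairness.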
