From the course: Ethics and Law in Data Analytics

Explore the COMPAS data set: Part 3

- [Instructor] We're going to find the variable decile_score and make that our column. Okay, now it's not populated with data yet, because we also want to make decile_score a value. And this is something that you might want to review, but I don't want the values to display as raw counts, I want them to be percentages of column totals. So each column always adds up to 100%, and what this tells you is, for each score, how the people who received that score break down by race. So 10, remember, that's the one where they're saying, we're really sure this person's going to reoffend. And if you look, three quarters of the people that they were sure were going to reoffend are African American, versus only 16% Caucasian. Now, I don't know what this negative one business is. I don't know if somebody meant to enter zero, or if there really is a negative one. I don't know what that means, and we'll have to figure that out later, but the Caucasians get all of that. And for one, which means a very low risk of recidivism, Caucasians are about half of the ones.
So we can convince ourselves of this further with a little table. Let's go ahead and put this into a table to make it a little easier to read. Okay, it's kind of busy here, so first of all let's enlarge it. Second of all, we might want to include some of these other groups later, but for now let's just ask it to tell us only about African American and Caucasian. Okay, so if you look at the data, blue means negative one, and if we remember, everybody that was negative one was Caucasian. Now, if you look at the trends in the data, and you can see this really nicely with the graph, you see that as they get more and more suspicious that you are going to reoffend, the percentage of those people who are African American grows.
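As a rough sketch of what "percentages of column totals" means, here is the same calculation outside of Excel, in plain Python. The rows are made-up sample values chosen only to illustrate the arithmetic, not figures from the real data set, and the function name `column_percentages` is just an illustrative choice.

```python
from collections import Counter

# Hypothetical sample rows of (race, decile_score). These are invented
# values for illustration, NOT rows from the actual data set.
rows = [
    ("African American", 10),
    ("African American", 10),
    ("African American", 10),
    ("Caucasian", 10),
    ("African American", 1),
    ("Caucasian", 1),
    ("Caucasian", 1),
    ("Caucasian", 1),
]


def column_percentages(rows):
    """Return {decile_score: {race: percent of that score's column total}}."""
    cell_counts = Counter(rows)                           # one count per (race, score) cell
    column_totals = Counter(score for _, score in rows)   # one total per score column
    table = {}
    for (race, score), n in cell_counts.items():
        table.setdefault(score, {})[race] = 100.0 * n / column_totals[score]
    return table


table = column_percentages(rows)
for score in sorted(table):
    shares = ", ".join(
        f"{race}: {pct:.0f}%" for race, pct in sorted(table[score].items())
    )
    print(f"decile_score {score}: {shares}")
```

Each score column sums to 100%, so a cell answers "of everyone given this score, what share belongs to this group," which is the reading used in the walkthrough.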
The Caucasian percentage, meanwhile, seems to be shrinking as we go down, so when we get to pretty sure you're going to reoffend, right, you remember this statistic: African American is very high, Caucasian is very low. So those are some of the things that we should be thinking about. And in answer to this question, is there a significant discrepancy between the recidivism predicted for white defendants and for other races, it appears that the answer is yes.
Now an important note: at this point, you are just exploring the correlations, the factual relationships in the data. You're already tempted, I can pretty much guarantee you, to draw some conclusions about this data. You might say, aha, I knew this algorithm was racist. We can't say that yet. We don't have enough information, and there's a lot of detective work that we have yet to do. All we've observed is that there is some differential treatment. There might be some reasons for that that aren't bad reasons, right? We don't know that yet. So all we can see is that these data have this relationship, and we know that for a fact. We don't know why the data have that relationship, and that's for a different time.
Alright, let's move on to the second question. Is there a strong relationship between prior arrests and the prediction of recidivism? And hopefully there would be, or at least I assume there would be, right? If someone's been arrested a bunch of times, I would assume that their chance of recidivism is higher. If they've broken the law 50 times, it's probably pretty likely they're going to break it a 51st time, right? So let's just make sure we're thinking correctly here. We're going to go back to the data and create another pivot table, and we will ask it to include all the data in a new worksheet.
Okay, so in this worksheet, I think it would be nice if we had the score text as our row, because if you go back to the data set here, the score text is the one that says whether the risk of recidivism is low, medium, or high. So I kind of like that as the row. There are no rules here, you just have to play around with the data until you find something that makes sense. So let's go to score text. Okay, score text is a row, so that's what we wanted: high, low, medium. Now, I would think that low would be under medium, but that's not the way this data is displayed. It's only three rows, so I guess it's not such a big deal. And for a value, we want to use the count of priors. So here is prior count. I don't know why it's been given a two, that's a mystery for a different time. And what I'd like to display is not the sum, but the average.
So let's see what our data is telling us. Okay, so this is in line with our prediction. You can see here that if someone has around six prior offenses, what this algorithm says, and this isn't saying anything different from human intuition, is that they're more likely to offend an additional time. Whereas if they've only been arrested about one and a half times, their chance of recidivism is much lower. So this is in line with our predictions, and you remember that medium and low have been switched, so in order it'd be six, almost four, and then one and a half, which makes sense.
So that concludes Module One. There are going to be some questions in the assessment that are similar to this, not these exact questions, but if this made sense to you, you shouldn't have very much trouble with those questions. So thank you.
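For anyone who wants to reproduce the second pivot table outside of Excel, the averaging step can be sketched in plain Python as follows. The rows are invented sample values picked to echo the pattern seen in the walkthrough, not real records, and `average_priors` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical sample rows of (score text, prior count). These are invented
# values for illustration, NOT rows from the actual data set.
rows = [
    ("High", 7),
    ("High", 5),
    ("Medium", 4),
    ("Medium", 3),
    ("Low", 1),
    ("Low", 2),
]


def average_priors(rows):
    """Return {score text: average prior count for that group}."""
    sums = defaultdict(int)    # running total of priors per group
    counts = defaultdict(int)  # number of rows per group
    for label, priors in rows:
        sums[label] += priors
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


averages = average_priors(rows)
# Print in risk order rather than alphabetical order, so Low sits under Medium.
for label in ("High", "Medium", "Low"):
    print(f"{label}: {averages[label]:.1f} average priors")
```

Listing the groups in an explicit order sidesteps the alphabetical high, low, medium ordering noted in the walkthrough.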
