In this video, learn how to plot categorical features to gain an understanding of their relationships with the target variable.
- [Instructor] We're going to pick up right where we left off in the last lesson. What we really want to understand in this lesson is the relationship between the different levels of our three categorical variables and the survival rate. This will tell us whether, for instance, women were more likely to survive than men, and it will give us an idea of which features are useful and which are not. So, we're going to use the same categorical plots that we used back in lesson three, and we're going to call that on the three features that we want to explore. This is the exact same code that we walked through before, but just as a reminder: we're looping through our three categorical features, that's cabin indicator, sex, and embarked; we're creating a new plot for each item in our list; and lastly, we're creating a categorical plot where we're using the feature name, plotting that against survived on the y axis, using the Titanic data set, and using point categorical plots. So let's go ahead and run this.

Now, if you see this future warning, recall back to the prior chapter where I talked about importing that warnings library that we used to filter out these future warnings. The future warning is just saying, hey, in a future release of this given package, and this package is SciPy, we're going to change something that you're using here. But I wouldn't worry too much about this.

Moving on to the categorical plots, just as a reminder, the points here indicate the survival rate for everybody at that level, and the vertical bar is the error bar based on the sample size at each level. Looking at this first plot, it says that people without cabins had a 30% survival rate, and those who did have cabins were around 66%. We saw this in the analysis above, but let's look at the sex feature. We see that more than 70% of the women survived, while only 20% of men survived, so it's clear that this feature has really powerful splitting power. Just these two features alone have quite a bit of power, as we can see from these categorical plots. So, based on cabin indicator and the gender of the passenger alone, you can start to imagine being able to get a really good idea of whether somebody survived or not. Again, this is the value in exploratory data analysis.

Now, let's look at the embarked feature. This feature has to do with where they boarded the Titanic: C is Cherbourg, Q is Queenstown, and S is Southampton. We see that there's some pretty clear separation power here; however, this is where we need to apply a little bit of critical thinking. It's unlikely that where they boarded caused them to survive or not; more than likely it's correlated with other features that are already being accounted for in our data. For instance, perhaps a higher ratio of men boarded in Southampton, or maybe more people that boarded in Cherbourg had cabins, and thus were more likely to survive.
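For reference, here is a minimal sketch of the plotting loop described above. The lesson uses its own Titanic CSV, so the DataFrame and column names here (titanic, cabin_ind, sex, embarked, survived) are assumptions; the seaborn sample dataset is used as a stand-in, with the cabin indicator derived from its deck column.

import seaborn as sns
import matplotlib.pyplot as plt

# Stand-in for the lesson's data: seaborn's built-in Titanic sample,
# with a cabin indicator derived from the 'deck' column (assumption).
titanic = sns.load_dataset('titanic')
titanic['cabin_ind'] = titanic['deck'].notnull().astype(int)

# One point plot of survival rate per categorical feature.
for col in ['cabin_ind', 'sex', 'embarked']:
    sns.catplot(x=col, y='survived', data=titanic, kind='point', aspect=2)
plt.show()

Each call to catplot creates its own figure, so you get one plot per feature, with the point marking the mean survival rate at each level and the vertical line its confidence interval.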
We can actually explore these hypotheses using pivot tables. Now, if you've ever used pivot tables in Excel, you know they're a great tool for exploring the relationship between multiple variables, and in Python there's a really nice built-in method within pandas. We can call that by just calling our data frame, and then the pivot_table method. What we need to pass in: we'll tell it to only look at the survived column, otherwise it'll look at all the columns; then for the index we'll use the sex column, which basically just tells us what the row labels will be; and then we'll tell it to make the column labels based on embarked. Lastly, we need to pass in an aggregation function. The default here is mean, but we just want count, because we just want to look at the distribution of where people boarded based on gender.

So, we can go ahead and run that, and this is telling us that 95 people that boarded in Cherbourg were male while only 73 were female. You can see that for Cherbourg and Queenstown the number of men versus women boarding is fairly close, but in Southampton more than twice as many men boarded as women. Given that we know men were much less likely to survive than women, this would explain why Southampton had the lowest survival rate of all the ports.

Next, let's look at the relationship between port and whether they had cabins or not. Our hypothesis would be that more people that boarded in Cherbourg had cabins, and that's why we saw a higher survival rate. So, we can just copy and paste this line of code down here, and all we have to do is change the index from sex to cabin indicator. If we run that, we can see that for Queenstown and Southampton there are drastically more people without cabins than with cabins. For Queenstown there are about fifteen times more people without cabins than with cabins, and for Southampton about three and a half times more. But when we look at Cherbourg, it's relatively close: only about 50% more people had no cabin versus having a cabin. Given that we know people that had cabins were much more likely to survive, this would explain why Cherbourg had a much higher survival rate.

So, if we go back and look at our plot, we have now explained that fewer people from Southampton survived because so many more men boarded there, and a much higher ratio of people survived from Cherbourg because so many people that boarded there had cabins.

So, in this section we've learned that cabin indicator and sex have a very strong correlation with survival, and they could be really useful features in a model. We also learned that embarked is not providing much information that isn't already covered by the other features in the model. Thus, it's redundant and not really useful to the model. We're going to use all of these learnings to clean up our categorical data in the next section.
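To recap, the two pivot tables walked through in this lesson can be sketched as follows. As in the plotting sketch above, the DataFrame and column names are assumptions based on the lesson, with seaborn's sample dataset standing in for the course's CSV.

import seaborn as sns

# Same stand-in setup as the plotting sketch above (names are assumptions).
titanic = sns.load_dataset('titanic')
titanic['cabin_ind'] = titanic['deck'].notnull().astype(int)

# Count of passengers by gender and port of embarkation.
print(titanic.pivot_table('survived', index='sex', columns='embarked', aggfunc='count'))

# Same table, but split by the cabin indicator instead of gender.
print(titanic.pivot_table('survived', index='cabin_ind', columns='embarked', aggfunc='count'))

Because aggfunc='count' simply counts rows, the survived column only serves to pick a single column to count; with the default aggfunc='mean', the same calls would return the survival rate per cell instead.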