From the course: NLP with Python for Machine Learning Essential Training

Using stemming - Python Tutorial

Using stemming

- [Instructor] Now that we've learned what stemming means, we're going to put it to use. We'll do this in two stages. First, we'll test out the stemmer on specific words to understand how it works. Then we'll apply the stemmer to the SMS Spam Collection data set to further clean up our data. So first, we'll import the nltk package. Then, from that nltk package, we'll call the PorterStemmer and store it as ps, so that we can call it later. So let's go ahead and run that cell. Let's use the dir function to see what attributes and methods are contained within this Porter stemmer. So run that, and there are a lot of different attributes and methods that you should explore, but the stem method is really the one that's used most commonly, and it's the one that we'll be focusing on here.
Let's look at how the stemmer will handle a couple of examples. Let's try out the grows, growing, and grow example that we've been looking at. So, we'll start with ps.stem, and then we just have to pass in the word that we want it to stem. So start with grows, and we're going to copy this and paste it two more times, and then we just have to replace the word that we actually want it to stem. So we'll add growing for the second one, and the last one we'll leave as just grow. Now because we have three function calls in the same cell, Jupyter notebooks will only print out the result of the last one, but we want it to print all of them. So we have to explicitly tell it to print each of these three. So, we'll add a print call for each of the three, and then we can run it. So that reduces them all to the proper root word, grow. So now these three words can be treated as the same word, rather than Python seeing them as three distinctly different words. So let's look at one more example.
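The first stage described above can be sketched roughly as follows. This is a minimal sketch that assumes NLTK is installed; the name ps follows the transcript, while the filter on the dir output is an addition here, used only to keep the listing short.

```python
import nltk

# Create the stemmer once and reuse it (the lesson stores it as ps).
ps = nltk.PorterStemmer()

# List the public attributes and methods; stem is the one used in this lesson.
# Hiding underscore-prefixed names is an addition for readability.
public_names = [name for name in dir(ps) if not name.startswith('_')]
print(public_names)

# A notebook cell only displays its last expression, so print each result.
print(ps.stem('grows'))
print(ps.stem('growing'))
print(ps.stem('grow'))
```

All three print calls should show grow, which is what lets the three forms be treated as one word.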
So we'll copy this code down. We showed in the slides how the stemmer isn't perfect, where it stemmed both meaning and meanness down to mean, even though they don't represent the same thing. However, let's look at a different example that could be a little difficult: run, running, and runner. You could see how all three of these might be reduced down to just run, even though the first two are actions and the last one describes a person. So let's run it and see what the stemmer actually does. So the stemmer can actually tell that the first two are different from the last one in some way: run and running both become run, while runner is left as runner. So stemmers certainly aren't perfect, but they still do a pretty good job of identifying words that have the same meaning.
All right, so now that we've learned a little bit about how the stemmer works on toy examples, let's apply it to the data set we've been using, the SMS Spam Collection data set. So let's start by importing the packages we need to read in the text and clean up our data. So we'll import pandas, the re package, and then string. Then we'll also set the same option that we set in previous lessons to display more of each text message. And then we'll store our stop words in a variable for use when we clean up the data. And then we'll read in the data and assign column names in the same way that we have previously. Now let's just print out the first five rows. And you'll see it looks just as we expected. Again, remember this is the raw text, not the cleaned version.
So now for the cleaning of the text. This function is exactly what we put together in the last chapter, but I just collapsed the three functions down into one. So we still have the same components. We remove punctuation the same exact way, we tokenize the text, and then we remove stop words. And then we use a lambda function to apply it to the data set. So let's just call the first five rows and let that run, to make sure it prints out what we expect.
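The combined cleaning step described above might look roughly like the sketch below. Several things here are assumptions rather than the lesson's exact code: the column names label, body_text, and body_text_nostop, the tiny hand-picked stop word set (the lesson uses a fuller list), and the two sample messages that stand in for the real data file so the sketch runs on its own.

```python
import re
import string

import pandas as pd

# Show more of each message when printing a data frame.
pd.set_option('display.max_colwidth', 100)

# ASSUMPTION: a small hand-picked stop word set, so nothing has to be downloaded.
stopwords = {'a', 'i', 'he', 'in', 'to', 'the', 'here'}

# ASSUMPTION: two sample messages stand in for the SMS Spam Collection file.
data = pd.DataFrame({
    'label': ['spam', 'ham'],
    'body_text': [
        'Free entry in 2 a wkly comp to win FA Cup final tkts',
        'Nah I dont think he goes to usf, he lives around here though',
    ],
})


def clean_text(text):
    # 1. Remove punctuation characters.
    text = ''.join(char for char in text if char not in string.punctuation)
    # 2. Tokenize on runs of non-word characters.
    tokens = re.split(r'\W+', text)
    # 3. Remove stop words.
    return [word for word in tokens if word not in stopwords]


# Lowercase each message, then clean it, one row at a time.
data['body_text_nostop'] = data['body_text'].apply(lambda x: clean_text(x.lower()))
print(data.head())
```

Each row of the new column holds a list of lowercase tokens with punctuation and stop words removed, which is the input the stemmer needs.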
So this is your tokenized list without any punctuation or stop words, just how we ended the previous chapter. Let's get into some new stuff. We saw that the ps.stem method is what stems each word. The column that we'll be operating on from this data frame is this tokenized list. So we'll want to iterate through the list, stem each word, and then return the stemmed version back to the list. This should be starting to sound familiar at this point. We'll again write our own function, using ps.stem within a list comprehension in order to stem each word. So let's call our function stemming, and it'll accept a tokenized list of text. Now, if we recall list comprehensions, the way that we'll define it is word for word in tokenized text, and we'll assign that to text. As written, this will just return each word for each word in tokenized text. In other words, it'll take each word in this list and just output that word unchanged. So all we need to do to stem it is, instead of returning the original word that was in the list, apply the stemmer to it and return the stemmed word. So that'll store all the stemmed words in a list called text, and then we'll just return text.
And then lastly, just like we did before, we're going to use a lambda function to apply this to our data. Let's store the result in a column called body text stemmed. We're going to run our lambda function on body text no stop, and we want it to apply the stemming function. So we'll write lambda x: stemming(x). So again, this will apply the stemming function to each row in our data frame. So we're going to run that and print out the first five rows.
Now it's worth noting that the stemmer won't do a great job with slang or abbreviations, so it's probably not a great fit for a text message data set. I'll call your attention to a couple of things. First, entry is changed to entri with an i, so that it also matches the plural, entries. The same thing happens with wkli.
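The stemming function and the lambda call described above can be sketched as follows. The underscore spellings of the column names are assumptions based on how they are spoken in the lesson, and the two token lists are stand-ins for the cleaned column rather than the full data set.

```python
import nltk
import pandas as pd

ps = nltk.PorterStemmer()

# ASSUMPTION: two token lists standing in for the cleaned column.
data = pd.DataFrame({
    'body_text_nostop': [
        ['free', 'entry', '2', 'wkly', 'comp'],
        ['nah', 'dont', 'think', 'goes', 'usf', 'lives', 'around', 'though'],
    ],
})


def stemming(tokenized_text):
    # Same shape as [word for word in tokenized_text],
    # except each word is passed through the stemmer first.
    text = [ps.stem(word) for word in tokenized_text]
    return text


# Apply the function to every row's token list.
data['body_text_stemmed'] = data['body_text_nostop'].apply(lambda x: stemming(x))
print(data.head())
```

Because stemming already takes a single argument, passing the function directly with .apply(stemming) would behave the same way; the lambda is kept here to mirror the lesson.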
Another one is on the second line, where lives is reduced down to live. So, now you've learned what stemming represents and how to actually apply it. Stemming helps us reduce the number of distinct words that our models are exposed to, and it explicitly groups together words with similar meaning. In the next two lessons, you'll learn about lemmatizing, which is a different way to accomplish the same goal.
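The point about reducing the number of distinct words can be checked on the toy examples from this lesson. This is only an illustrative sketch; the word list is assembled here from the examples mentioned in the lesson and is not part of the course code.

```python
import nltk

ps = nltk.PorterStemmer()

# Toy vocabulary assembled from the examples used in this lesson.
words = ['grows', 'growing', 'grow', 'run', 'running', 'runner',
         'meaning', 'meanness']

stems = [ps.stem(word) for word in words]

# Compare the number of distinct tokens before and after stemming.
print(len(set(words)), 'distinct words before stemming')
print(len(set(stems)), 'distinct stems after stemming')
print(sorted(set(stems)))
```

The output also shows both imperfections discussed in the lesson: runner stays separate from run, while meaning and meanness collapse into the same stem.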
