From the course: NLP with Python for Machine Learning Essential Training


Implementation: Tokenization


- [Instructor] Let's jump in where we left off previously. If you're just joining us, go ahead and re-run all the cells prior to this tokenize heading. Now that we've removed punctuation, we can move on to tokenizing our text. As we discussed previously, tokenizing is splitting some string or sentence into a list of words. We learned that you have to account for tricky cases in your strings, like if words are separated by special characters or multiple spaces. So we'll just use what we learned in our lesson about regexes, and combine that with the approach we learned in the last lesson, where we removed punctuation by writing our own function and then applying it to our data set using a lambda function, in order to tokenize our text. So you'll use the same re package that we used last time, and then we're going to go ahead and define our own function and call it tokenize, and it'll accept some text, and the first thing that we're going to do is, again, call this re.split function,…
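A minimal sketch of the approach described above: define a tokenize function built on re.split, then apply it to a DataFrame column with a lambda. The DataFrame and its column names here are assumptions for illustration, not taken from the course notebook.

```python
import re
import pandas as pd

def tokenize(text):
    # Split on runs of one or more non-word characters, which handles
    # both special characters and multiple spaces between words
    tokens = re.split(r'\W+', text.lower())
    # Drop empty strings produced by leading/trailing separators
    return [t for t in tokens if t]

# Hypothetical dataset with punctuation already removed
data = pd.DataFrame({'body_text_nopunct': ['I like NLP', 'tokenizing   splits  text']})

# Apply the function across the column using a lambda, as in the prior lesson
data['body_text_tokenized'] = data['body_text_nopunct'].apply(lambda x: tokenize(x))
```

Lowercasing inside tokenize is a common convenience so that "NLP" and "nlp" count as the same token; the filtering step guards against empty strings when a string starts or ends with a separator.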
