From the course: NLP with Python for Machine Learning Essential Training
Implementation: Tokenization - Python Tutorial
Implementation: Tokenization
- [Instructor] Let's jump in where we left off previously. If you're just joining us, go ahead and re-run all the cells prior to this tokenize heading. Now that we've removed punctuation, we can move on to tokenizing our text. As we discussed previously, tokenizing is splitting some string or sentence into a list of words. We learned that you have to account for edge cases in your strings, like words separated by special characters or multiple spaces. So we'll use what we learned in our lesson about regexes and combine it with the approach from the last lesson, where we removed punctuation by writing our own function and then applying it to our data set using a lambda function. We'll do the same here in order to tokenize our text. So you'll use the same re package that we used last time, and then we're going to go ahead and define our own function and call it tokenize. It'll accept some text, and the first thing that we're going to do is, again, call this re.split function,…
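The transcript cuts off before the function body is shown, so the following is only a minimal sketch of the approach described: a `tokenize` function built on `re.split`, applied to each text with a lambda. The split pattern `\W+`, the filtering of empty strings, and the `sample_messages` list are assumptions made for illustration, not the instructor's actual code or data.

```python
import re


def tokenize(text):
    # Assumed pattern: split on runs of one or more non-word characters,
    # so multiple spaces or special characters act as a single separator.
    tokens = re.split(r'\W+', text)
    # re.split can yield empty strings at the edges; drop them.
    return [token for token in tokens if token]


# Hypothetical sample data standing in for the course's data set.
sample_messages = [
    "ive been searching  for the right words",
    "free entry in 2 a wkly comp",
]

# Apply the function to every message with a lambda, lowercasing first.
tokenized = list(map(lambda x: tokenize(x.lower()), sample_messages))

print(tokenized[0])
# ['ive', 'been', 'searching', 'for', 'the', 'right', 'words']
```

If the texts live in a pandas DataFrame column, the same lambda could be passed to that column's `apply` method instead of `map`.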
Contents
- What are NLP and NLTK? (4m 7s)
- NLTK setup and overview (6m 15s)
- Reading in text data (11m 41s)
- Exploring the dataset (6m 56s)
- What are regular expressions? (4m 8s)
- Learning how to use regular expressions (8m 44s)
- Regular expression replacements (6m 3s)
- Machine learning pipeline (4m 45s)
- Implementation: Removing punctuation (9m 10s)
- Implementation: Tokenization (3m 37s)
- Implementation: Removing stop words (4m 2s)