From the course: Advanced NLP with Python for Machine Learning

Prep the data for modeling

- [Narrator] As a recap, we now know four different ways to capture the information in text data and then fit a model on top of it. We reviewed TF-IDF, and then we learned about Word2Vec, Doc2Vec, and recurrent neural networks. In this chapter, we're going to compare how well these techniques classify the text messages in our dataset as spam or ham. To expedite this process, we're going to clean and split our data once and save the results as separate datasets, so we don't have to repeat that work in each video. This also ensures that each model trains and is evaluated on exactly the same data. So let's start by reading in our data, converting the spam/ham label to a binary numeric label, and cleaning our data. Now let's split our data into training and test sets. I want to note that we're just using a single holdout test set for the duration of this course, rather than a test set and a validation set…
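
The steps the narrator lists (read, encode the label, clean, split, save) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the course's own code, which is not shown in this transcript. The sample messages, the column names `label` and `text`, the helper names, the 80/20 split, the seed, and the output file names are all assumptions made for the example.

```python
import csv
import io
import random
import re
import string
import tempfile
from pathlib import Path

# Hypothetical sample standing in for the course's spam/ham dataset.
RAW = """label,text
ham,"Ok lar, joking with you only"
spam,"WINNER!! Claim your FREE prize now"
ham,"Are we still meeting for lunch today?"
spam,"Urgent! Your account has been selected for a cash reward"
ham,"I'll call you later tonight"
ham,"Can you pick up some milk on the way home"
spam,"Free entry in a weekly competition, text WIN to enter"
ham,"Sorry, I missed your call"
ham,"See you at the gym at six"
spam,"Congratulations, you have won a holiday voucher"
"""


def clean_text(text):
    """Lowercase, drop punctuation, and split into word tokens."""
    no_punct = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return [tok for tok in re.split(r"\W+", no_punct) if tok]


def load_rows(handle):
    """Read labelled messages and encode the label as 1 (spam) or 0 (ham)."""
    return [
        {"label": 1 if rec["label"] == "spam" else 0,
         "tokens": clean_text(rec["text"])}
        for rec in csv.DictReader(handle)
    ]


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy with a fixed seed, then hold out a single test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def save_rows(rows, path):
    """Write one split to CSV so later steps can reload identical data."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["label", "clean_text"])
        for row in rows:
            writer.writerow([row["label"], " ".join(row["tokens"])])


rows = load_rows(io.StringIO(RAW))
train, test = train_test_split(rows)

# A temporary directory keeps the sketch self-contained; a real project
# would write to a fixed data folder instead.
out_dir = Path(tempfile.mkdtemp())
save_rows(train, out_dir / "train.csv")
save_rows(test, out_dir / "test.csv")
```

Fixing the seed and writing both splits to disk is what lets every model be compared on identical rows: each later step reloads `train.csv` and `test.csv` instead of re-running the cleaning and the shuffle.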
