In this video, learn how to create a TF-IDF matrix on a text corpus using Python.
- [Instructor] In this video, we will look at code examples for building a TF-IDF matrix. NLTK does not offer a simple function for building a TF-IDF matrix, so for this purpose we will use the scikit-learn library in Python. From scikit-learn, we import the TF-IDF vectorizer. We create a simple corpus with a list of sentences. We are keeping the corpus simple and small so we can view and understand the TF-IDF array easily. Next, we initialize the TF-IDF vectorizer. We also provide a stop-word setting so the vectorizer automatically removes stop words from this corpus before building the TF-IDF array. To create the TF-IDF array, we simply call the fit_transform method. Once this is complete, we print all the feature names, or words, from which the array was built. Next, we print the dimensions of the array. And finally, we print the array itself. Let us execute this code and review the results. We first see the list of tokens from the corpus. There are only seven tokens and the stop-words …
- Text mining today
- Reading text files using Python
- Cleansing text data
- Build n-grams databases for text predictions
- Preparing TF-IDF matrices for machine learning
- Scaling text processing for performance