In this video, discover techniques for tokenizing text in Python.
- [Instructor] The first extraction task usually performed on a corpus is tokenization. Tokenization is the process of breaking a stream of text down into its meaningful elements: words, terms, symbols, sentences, and paragraphs. Converting text into a set of tokens makes further cleansing of the corpus easier. The code for this chapter is available in the notebook named 03_XX Text Cleansing and Extraction. In the previous chapter, we used a corpus reader directly to both read text and convert it into tokens. In this example, we will use a specific tokenize method available in the NLTK library. We read the Spark-Course-Description.txt file into a raw text variable. We then use the word_tokenize method to convert it into a token list. Finally, we print the first 20 tokens. Let's run the code and see the results. We see that a total of 110 tokens have been identified in this file. The first 20 tokens are printed here.
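The snippet below is a minimal sketch of the steps described in this video, assuming NLTK is installed and the Punkt tokenizer models can be downloaded; the file name comes from the transcript, and the notebook's exact code may differ.

```python
# Sketch of the tokenization walkthrough described above.
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize relies on the pre-trained Punkt models; download them once.
nltk.download('punkt', quiet=True)

# Read the course description file into a raw text variable.
# The file name is taken from the transcript.
with open('Spark-Course-Description.txt', 'r') as f:
    raw_text = f.read()

# Convert the raw text into a list of word tokens.
token_list = word_tokenize(raw_text)

# Report the total token count and print the first 20 tokens.
print('Total tokens:', len(token_list))
print('First 20 tokens:', token_list[:20])
```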