Explore the contents and attributes of a text corpus object in this video.
- [Instructor] As discussed in the previous video, the corpus reader analyzes the input data and splits it into paragraphs, sentences, and words. It also supports various methods to view this extracted content. We will see examples of those in this video. First, we print the list of file IDs. There is one file ID per physical file read. In our case, we used only one file. We can then extract paragraphs from the corpus using the paragraphs command. Paragraphs are identified by a blank line separating the text. We print the number of paragraphs, which should be one in this case. Next, we extract the sentences in the corpus using the sents command. This actually gives you a list of lists: each sentence is made into a list of words, and those individual word lists make up the sentence list. We first print the total number of sentences in the corpus, which gives a value of five. We then print just the first sentence. This should print the list of words that forms the first sentence in the corpus. Finally, we print all the words in the corpus using the words command. This is a long list of all the words that are used in the corpus. Let us run the code now and see the output. There is only one file ID in the corpus, spark course description.txt. The total number of paragraphs in the corpus is one. The total number of sentences in the corpus is five. And the first sentence is broken up into a list of its words. The words in the corpus form a long list, and only the first part of that list is printed here.
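The data shapes described in the narration can be sketched with a small stand-in written using only the Python standard library. This is not the course's code: `TinyCorpus`, `SAMPLE`, and `sample.txt` are hypothetical names, the sample text is made up, and the regular expressions are a rough approximation of real tokenizers. The method names are chosen to mirror the corpus reader calls mentioned in the video; if the course uses NLTK's `PlaintextCorpusReader` (an assumption), the corresponding methods there are `fileids()`, `paras()`, `sents()`, and `words()`.

```python
import re

# Hypothetical sample text: one paragraph, five sentences (not the course file).
SAMPLE = (
    "Spark is a fast engine. It runs on clusters. "
    "It supports Python. It also supports SQL. This text is made up."
)


class TinyCorpus:
    """Illustrative stand-in for a corpus reader; holds {file_id: raw_text}."""

    def __init__(self, files):
        self._files = files

    def fileids(self):
        # One file ID per file that was read.
        return list(self._files)

    def paras(self):
        # Paragraphs are separated by a blank line.
        # Each paragraph is a list of sentences; each sentence is a list of words.
        result = []
        for text in self._files.values():
            for block in re.split(r"\n\s*\n", text.strip()):
                # Rough sentence split: whitespace that follows ., ! or ?
                sentences = re.split(r"(?<=[.!?])\s+", block.strip())
                # Rough word split: runs of word characters, or single punctuation marks.
                result.append([re.findall(r"\w+|[^\w\s]", s) for s in sentences])
        return result

    def sents(self):
        # A list of lists: every sentence is itself a list of words.
        return [sent for para in self.paras() for sent in para]

    def words(self):
        # One flat list of every word in the corpus.
        return [word for sent in self.sents() for word in sent]


corpus = TinyCorpus({"sample.txt": SAMPLE})
print("File IDs:", corpus.fileids())
print("Total paragraphs:", len(corpus.paras()))
print("Total sentences:", len(corpus.sents()))
print("First sentence:", corpus.sents()[0])
print("First words:", corpus.words()[:10])
```

With this sample, the counts match the ones quoted in the video (one file ID, one paragraph, five sentences), and the first sentence comes back as a list of word strings rather than a single string.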
- Text mining today
- Reading text files using Python
- Cleansing text data
- Build n-grams databases for text predictions
- Preparing TF-IDF matrices for machine learning
- Scaling text processing for performance