Join Barton Poulson for an in-depth discussion in this video, Text mining algorithms, part of Data Science Foundations: Data Mining.
- [Instructor] We can say there are two very general categories of algorithms for dealing with text. One does the intuitive thing and focuses on the meaning of what's being said. So, for instance, you'll have algorithms that identify parts of speech: this is a verb, this is an adjective, and so on. They identify sentiment: this is a positive statement, this is a negative statement. And they use the meanings of words, like the topics of a text, to analyze the text. That's pretty sophisticated processing. What's interesting, though, is that the other approach, the so-called bag of words, also works well in a lot of these situations.
These are methods that treat words simply as individual tokens of distinct categories, without even understanding their meaning. They could be shapes for all you know, or numbers. In fact, these methods turn the words into numbers. You lose the order, you don't look at the particular function of a word, you're simply counting how often it happens, and maybe what it happens next to. So let's say a little bit about that second one, the bag of words, because it sounds strange. Now interestingly, a lot of machine learning algorithms work this way. They break the text down into chunks, or tokens, and just analyze it like that.
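To make the counting idea concrete, here is a minimal Python sketch of a bag of words. It is an illustration, not something shown in the video: the function names `tokenize`, `bag_of_words`, and `cooccurrence` and the two sample sentences are made up for this example, and the tokenizer is a deliberately crude regular expression.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Count how often each token occurs; order and function are discarded."""
    return Counter(tokenize(text))


def cooccurrence(text):
    """Count adjacent token pairs, i.e. what each word happens next to."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical sample sentences with the same words in a different order.
doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

print(bag_of_words(doc_a))
print(bag_of_words(doc_a) == bag_of_words(doc_b))
print(cooccurrence(doc_a)[("dog", "chased")], cooccurrence(doc_b)[("dog", "chased")])
```

The two sentences say opposite things, yet they produce the same bag of words. Only the adjacent-pair counts tell them apart.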
So Naive Bayes is one such algorithm, and neural networks are another. You can also use k-means clustering, support vector machines, the common TF-IDF, which stands for term frequency-inverse document frequency vectorization, and so on. Sometimes you're simply marking whether a word is present in a document or not, that's binary presence, or you weight it by how frequently it occurs, that's the TF-IDF. Either way, you're still gonna get meaning out of what you're doing. Now, the more sophisticated meaning-based approaches get into the field of NLP, or natural language processing, which is a very big field.
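The difference between binary presence and TF-IDF weighting can be sketched in a few lines of Python. This is an assumption-laden illustration: the three sample documents are invented, the helper names are hypothetical, and the formula used here (term count divided by document length, times log(N / document frequency)) is just one common unsmoothed variant. Libraries often use smoothed or normalized versions that give different numbers.

```python
import math
from collections import Counter

# Hypothetical mini corpus, already lowercased and whitespace-separated.
docs = [
    "the movie was great great fun",
    "the movie was dull",
    "the food was great",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})


def binary_vector(tokens):
    """1 if the vocabulary word is present in the document, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]


def idf(word):
    """Inverse document frequency: log(N / number of docs containing word)."""
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(len(tokenized) / df)


def tfidf_vector(tokens):
    """Relative term frequency weighted by inverse document frequency."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) * idf(w) for w in vocab]


print(vocab)
print(binary_vector(tokenized[0]))
print([round(x, 3) for x in tfidf_vector(tokenized[0])])
```

In this sketch a word such as "the", which appears in every document, gets a TF-IDF weight of zero, while the binary vector still marks it as present.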
NLP is how your phone knows what you're saying to it. Now, I should specify, technically it's still not meaning. It's still a digital machine, it's still turning it into numbers, but it's taking a more nuanced approach. So for instance, this is where you get something like what's called a hidden Markov model, or HMM. This is where it's trying to get at changes between underlying states, inferring the hidden process behind what's observed. Or you can get to something like latent Dirichlet allocation, or LDA, which is a topic model: it tries to decide what the topics of a document are by forming unobserved groups that can be used to understand the text.
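As a rough sketch of the hidden Markov model idea, the Python below tags a three-word sentence with parts of speech using Viterbi decoding, the standard way to find the most likely sequence of hidden states. Everything specific here is an assumption for illustration: the two-tag set, the tiny vocabulary, and all of the probabilities are invented, not estimated from data, and the function name `viterbi` is simply a label for this example.

```python
# Hidden states (tags) and invented probabilities for a toy HMM.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.7, "VERB": 0.3}
trans_p = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
emit_p = {
    "NOUN": {"dogs": 0.5, "chase": 0.1, "cats": 0.4},
    "VERB": {"dogs": 0.1, "chase": 0.8, "cats": 0.1},
}


def viterbi(words):
    """Return (probability, tag path) of the most likely hidden sequence."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s][words[0]], [s]) for s in states}
    for word in words[1:]:
        step = {}
        for s in states:
            # Pick the previous state that makes arriving in s most likely.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emit_p[s][word], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


prob, tags = viterbi(["dogs", "chase", "cats"])
print(tags, prob)
```

The model never sees the tags directly. It only sees the words and infers the sequence of states most likely to have produced them, which is the sense in which the states are hidden.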
And to paraphrase a very well-known saying, we have here what can be called the unreasonable effectiveness of meaninglessness. What's very strange is that while natural language processing does accomplish a lot more, you can still get very useful things done even without treating the words as words, but simply as tokens or categories. The algorithms for mining text vary in their emphasis on meaning. Some place a lot of emphasis on it and try to model it with great care, others ignore it completely. Interestingly, the simple methods, like the plain old bag of words that simply indicates whether a word occurs or not, can be sufficient for certain tasks.
And the more complex methods are reserved for natural language processing, where the computer, for instance, is trying to understand what you're saying, infer your meaning, and answer your questions from it. Either way, you want to choose an algorithm that fits your goals and your task, and helps you get the insight that you need for your particular data science project.
Barton Poulson covers data sources and types, the languages and software used in data mining (including R and Python), and specific task-based lessons that help you practice the most common data-mining techniques: text mining, data clustering, association analysis, and more. This course is an absolute necessity for those interested in joining the data science workforce, and for those who need to obtain more experience in data mining.
- Prerequisites for data mining
- Data mining using R, Python, Orange, and RapidMiner
- Data reduction
- Data clustering
- Anomaly detection
- Association analysis
- Regression analysis
- Sequence mining
- Text mining