Join Barton Poulson for an in-depth discussion in this video Sequence mining algorithms, part of Data Science Foundations: Data Mining.
- [Instructor] When it comes to algorithms for sequence mining, there are a few different categories of choices. Let me show you some of the common ones, and they go by initials and acronyms: GSP, SPADE, PrefixSpan, and HMM. I'll tell you a little bit more about each of these. So, for instance, GSP, which stands for Generalized Sequential Patterns, is a very common approach. It's sort of a basic one, and it's very similar to the Apriori analysis that we use in association analysis, but in this case, it observes or respects the order of events.
And so it says if you have X, then Y, then Z, that's different than X, then Z, then Y, or different than X, Y, and Z simultaneously, whereas those things will be treated equivalently in a standard Apriori market basket association analysis. One of the things about GSP is it allows you to use a sliding window to determine when events are simultaneous or should be treated as simultaneous. Now the trick is, like Apriori, it has to make a lot of passes through the data. And so if you have a large data set, GSP can be a rather slow way of trying to look for sequences.
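To make that order-sensitivity concrete, here's a minimal Python sketch (toy data, hypothetical names, not the full GSP algorithm) that counts support for a pattern only when its items appear in the stated order:

```python
def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in `sequence` in that order
    (not necessarily contiguously)."""
    it = iter(sequence)
    return all(item in it for item in pattern)  # `in` consumes the iterator

# Toy purchase histories, one ordered list per customer (invented data).
sequences = [
    ["X", "Y", "Z"],
    ["X", "Z", "Y"],
    ["X", "Y", "W", "Z"],
    ["Y", "X", "Z"],
]

def support(pattern, seqs):
    """Fraction of sequences containing `pattern` as an ordered subsequence."""
    return sum(is_subsequence(pattern, s) for s in seqs) / len(seqs)

# Order matters: X -> Y -> Z is not the same pattern as X -> Z -> Y,
# even though a standard market basket analysis would treat them alike.
print(support(["X", "Y", "Z"], sequences))  # 0.5
print(support(["X", "Z", "Y"], sequences))  # 0.25
```

A full GSP implementation would also generate candidate patterns level by level, which is exactly where the repeated database passes come from.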
An alternative is SPADE, which stands for Sequential Pattern Discovery using Equivalence classes. You gotta be a little creative to get SPADE out of that. But SPADE is designed to be faster. What it does primarily is make fewer database scans by using what are called intersecting ID-lists. It starts with what's called a vertical ID-list in the database, scans through it once, and then sort of stretches it out into a two-dimensional matrix of the first and second item.
Then it can use the results of that as the first item and add on a third one and so forth. So SPADE is faster, but there are other options that may be even better. For instance, one common choice is what's called PrefixSpan. This is an example of what's called a pattern growth algorithm, as opposed to an Apriori-based algorithm, and it avoids the candidate generation part of Apriori altogether, focusing instead on a restricted portion of the data.
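Here's a small Python sketch of the vertical ID-list idea behind SPADE (toy data, all names hypothetical): a single scan builds a list of (sequence, position) pairs per item, and a temporal join of two such lists yields the ID-list for a two-item sequence without rescanning the database:

```python
# Toy database: sequence ID -> ordered list of items (invented data).
sequences = {
    1: ["A", "B", "C"],
    2: ["A", "C", "B"],
    3: ["B", "A", "C"],
}

# One scan builds the vertical layout: item -> [(sequence_id, position), ...]
id_lists = {}
for sid, seq in sequences.items():
    for pos, item in enumerate(seq):
        id_lists.setdefault(item, []).append((sid, pos))

def join(list_a, list_b):
    """Temporal join: occurrences of the second item that come after an
    occurrence of the first item in the same sequence. The result is the
    ID-list for the 2-sequence A -> B, built without rescanning the data."""
    return [(sid_b, pos_b)
            for sid_a, pos_a in list_a
            for sid_b, pos_b in list_b
            if sid_a == sid_b and pos_b > pos_a]

ab = join(id_lists["A"], id_lists["B"])   # occurrences supporting A -> B
print(ab)                                  # [(1, 1), (2, 2)]
print(len({sid for sid, _ in ab}))         # support count: 2 sequences
```

Longer patterns are grown the same way: the ID-list for A -> B can itself be joined with a third item's list, which is the sense in which the results of one step become "the first item" for the next.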
PrefixSpan, by the way, stands for Prefix-projected sequential pattern mining. What it does is, on the first pass, it finds some patterns in the database, splits the database into projected sub-databases, and then grows the patterns independently in each of those. It's a much faster algorithm, but it can be memory-intensive because it works on a lot of sub-databases simultaneously. Another choice that's in an entirely different category is HMM, which stands for hidden Markov models. And this is mostly looking at a different kind of question.
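A minimal sketch of the projection step PrefixSpan relies on (toy data, not the full algorithm): each frequent prefix gets its own sub-database of suffixes, and longer patterns are then grown inside each projection instead of generating candidates against the whole database:

```python
def project(seqs, prefix_item):
    """Projected sub-database: for each sequence containing `prefix_item`,
    keep only the suffix after its first occurrence."""
    projected = []
    for seq in seqs:
        if prefix_item in seq:
            suffix = seq[seq.index(prefix_item) + 1:]
            if suffix:                      # empty suffixes carry no patterns
                projected.append(suffix)
    return projected

# Toy sequences (invented data).
sequences = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"], ["c", "a"]]

# Frequent items inside the "a"-projection become 2-patterns a -> x,
# and each projection can be projected again to grow patterns further.
db_a = project(sequences, "a")
print(db_a)  # [['b', 'c'], ['c', 'b'], ['c']]
```

Keeping one projection per frequent prefix in memory at the same time is exactly where the memory pressure mentioned above comes from.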
The other ones are trying to find this, then this, then this. Hidden Markov models are looking for switches in state conditions, or you might want to say qualitatively distinct patterns of behavior. Is a person switching from one mode of doing something to a different mode of doing something? Or is the economy switching in some way? Now what's really neat about this is it's easy to test specifically framed hypotheses, to see how well your theory about states matches the data, using this particular approach. Now, in addition to these legitimate approaches and several other related ones, I will mention that there are some hacks.
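To illustrate the state-switching idea, here's a toy two-state hidden Markov model in plain Python (all states, probabilities, and observations invented for the example). The Viterbi algorithm recovers the most likely sequence of hidden "modes" behind the observed behavior:

```python
import math

# Hypothetical model: hidden behavioral modes, observed activity levels.
states = ["calm", "active"]
start = {"calm": 0.6, "active": 0.4}
trans = {"calm":   {"calm": 0.8, "active": 0.2},
         "active": {"calm": 0.3, "active": 0.7}}
emit  = {"calm":   {"low": 0.7, "high": 0.3},
         "active": {"low": 0.2, "high": 0.8}}

def viterbi(obs):
    """Most likely hidden-state path for an observation sequence."""
    # Log-probabilities avoid numeric underflow on long sequences.
    v = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda p: v[-1][p] + math.log(trans[p][s]))
            scores[s] = v[-1][best] + math.log(trans[best][s]) + math.log(emit[s][o])
            ptr[s] = best
        v.append(scores)
        back.append(ptr)
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for ptr in reversed(back):       # trace the best path backwards
        path.append(ptr[path[-1]])
    return list(reversed(path))

obs = ["low", "low", "high", "high", "high", "low"]
print(viterbi(obs))  # ['calm', 'calm', 'active', 'active', 'active', 'calm']
```

The decoded path shows exactly the kind of answer an HMM gives: not "this item, then that item," but "the person switched modes here, and switched back there."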
And so these aren't, properly speaking, algorithms for sequential analysis, but they allow you to make sense out of data that has some order in it. Number one is decision trees. And even though we use those for a lot of other things, for categorizing between two groups or something like that, if at least one of the variables in your data set measures previous conditions, so you actually capture a little bit of before and after, then this gives you a short-term time perspective that can be incorporated in making the decisions in the decision tree.
Another one is logistic regression, in which you model a binary outcome. And like with decision trees, if at least one of your variables is measured at t minus one, that's time minus one, or the time step before, then you can get a really nice, parsimonious model of the current outcome that's easy to interpret. Many of the most common algorithms in sequence mining are based on Apriori association analysis. On the other hand, some of them serve different tasks: HMM, or hidden Markov models, instead test for state changes.
And either way, depending on what you're trying to get out of your data, even basic hacks, sort of adapting or misusing common procedures like decision trees and logistic regression, can still provide useful insight into your data while capturing some of the sequential elements.
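As a sketch of the logistic-regression hack (toy data, a hand-rolled fit rather than any particular library): adding a t-minus-one copy of the outcome as a predictor turns a plain sequence into supervised rows, and the lag coefficient then tells you how strongly yesterday predicts today:

```python
import math

# Hypothetical daily log: 1 = the user was active that day (invented data).
days = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1]

# The "hack": each row carries a lagged (t - 1) copy of the outcome,
# so the model sees a little bit of before-and-after.
X = [[1.0, float(days[t - 1])] for t in range(1, len(days))]  # [bias, lag-1]
y = [days[t] for t in range(1, len(days))]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Plain batch gradient ascent on the logistic log-likelihood.
w = [0.0, 0.0]
for _ in range(5000):
    grad = [0.0, 0.0]
    for xi, yi in zip(X, y):
        p = sigmoid(w[0] * xi[0] + w[1] * xi[1])
        grad[0] += (yi - p) * xi[0]
        grad[1] += (yi - p) * xi[1]
    w[0] += 0.05 * grad[0] / len(y)
    w[1] += 0.05 * grad[1] / len(y)

# w[1] is the lag effect; for this toy sequence the fit converges
# near [1.10, -0.92], i.e. a 1 yesterday makes a 1 today less likely.
print([round(c, 2) for c in w])
```

The same lagged-feature trick is what lets a decision tree pick up short-term sequential structure: the t-minus-one column simply becomes one more variable it can split on.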
Barton Poulson covers data sources and types, the languages and software used in data mining (including R and Python), and specific task-based lessons that help you practice the most common data-mining techniques: text mining, data clustering, association analysis, and more. This course is an absolute necessity for those interested in joining the data science workforce, and for those who need to obtain more experience in data mining.
- Prerequisites for data mining
- Data mining using R, Python, Orange, and RapidMiner
- Data reduction
- Data clustering
- Anomaly detection
- Association analysis
- Regression analysis
- Sequence mining
- Text mining