Complete a predictive analytics exercise in Python to decide if a chat window needs to be offered to a website visitor based on their propensity to buy.
- [Instructor] Now, we're going to implement the use case we have been talking about: predicting the propensity of visitors to your website in real time. When visitors come to your website, they start checking out different links as they explore the product. What you want to do is, in real time, based on their actions, predict their propensity to buy and decide whether you want to offer chat or not. The exercise files for this one are available under the folder 02_05.
The data file is the browsing.csv file that contains data for this exercise, which we're going to see very soon. And the code is in the propensity notebook. Let's go and look at the data file that we are talking about. This data file contains information about all past sessions by different users. It starts with a session id, then it has a number of boolean variables, which will be our feature variables. These boolean variables carry a one or zero based on the action performed by the visitor.
A one in the images column means the visitor viewed various images of the product. A one in the reviews column means the visitor viewed reviews for the product. Similarly, we have FAQ, specs, shipping, bought_together, comparison of products, and so on. Finally, there is the target variable, which indicates whether the visitor actually bought the product or did not buy the product. So this is going to be our data set that we're going to use for building the model.
Remember that this is a very small data set, just for the purpose of this exercise. In the real world, you would be using a really large data set if you want to get really accurate predictions. So let's switch to the notebook. This is the propensity notebook file, and I'm going to walk you through the code and explain what we're doing here. I'm starting off by importing a number of Python libraries here, and then I import the browsing.csv into a dataframe called prospect_data.
Then I just look at the data types to make sure that the data has been loaded correctly. Next, one more data check: I look at the top five records to make sure that the data is showing correctly once more. Then I do a describe on the dataframe to look at the statistics for the various columns. This is to make sure, once again, that the data is not skewed in any way and that it conforms to what you expect it to be. Next, I perform correlation analysis.
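The loading and inspection steps described above can be sketched as follows. The real exercise reads browsing.csv from the 02_05 folder; to keep this snippet self-contained, a tiny inline sample with the same kind of columns stands in for the file.

```python
import pandas as pd

# Toy stand-in for pd.read_csv("browsing.csv"): same style of columns
# (a session ID, boolean action flags, and the "buy" target).
prospect_data = pd.DataFrame({
    "session_id": [1, 2, 3, 4],
    "reviews":    [1, 0, 1, 0],
    "warranty":   [0, 0, 1, 0],
    "buy":        [1, 0, 1, 0],
})

print(prospect_data.dtypes)       # check that columns loaded with expected types
print(prospect_data.head())       # eyeball the top records
print(prospect_data.describe())   # summary statistics to spot skewed columns
```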
It is very important to perform correlation analysis between the target variable and the feature variables to make sure that there is some signal in the data, that some prediction is possible from the feature variables to the target variable. So let's look at some of the examples here. For example, reviews has a very high correlation of about 40% against the buy variable. Similarly, compare_similar products has 20%, and warranty 17%.
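A minimal sketch of that correlation check: correlate each candidate feature with the buy target and look for meaningful signal. The column names follow the ones mentioned in the walkthrough, but the data here is a small toy stand-in, so the numbers will not match the course's.

```python
import pandas as pd

# Toy data with the feature columns named in the narration.
df = pd.DataFrame({
    "reviews":         [1, 0, 1, 1, 0, 0, 1, 0],
    "compare_similar": [1, 1, 0, 1, 0, 0, 0, 0],
    "warranty":        [0, 1, 1, 0, 0, 0, 1, 0],
    "faq":             [0, 0, 1, 0, 1, 0, 0, 1],
    "buy":             [1, 0, 1, 1, 0, 0, 1, 0],
})

# Correlation of every feature against the target, strongest first.
correlations = df.corr()["buy"].drop("buy")
print(correlations.sort_values(ascending=False))
```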
There are other variables that do not have a lot of significant correlation. Now what I'm going to do is reduce my feature variable data set to only those features that have some good correlation. I'm creating a predictors dataframe that only contains these columns: reviews, bought_together, compare_similar, warranty and sponsored_links. And I'm creating a target dataframe with just the buy variable. Once my data is ready, I'm going to go and do the training and testing split.
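Reducing the dataset to the well-correlated features might look like this. The five predictor columns are the ones listed in the narration; the data itself is again a small placeholder for browsing.csv.

```python
import pandas as pd

# Placeholder data with the columns named in the walkthrough.
df = pd.DataFrame({
    "session_id":      [1, 2, 3, 4],
    "reviews":         [1, 0, 1, 0],
    "bought_together": [0, 1, 1, 0],
    "compare_similar": [1, 0, 1, 0],
    "warranty":        [0, 0, 1, 0],
    "sponsored_links": [0, 1, 0, 0],
    "faq":             [1, 0, 0, 1],   # low-correlation column, dropped below
    "buy":             [1, 0, 1, 0],
})

# Keep only the five features that showed good correlation with "buy".
feature_cols = ["reviews", "bought_together", "compare_similar",
                "warranty", "sponsored_links"]
predictors = df[feature_cols]
targets = df["buy"]
print(predictors.shape, targets.shape)
```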
For this, I'm going to be using the train_test_split method from scikit-learn, and I'm splitting in the ratio of 70 to 30. And I'm going to check if the sizes are what I expect them to be: 350 to 150 sounds about right. Next, I move on to model building. I'm using the naive_bayes algorithm available in the sklearn library, the Gaussian naive_bayes. I first create the naive_bayes classifier, then build a model using the fit method, applying it on the training predictors and the training targets.
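The 70/30 split and the Gaussian naive Bayes fit can be sketched like this. Synthetic 0/1 data stands in for the five predictor columns, sized at 500 rows so the split sizes match the ones mentioned in the narration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic 0/1 features standing in for the five predictor columns.
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(500, 5))
y = (X.sum(axis=1) > 2).astype(int)   # a learnable toy target

# 70/30 split, as in the walkthrough.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))   # 350 150

# Create the Gaussian naive Bayes classifier and fit it on the
# training predictors and training targets.
classifier = GaussianNB()
model = classifier.fit(X_train, y_train)
```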
Then I do a prediction on the test data, and see what predictions I'm getting on it. How do we verify the accuracy of the predictions? First I'm going to do a confusion matrix to look at how the yeses and nos line up. Then I can also do an accuracy score to see what the overall accuracy is. I get an accuracy of 72%, which is pretty good given that the data size is only about 400 records. In real life, you would definitely use a much larger data set, possibly in the millions. Now, instead of just predicting a one or zero, I want to predict the probability that somebody will buy, and I can do that by using a method called predict_proba.
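The evaluation steps described above, confusion matrix, accuracy score, and class probabilities via predict_proba, look roughly like this. The data is again synthetic, so the 72% accuracy from the course will not be reproduced exactly.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the browsing data, with a learnable relationship.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 5))
y = (X[:, 0] | X[:, 1]).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = GaussianNB().fit(X_train, y_train)

# Hard 0/1 predictions on the held-out test data.
predictions = model.predict(X_test)
print(confusion_matrix(y_test, predictions))   # how yeses and nos line up
print(accuracy_score(y_test, predictions))     # overall accuracy

# Probability of each class (no-buy, buy) instead of a hard label.
probabilities = model.predict_proba(X_test)
print(probabilities[:3])
```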
That is going to give me the probability of whether the user is going to buy or not, and, in this case, it is giving me about a 34% probability. So I'm going to use this to do my real-time predictions. Here is how the real-time prediction works: let us say a new visitor comes in and starts browsing your website. You start collecting data about the five important variables we talked about: whether the particular visitor viewed the reviews, looked at bought_together, compare_similar, warranty, and sponsored_links.
And, based on this data, I'm going to create an array of all the ones and zeros as you can see here. So I start off with a new visitor. He just came in, he has not done any activity on the website so all the variables are zero. So I predict his base propensity, and that base propensity comes out to about 4%. Then, let us say, he goes and checks out similar products. So I put a one there for similar products, again check out his propensity.
Now it comes to close to 10%. Next, let us say, he is going and checking reviews. So I put one for checking reviews, do my prediction again, it comes to 57%. So, as you can see, as the user performs various actions on the website, I keep recomputing the propensity, and if the propensity exceeds a pre-defined threshold, maybe 50%, maybe 70%, that depends on your business, then you can decide to offer a chat window to that particular visitor asking him if he needs any more help.
So, remember that as these ones keep coming in, the propensity does not have to always go up; it might even go down. It all depends on the data. So you can then, at any point, decide when you want to offer a chat window.
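The real-time scoring loop described above can be sketched as follows: as the visitor performs actions, flip the matching flag to one and re-score the session, offering chat once the propensity crosses a business-defined threshold. The model here is trained on toy data, so the 4%, 10%, and 57% figures from the walkthrough will not be reproduced; the feature order follows the narration, and the 50% threshold is just an illustrative assumption.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Feature order, matching the five predictors from the walkthrough.
FEATURES = ["reviews", "bought_together", "compare_similar",
            "warranty", "sponsored_links"]

# Train on toy data; the real exercise would use the browsing model.
rng = np.random.default_rng(7)
X = rng.integers(0, 2, size=(400, 5))
y = (X[:, 0] & (X[:, 2] | X[:, 3])).astype(int)
model = GaussianNB().fit(X, y)

def buy_propensity(state):
    """Probability of the 'buy' class for the current session state."""
    return model.predict_proba(state)[0][1]

# New visitor: no activity yet, so all flags are zero.
visitor = np.zeros((1, 5), dtype=int)
print(f"base propensity: {buy_propensity(visitor):.2f}")

# Visitor checks out similar products: flip that flag and re-score.
visitor[0, FEATURES.index("compare_similar")] = 1
print(f"after compare_similar: {buy_propensity(visitor):.2f}")

# Visitor views reviews: flip that flag and re-score.
visitor[0, FEATURES.index("reviews")] = 1
print(f"after reviews: {buy_propensity(visitor):.2f}")

# Business-specific cutoff (an assumption for illustration).
CHAT_THRESHOLD = 0.5
if buy_propensity(visitor) > CHAT_THRESHOLD:
    print("Offer a chat window to this visitor")
```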
Start off by learning about the various phases in a customer's life cycle. Explore the data generated inside and outside your business, and ways the data can be collected and aggregated within your organization. Then review three use cases for predictive analytics in each phase of the customer's life cycle, including acquisition, upsell, service, and retention. For each phase, you also build one predictive analytics solution in Python. In the final videos, author Kumaran Ponnambalam introduces best practices for creating a customer analytics process from the ground up.
- Understanding the customer life cycle
- Acquiring customer data
- Applying big data concepts to your customer relationships
- Finding high propensity prospects
- Upselling by identifying related products and interests
- Generating customer loyalty by discovering response patterns
- Predicting customer lifetime value (CLV)
- Identifying dissatisfied customers
- Uncovering attrition patterns
- Applying predictive analytics in multiple use cases
- Designing data processing pipelines
- Implementing continuous improvement