Understand how automated data collection in electronic commerce can become unmanageable.
- One of the difficulties posed by applications such as Google Analytics and Adobe Analytics is that they can provide you with too much of a good thing. Particularly in business applications, you want to use the analytics and data to help guide your decision making about pricing, advertising, delivery channels, product placement and so on. But an application that collects web traffic can easily return hundreds of different variables describing that traffic, measured across hundreds of thousands of site visits made by tens of thousands of potential customers.
It can be very difficult to decide which variables deserve further analysis and which don't. Furthermore, you can't analyze all those variables on a person-by-person basis, not when there are tens of thousands of people to analyze. Here is a simple example. I wrote a little app that collects hourly data on Amazon's sales rankings for a few of my books. Here's what it looks like, in a stripped-down example that shows results for just part of one day. Columns B through E report rankings from a data set of 50,000 observations, where each observation measures the rankings of four books, each hour of each day going back to 2010.
Columns G through J tally a sale if the ranking rose markedly from the prior hour. What I'm showing you omits 10 other books that I track, as well as their electronic editions. So I need a way to collapse all those 50,000 observations on each of four books into a less overwhelming mass of data. You might well have the same sort of problem, although instead of 50,000 dates and times you might have 50,000 different customers, and you may very well have more than just four books to track. With, say, 25 different products times 50,000 customers, you can see how things can get out of hand very quickly.
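The sale tally described for columns G through J can be sketched in a few lines of Python. This is only an illustration: the function name `tally_sales`, the 25% threshold in `min_improvement`, and the sample rankings are all assumptions, not values from the app described above. It treats a "marked rise" as a large drop in the rank number, since a lower number is a better rank.

```python
def tally_sales(ranks, min_improvement=0.25):
    """Return a 0/1 flag per hour: 1 when the rank improved markedly.

    ranks: hourly sales rankings, oldest first (lower number = better rank).
    min_improvement: hypothetical threshold, the fraction by which the rank
    number must drop from the prior hour to count as a sale.
    """
    flags = [0]  # the first hour has no prior hour to compare against
    for prev, curr in zip(ranks, ranks[1:]):
        improved = (prev - curr) / prev >= min_improvement
        flags.append(1 if improved else 0)
    return flags


# Made-up hourly rankings for one book, for illustration only.
hourly_ranks = [52000, 53100, 21000, 22500, 24000, 9800]
sales = tally_sales(hourly_ranks)
print(sales)       # [0, 0, 1, 0, 0, 1]
print(sum(sales))  # 2
```

Even with a tally like this, you still end up with one flag per book per hour, which is why some form of data reduction is needed.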
That's where data reduction comes in. The idea is to categorize observations, which might be customers, into a few groups rather than tens of thousands of individuals. That's the idea behind cluster analysis. When you have only one variable to measure, such as a customer's state of residence, the categories are pretty straightforward. But when you have the customer's location, approximate age, approximate income, type of operating system, sex, marital status and so on, you need something more sophisticated than just tossing all the California residents into one pile, South Dakota residents into another pile, and Florida residents into another pile until you have 50 piles.
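To make the idea of grouping observations on several variables concrete, here is a minimal sketch using k-means, one common clustering method. The course does not say which clustering algorithm it uses, so k-means is an assumption here, as are the function name `kmeans`, the two variables (age, and income in thousands), and the six made-up customers.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points, and repeat."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Made-up customers: (approximate age, approximate income in thousands).
customers = [(25, 40), (27, 42), (24, 38), (52, 95), (55, 100), (50, 90)]
# Hypothetical starting centroids: one customer from each apparent group.
labels, centers = kmeans(customers, [customers[0], customers[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In practice the variables would usually be standardized first, so that one measured on a large scale, such as income, does not dominate the distances.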
The same is true for the variables that you're measuring. These variables might represent your product lines, and 25 products might group easily into three or four product sets. But if they don't, or if you suspect that there might be a better way to group them, then a technique called principal components analysis can help you reduce the number of variables to a few factors (underlying, or latent, components) without losing a significant amount of information. In this course, I show you how the two techniques of cluster analysis and principal components analysis can put you in a more sensible position to analyze the big data that you've collected.
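The reduction of several variables to a few components can be sketched as follows. This is not the Excel or R procedure used in the course; it is a standard-library Python illustration that finds only the first principal component of a covariance matrix by power iteration. The function names and the small data set (three product variables for four customers) are assumptions made for the example.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def first_component(cov, iterations=200):
    """Power iteration: return (largest eigenvalue, its unit eigenvector)."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p  # arbitrary unit-length starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' C v gives the eigenvalue for unit-length v.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    return eigenvalue, v


# Made-up purchases of three products by four customers; the second
# product tracks the first closely, so the two carry redundant information.
purchases = [(2, 4, 1), (4, 8, 3), (6, 12, 2), (8, 16, 4)]
cov = covariance_matrix(purchases)
eigenvalue, direction = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
share = eigenvalue / total_variance  # variance explained by component 1
print(round(share, 3))
```

Because two of the three variables move together, a single component accounts for most of the total variance here, which is the sense in which the variables can be reduced without losing much information.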
These techniques can provide you with much stronger inferences regarding what your data is telling you about your customers and your products. This is a brief course, and as such it can't go into all the details of these techniques, but it does drill down to actually running a principal components analysis and a cluster analysis, so that you can see how this data reduction can take place on your company's computers.
In this course, Conrad Carlberg explains how to carry out cluster analysis and principal components analysis using Microsoft Excel, which tends to show more clearly what's going on in the analysis. Then he explains how to carry out the same analysis using R, the open-source statistical computing software, which is faster and richer in analysis options than Excel. Plus, he walks through how to merge the results of cluster analysis and factor analysis to help you break down a few underlying factors according to individuals' membership in just a few clusters.
- Reviewing the problems created by an overabundance of data
- Understanding the rationale for clustering and principal components analysis
- Using Excel to extract principal components
- Using R to extract principal components
- Using R for cluster analysis
- Using Excel for cluster analysis
- Setting up confusion tables in Excel
- Using cluster analysis and factor analysis in concert