Join Curt Frye for an in-depth discussion in this video Calculating the distance to a centroid, part of Up and Running with Excel Cluster Analysis.
After you create a table of data, and a set of random centroids from which to start your analysis, you need to create a series of formulas that calculate the distance between each data point and each centroid. We'll use the CalculateDistance.xlsx file as our sample file for this movie. The formulas we'll create, calculate what's called the Euclidian distance between two points. Okay, what's that? Well, you probably remember the Pythagorean Theorem, where the two sides of a triangle tell you the length of the hypotenuse.
That equation is a squared plus b squared equals c squared. So the formulas we use in this movie should look a little familiar. If you're working through chapter two as a series of exercises, as opposed to using the sample files, then you might see different values in your centroids table. If that's the case, don't worry about it, everything will still work, it's just your numbers will be a little bit different. We have three columns in our main table that allow us to calculate the distance to the centroids.
Distance to centroid one, centroid two and centroid three. Let's create the formula for the distance to centroid one first. So I click in cell C3, and then I'll start entering my formula. Now remember that c squared equals a squared plus b squared. So that means to find the distance c from a particular point to a centroid, we need to find the square root of a squared plus b squared. And that formula looks like this, it's equal and then sqrt, which is the square root function, then two left parentheses, A3, which is the x axis value for our first point, minus I3, which is the x axis value for the first centroid, and I don't want that changing, so I'll press F4 to make it an absolute reference.
Then a right parentheses. And then I'm going to square that difference. So that is caret, which is a shift 6, and then 2, which is squaring. Now I'll type a plus sign, and this is the b squared component of the equation. So I'll type a left parenthesis, then B3 minus J3, and again, making that an absolute reference by pressing F4, a right parenthesis, and then squared, again caret two, and a right parentheses and enter.
And there I have all of my values, and because I'm in an Excel table, Excel filled in the formulas to the rest of the column. Now I need to create similar formulas for D3 and E3, representing distances to centroid two and centroid three. So the distance is centroid two is equal, squared root, two left parenthesis, A3, minus, and now this time it's I4 because we are calculating the distance of the second centroid. So that's I4, F4 to make it an absolute reference.
Right parenthesis. Squared. Plus, left parenthesis B3 minus J4. F4 to make it an absolute reference. Right parentheses squared, right parentheses close, and tab. And we'll do the final thing for the distance to C3, and this time I'll just type it because I have already talked you through it twice. Two left parentheses, A3 minus I5, F4 for absolute, right parentheses, excuse me, squared.
Plus, B3 minus J5, absolute, right parentheses, squared, right parentheses, and enter. So there we have our formulas. We have now calculated the Euclidean distance between each point and each centroid and now we can find which centroid is closest.
- What is clustering?
- Setting up your Excel table
- Creating a list of centroids
- Identifying the closest centroid
- Updating centroids using averages
- Automating updates with macros
- Analyzing your data in Excel
- Visualizing data in a chart