- [Instructor] One of the dangers of business data analysis is making decisions too soon. The reason is that short-term results can be deceiving, but you should start to see patterns emerge as you gather more data. One reliable principle of data analysis is the central limit theorem, which says that as the number of measurements increases, the more likely it is that your data will be distributed as you expect. In most cases, your business data can be described in terms of its average, or mean, and its standard deviation as part of the normal distribution.
The normal distribution is described by the normal curve, which is also called the Gaussian curve. In this case, we have a normal curve with an average, or mu, of 100 and a standard deviation, or sigma, of 20. As you can see, most of the values occur around the average of 100, plus or minus one standard deviation. So what do standard deviations tell us? Well, they tell us the approximate percentage of values that will be within one, within two, or within three standard deviations of the mean.
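The bell shape described here comes from the Gaussian density formula. Here is a minimal Python sketch using the same mu of 100 and sigma of 20 as the on-screen example; the function name `normal_pdf` is my own, not from the course files:

```python
import math

def normal_pdf(x, mu=100.0, sigma=20.0):
    """Density of the normal (Gaussian) curve with the given mu and sigma."""
    coef = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coef * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# The density peaks at the mean and falls off symmetrically on both sides.
print(normal_pdf(100))  # peak density, about 0.0199
print(normal_pdf(120))  # one standard deviation above the mean
print(normal_pdf(80))   # one standard deviation below: same height, by symmetry
```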
Within one standard deviation, plus or minus, you expect to see about 68% of values. You can see that the total area under the curve between minus one and plus one standard deviation does seem to account for a little over two-thirds of the values. Within two standard deviations, you will see about 95% of the values. You can see that there is not a lot of the curve left outside of those values. And finally, for three standard deviations, you expect to see about 99.7% of the values, and you can see that the tails of the curve get very close to zero at that point.
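You can verify the 68-95-99.7 rule empirically by drawing a large random sample. A quick Python sketch, again assuming the mu of 100 and sigma of 20 from the example (NumPy is assumed to be available):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 100, 20
samples = rng.normal(mu, sigma, 1_000_000)

# Fraction of sampled values within 1, 2, and 3 standard deviations of the mean;
# these should land close to 68%, 95%, and 99.7%.
for k in (1, 2, 3):
    within = np.mean(np.abs(samples - mu) < k * sigma)
    print(f"within {k} standard deviation(s): {within:.3f}")
```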
It's always possible to see values that are further away, but as the 99.7% figure tells you, they occur very infrequently. Now I'll switch over to Excel to show you how the number of values that you collect influences your data. I'm working in the central limit theorem workbook. This is a macro-enabled workbook that you can find in the Chapter One folder of the Exercise Files collection. In this case, I have 30 normally distributed values, and you can see that there is some indication of a central tendency, but that there are some rare values above and below three standard deviations.
If I click the 30 button again you can see that again the values are spread out and there's not really anything approaching a normal curve. With 100 values, we start to see a little bit more of a measure of central tendency. We have more values toward the middle and fewer toward the outside. If I click 100 again for a different data set we see a different version of the same pattern. One more time, same thing, this time shifted a little bit to the left. With 1,000 values, then we start to see a stronger indication of a central pattern.
I'll click it again a couple of times, and you can see that there is a stronger measure of central tendency. With 10,000 values, the pattern becomes even more clear. Many more of the values are clustered toward the center, regardless of which random sample I select. And I'm clicking several times in a row here. With 100,000 values, again randomly sampled from a population, you see a very strong pattern emerge. The values toward the center outnumber the values toward the edges much more decisively.
The pattern only shifts a little, and that's because, as you collect more sample values, the central limit theorem starts to take hold.
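The same experiment the workbook runs can be sketched in Python: for each sample size, draw several independent samples and watch how much the shape of the data varies from draw to draw. This is my own illustration of the idea, not the workbook's macro; I summarize each sample by the fraction of values within one standard deviation of the mean, which should settle near 68% as the sample grows:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 100, 20

# For each sample size, draw five independent samples and record the
# fraction of values within one standard deviation of the mean. With
# small samples that fraction bounces around from draw to draw; with
# large samples it settles close to the theoretical 68%.
for n in (30, 100, 1_000, 10_000, 100_000):
    fractions = [float(np.mean(np.abs(rng.normal(mu, sigma, n) - mu) < sigma))
                 for _ in range(5)]
    spread = max(fractions) - min(fractions)
    print(f"n={n:>7,}  spread across repeats = {spread:.3f}")
```

The shrinking spread mirrors what you see on screen: each click of the 100,000 button produces nearly the same histogram, while each click of the 30 button produces a noticeably different one.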
- Distinguish between the mean, median, and mode.
- Describe the relationship between variance and standard deviation.
- Identify a nondirectional hypothesis.
- Point out the difference between COVARIANCE.P and COVARIANCE.S.
- Explain correlation.
- Analyze Bayes’ rule.