IntroductionWelcome| 00:04 | Hi! I'm Curt Frye.
Welcome to Excel 2007 Business Statistics.
| | 00:09 | In this course, I'll show you powerful
techniques you can use to analyze
| | 00:12 | your business data to help you make better
decisions about your company and its operations.
| | 00:17 | I'll start by showing you how to use
Excel's built-in tools to manage your data,
| | 00:21 | examine your formulas, and install the
Analysis ToolPak add-in which greatly
| | 00:26 | expands your ability to analyze data in Excel.
| | 00:29 | Next, I'll define important statistical
terms, discuss the two different types
| | 00:34 | of statistical analysis, and provide an
overview of how you can use statistics
| | 00:38 | to help run your business.
| | 00:40 | With that information in mind, I'll
dive into the specifics of statistics in
| | 00:44 | Excel, showing you how to summarize
your data by using charts and graphs,
| | 00:48 | creating frequency distributions for
non-numerical data, and by calculating
| | 00:53 | averages, standard deviations,
and other useful statistics.
| | 00:58 | The last half of the course presents
more advanced techniques, such as analyzing
| | 01:01 | several distinct data distributions,
inferring information about a data set
| | 01:06 | based on a sample, and testing hypotheses.
| | 01:10 | I'll end the course with a look at
forecasting and linear regression, techniques
| | 01:14 | that you can use to plan
the future of your business.
| | 01:17 | Excel 2007 provides powerful statistical
capabilities that help you gain useful insights.
| | 01:22 | I'm sure you'll find this
course will be time well spent.
| | Collapse this transcript |
| Using the exercise files| 00:00 | If you are a premium member of the
lynda.com Online Training Library, or if
| | 00:04 | you're watching this tutorial on a
DVD-ROM, you have access to the exercise
| | 00:09 | files used throughout this title.
| | 00:11 | The exercise files are organized
within a folder named Exercise Files.
| | 00:16 | If you double-click it, you'll see
a folder for each of the chapters.
| | 00:21 | Just to show you what the files look
like within the chapter, we'll open
| | 00:24 | Chapter03, and you can see
that you have the exercise files.
| | 00:28 | So, if you are watching a movie that's
in Chapter 3, all you need to do is look
| | 00:32 | in your Chapter03 exercise files
folder, and you'll be able to follow along
| | 00:36 | exactly as I'm working.
| | 00:38 | If you are a monthly subscriber, or
annual subscriber to lynda.com, you don't
| | 00:42 | have access to the exercise files,
but you can follow along from scratch
| | 00:46 | with your own assets.
| | 00:47 | Let's get started!
| | Collapse this transcript |
|
|
1. Introducing StatisticsIntroducing statistical analysis| 00:00 | Numbers are everywhere, especially in
business where your company's health and
| | 00:03 | future depend on your
ability to make good decisions.
| | 00:06 | You can use statistical methods to
analyze your data to support that work.
| | 00:10 | Statistics is the practice of
collecting, analyzing, and interpreting data.
| | 00:16 | There are two major branches of
statistics, descriptive statistics and
| | 00:21 | inferential statistics.
| | 00:23 | You use descriptive statistics when
you know all of the values in a dataset.
| | 00:27 | For example, if you examine the olives
on a tree, you can count the number of
| | 00:31 | olives, in this case 4,000.
| | 00:33 | Then you can find the number of green,
olives 3,000, the number of black olives,
| | 00:37 | which are 1,000, and then find the proportion.
| | 00:40 | In this case, the green olives are
75% of the total, and the black olives
| | 00:44 | are 25% of the total.
| | 00:46 | For inferential statistics, you take
information from part of your population
| | 00:50 | and extrapolate it to make
guesses about your entire population.
| | 00:55 | So, for example, let's say that
you have a grove of 100 olive trees.
| | 00:59 | You take a sample from five of those trees.
| | 01:02 | You have 21,000 olives, of which
14,070 are green and 6,930 are black.
| | 01:08 | You see that the percentages are
green is 67%, and black is 33%.
| | 01:14 | Now that's true for those five trees,
| | 01:16 | but it might not be in effect--almost
certainly will not be--exactly true for the
| | 01:21 | entire grove of 100 trees.
| | 01:23 | So what you need to ask is how close am
I to the true value, and how confident
| | 01:28 | am I in that estimate?
| | 01:30 | When you ask how close, you're asking how
close your measured value is to the true value.
| | 01:34 | So, for example, black
olives make up 33% of your sample.
| | 01:38 | Well, the total might be 33%, but
most likely it varies a little bit.
| | 01:43 | So, for example, it
might be 33% plus or minus 2%.
| | 01:47 | So the total percentage of black olives
might be between 35% and 31%. Indicating
| | 01:53 | how close gives you a range.
| | 01:55 | Next, you ask, how confident am I in that range?
| | 01:58 | So you're asking what percentage of
random samples taken from this population
| | 02:02 | would fall within the range of, say 31% to 35%.
| | 02:06 | The statement of how confident you are
is what's called a confidence interval.
| | 02:10 | I'm going to switch from olives, and
use a more familiar example perhaps, when
| | 02:14 | it deals with political polling.
| | 02:17 | So if we say that Governor A has an
approval rating of 60% with a margin of
| | 02:22 | error on the survey of 3%,
| | 02:24 | how confident are we of that?
| | 02:27 | Most of the time, the confidence level
that you will find in statistics is 95%.
| | 02:32 | So in other words, 95% of the samples
that we take from the entire population
| | 02:37 | should produce an approval
rating of between 57% and 63%.
| | 02:41 | The other 5% would generate values
outside of that range, for example, 55%,
| | 02:45 | 54%--or on the high side, 70% or 75%.
| | 02:50 | But we know from inferential
statistics that 95% of the time, the approval
| | 02:54 | rating for a survey will be within 3% plus
or minus of the actual true approval rating.
| | 03:01 | So the best way to phrase that
statement is, we are 95% certain that Governor A
| | 03:07 | has an approval rating of 60%, plus or minus 3%.
| | 03:11 | When you read an article online or
pick up a newspaper, try to find the story
| | 03:15 | that doesn't refer to numbers in some way.
| | 03:17 | You probably won't be able to do it.
| | 03:18 | Numbers are everywhere.
| | 03:20 | But once you master the techniques in
this course, you'll have an easier time
| | 03:23 | making sense of your own data and
interpreting data that others give you.
| | Collapse this transcript |
| Understanding statistical terms| 00:00 | Statistics is a wide-ranging field of study.
| | 00:02 | Anytime you describe, categorize,
or interpret data, you're performing
| | 00:07 | statistical analysis.
| | 00:09 | In this movie, I will introduce and define
terms we'll encounter throughout the course.
| | 00:13 | The most basic term in statistics is data.
| | 00:17 | So, what constitutes data?
| | 00:19 | I prefer to define the term "data" as
values that describe an attribute of
| | 00:23 | an object or an event.
| | 00:25 | I've made by definition vague on purpose.
| | 00:27 | Almost anything can be considered data.
| | 00:30 | For example, you could record an
employee's salary, the number of bottles of
| | 00:34 | olive oil your company sold last week, or
how much your company spent on boxes last year.
| | 00:39 | Finally, just a quick note on usage:
| | 00:41 | the term data is technically plural,
and singular is datum if you're referring
| | 00:46 | to a single data point,
| | 00:48 | but you usually only hear people use
the word datum in academic circles,
| | 00:52 | so data is fine in casual conversation.
| | 00:55 | Next, you can talk about a data set.
| | 00:56 | A data set is made up of two components:
| | 01:00 | observational units and variables.
| | 01:03 | The observational units are people or
things or events on which data is collected.
| | 01:08 | As an example, suppose you have a
data set that contains the name, age, and
| | 01:13 | salary of every employee at the company.
| | 01:16 | The employees are the
observational units, and their names, ages, and
| | 01:19 | salaries are the variables.
| | 01:22 | The term "population" refers to
every member of the group of interest.
| | 01:26 | The data set in the example above is
population data, since every employee is in the data set.
| | 01:31 | You might also hear about a
researcher referring to a sample.
| | 01:35 | A sample is a subset of a population.
| | 01:38 | You can select a sample
using any number of rules,
| | 01:40 | but the most statistically valid way
to do so is to select a random sample.
| | 01:44 | I used the word "random" in a previous definition,
| | 01:48 | so let me be explicit about what it
means to say that something is random.
| | 01:51 | A random value is a value
selected entirely by chance.
| | 01:55 | Values generated using an algorithm,
such as the one behind Excel 2007's RAND,
| | 02:00 | function aren't truly random.
| | 02:02 | They are what are called pseudorandom numbers.
| | 02:05 | In other words, the numbers
look random at first glance,
| | 02:07 | but if you know the algorithm and the
starting point, you could predict the
| | 02:10 | sequence of numbers that come out.
| | 02:12 | Pseudorandom numbers are usually
good enough for most applications,
| | 02:15 | but it is possible to generate truly
random numbers using the physical systems
| | 02:20 | by tracking weather phenomena and the like.
| | 02:23 | Next, I'd like to mention data distributions.
| | 02:26 | When you examine your data set, you can
calculate how often individual values
| | 02:29 | occur, and plot those frequencies on a chart.
| | 02:32 | The curve that contains those
frequencies is called the distribution.
| | 02:36 | You might be familiar with the normal
distribution, which is more commonly known
| | 02:39 | as the bell-shaped curve.
| | 02:41 | In this chart, you see that there are
more values near the average and that
| | 02:45 | values are distributed with decreasing
frequency as they move away from the average.
| | 02:48 | There are other distributions in
addition to the normal distribution.
| | 02:52 | I'll cover them later in this course.
| | 02:53 | Anytime you describe an event, you
could assign either a probability of the
| | 02:58 | event will occur or the
odds that the event will occur.
| | 03:01 | Probability and odds are
related and often confused concepts.
| | 03:05 | So let's start with probability.
| | 03:06 | Probability is the likelihood of
something happening that is expressed using a
| | 03:10 | percentage or phrasing, such as X out of Y times.
| | 03:14 | For example, if you flip a coin, it
should come up heads one out of two times.
| | 03:19 | If you state the probability as a
percentage, you would say that a coin is 50%
| | 03:22 | likely to land with the head side up.
| | 03:25 | Odds, by contrast, are an expression of
the likelihood of an event occurring as
| | 03:29 | compared to the number of times it won't.
| | 03:31 | For example, three are six
numbers on a casino die,
| | 03:34 | so the odds of the number four
appearing are five to one against.
| | 03:38 | That is, five ways the number four
won't appear versus one where it will.
| | 03:42 | The final term that I would like to
define is correlation, which is the tendency
| | 03:47 | for changes of one variable to be
related to changes in another variable.
| | 03:51 | For example, a person's height at
maturity is strongly correlated with his or
| | 03:56 | her parent's heights.
| | 03:57 | Now it's very important that you not
confuse correlation with causation.
| | 04:01 | Just because two values follow the same
pattern, doesn't mean the changes in one
| | 04:05 | causes changes in the other.
| | 04:07 | Finally, I'd like to show you this
comic strip from xkcd.com, which points out
| | 04:11 | some of the problems with
correlation and causation.
| | 04:14 | In this movie, I've defined some of
the terms I'll use throughout the course.
| | 04:18 | I'll expand on these definitions as I go along,
| | 04:20 | but in the next movie, I'll show you
how to apply statistics in a variety
| | 04:24 | of situations.
| | Collapse this transcript |
| Using statistics| 00:00 | In this movie, I'll show you a few ways
you can use statistics to analyze your
| | 00:04 | company's data to gain valuable
information about your operations.
| | 00:07 | For example, you can use statistics to
discover the relative popularity of your products.
| | 00:12 | For example, if the Two Trees Olive Oil
Company sells four different infusions
| | 00:17 | of olive oil, then you can create a pie
chart that indicates the share that each
| | 00:21 | of those infusions has toward total sales.
| | 00:24 | So, for example, you could indicate
the numbers for lemon, for the mandarin
| | 00:29 | infusion, and so on.
| | 00:31 | You can also use descriptive
statistics to describe your data.
| | 00:35 | So let's say, for example, that you
have a table that contains a month's
| | 00:38 | worth of sales data.
| | 00:40 | You can use the descriptive analysis
tools to summarize your data, and you get
| | 00:44 | quite a few different measures.
| | 00:45 | You have the mean, which is the
average, standard error, the median.
| | 00:50 | You get an entire range of values, and
| | 00:51 | I'll show you how to use
most of these within the course.
| | 00:55 | Next, you can analyze how
your data is distributed.
| | 00:57 | So let's say once again that you are
looking at daily sales for the Two Trees
| | 01:01 | Retail Shop, and you find that your
average day's sales is 2,000 with a standard
| | 01:08 | deviation, which I'll explain
many times during the course, is 500.
| | 01:12 | That means that the vast majority of
your sales days will have totals between
| | 01:17 | $1,500 and $2,500, with values around
3,000 or 3,200 being much more rare.
| | 01:25 | So if you have a number of extremely
high or extremely low sales days in a row,
| | 01:29 | you need to look at what's going on
to see if you can make any decisions or
| | 01:33 | changes about your operation. Or if you
have higher sales than you expected, try
| | 01:38 | to find out what's going
right and keep doing it.
| | 01:40 | Finally, you can use linear regression
techniques to determine how your sales
| | 01:45 | might improve in the future.
| | 01:46 | For example, if the Two Trees Olive
Oil Company has had a number of articles
| | 01:50 | written about it in the press, you
might want to see how closely the number of
| | 01:54 | articles relates to the number of sales.
| | 01:57 | To indicate that in the chart, you
would draw a trendline or linear regression
| | 02:02 | line to see how well those
two variables move together,
| | 02:05 | in other words, how
strongly they are correlated.
| | 02:08 | Analyzing data using statistical
methods helps you discover information about
| | 02:12 | your company's operations.
| | 02:14 | But assessing that information to
make decisions is a human skill.
| | 02:17 | You need to take your business
environment, competition, and operations into
| | 02:21 | account when you decide how to
use the information you develop.
| | Collapse this transcript |
|
|
2. Learning Useful Excel TechniquesCreating and using an Excel table| 00:00 | Excel 2007 introduced a new tool that helps
you manage your worksheet data effectively.
| | 00:05 | That tool is the Excel table.
| | 00:07 | You'll be using Excel tables throughout
this course, so I wanted to show you how
| | 00:11 | to create them and also how to
refer to them within formulas.
| | 00:16 | To create an Excel table from existing
data, you should have a data list with
| | 00:20 | column headers, no blank rows, and
no extra data around the table data.
| | 00:25 | So, for example here in the cell
range A2 through B15, you'll see that I have
| | 00:30 | dates 1 January through 13
January, and then also sales data.
| | 00:35 | I have nothing: no headers, no
extraneous data in any cell around the table data.
| | 00:41 | That way, when I create the table, the
only thing that would be included is this
| | 00:44 | dat. And also, I have no blank
rows; everything is taken up.
| | 00:50 | It's okay if you have a blank cell.
Say, for example, if I didn't have a value
| | 00:53 | for January 9, that cell could be blank.
| | 00:56 | But both cells could not be blank,
because then that would break the table, it
| | 01:00 | would break the chain, and Excel, when
you create the table, would only include
| | 01:04 | the values in those cells,
including the headers.
| | 01:09 | To create the table, you click any cell
in the data list that you want to make a
| | 01:13 | table, and then on the Home tab of the
Ribbon, click Format As Table, and then
| | 01:19 | select a table style.
| | 01:21 | When you do, Excel displays
the format as Table dialog box.
| | 01:25 | Verify that the range it identified,
A2 through B15, which is correct here--
| | 01:30 | it's in the moving outline. That is correct.
| | 01:33 | Then also My table has headers,
| | 01:36 | that check box should be selected,
because Sales and Date are the headers.
| | 01:42 | Everything looks good, so you can
click OK, and Excel creates the table.
| | 01:47 | A couple of things to notice about the table:
| | 01:48 | First, you will notice that a new
contextual tab appeared, and that is the Design
| | 01:54 | contextual tab under Table tools.
| | 01:56 | That allows you to affect the table as a whole.
| | 01:59 | Also, you'll see that the table headers
have filter arrows that you can use to
| | 02:04 | filter and sort your data within the table.
| | 02:07 | Another advantage is that your table has a name.
| | 02:11 | Now, the name, when it starts, is
going to be something like table 1, table
| | 02:14 | 2, table 12, whatever.
| | 02:16 | But because you are going to be
referring to your table data within formulas,
| | 02:20 | you should rename the table,
so it's more descriptive.
| | 02:22 | That makes your formulas easier to understand.
| | 02:26 | To rename a table, on the Design
contextual tab, click in the Table Name box, and
| | 02:31 | then type the new name.
| | 02:31 | I will call this one
"Daily Sales," and press Return.
| | 02:37 | Now, instead of referring to cells, say
B3 through B15, which is the sales data,
| | 02:43 | you can use the table name.
| | 02:45 | To do that, you type in equal sign,
and in this case I'll just make a SUM
| | 02:49 | formula to find the total of those
values, left parentheses, and then you start
| | 02:54 | typing in the name of the
table, and that's "Daily Sales."
| | 02:59 | You see that Excel, in the formula auto
complete list, displays the name of the table here.
| | 03:05 | While it's highlighted, I can press the
Tab key to accept it. But I'm not quite
| | 03:09 | there, because I want to refer
to the data in the sales column.
| | 03:13 | To refer to a column, type a left square
bracket, and then you'll see a list of
| | 03:19 | the available columns, and other
elements of the table that I won't go into now.
| | 03:24 | I want to use the sales data,
| | 03:25 | so I'll type in S. While Sales is
highlighted, I'll press Tab to accept it, so I
| | 03:31 | have the Sales column.
| | 03:33 | Then I press a right square bracket
and a right parenthesis to close the
| | 03:38 | formula. Everything looks right.
| | 03:39 | Press Enter and Excel creates the formula.
| | 03:43 | Now, the formula that I created
indicates the name of the table and the column
| | 03:49 | from the table that I'm using in the formula.
| | 03:52 | It makes it much easier to understand,
especially if you either come back to a
| | 03:55 | worksheet after you've been away from
it for a while, or if you need to share
| | 03:59 | your workbook with a colleague and they
need to find out how the formulas work.
| | 04:03 | If you add a row of data to your table,
then Excel updates the formula to accept it.
| | 04:08 | So, for example, if I click cell B15 and
then press Tab, Excel creates a new table row.
| | 04:14 | Then I can type the date, 1/14,
which is rendered as 14-January.
| | 04:18 | Then if I type in a value such as 5,000,
Excel updates the formula's result, but
| | 04:24 | notice that the formula has not changed.
| | 04:27 | In other words, it didn't change to
include cell B16 within the formula
| | 04:30 | explicitly, because that's implied
by Daily Sales and the Sales column.
| | 04:36 | Excel tables let you manage your data
easily and refer to your data and formulas
| | 04:40 | without using cell references,
which can be hard to understand.
| | 04:44 | Be sure to rename your table to
reflect the data it contains, which makes your
| | 04:47 | formulas even more effective.
| | Collapse this transcript |
| Installing the Analysis ToolPak| 00:00 | Another element of Excel 2007 that you
will use several times during this course
| | 00:05 | is the Analysis ToolPak.
| | 00:07 | The Analysis ToolPak is an add-in
that comes with Excel 2007, but isn't
| | 00:11 | necessarily available
when you install the program.
| | 00:14 | In this movie, I'll show you how to
install the Analysis ToolPak and to access
| | 00:18 | its tools from Excel.
| | 00:20 | An add-in, as the name implies, is a
group of capabilities you can add to Excel.
| | 00:25 | To make the data Analysis ToolPak
available, first you click the Office button,
| | 00:30 | and then click Excel Options.
| | 00:33 | In Excel Option dialog box, click Add-
Ins to display that page of the Excel
| | 00:39 | Options dialog box, and then you'll see
a list of inactive application add-ins.
| | 00:43 | And right at the top of the list is the
Analysis ToolPak, and its type is Excel add-in.
| | 00:50 | To make the Analysis ToolPak available
to install it with an Excel, go down to
| | 00:55 | the Manage list and if necessary, select
Excel Add-ins--again, that's the type of
| | 01:01 | the Analysis ToolPak--and then click Go.
| | 01:05 | Then in the Add-Ins dialog box,
select the Analysis ToolPak check box.
| | 01:10 | You don't need the Analysis ToolPak -
VBA add-in because that stands for Visual
| | 01:15 | Basic for Applications, and you
won't be doing any programming.
| | 01:19 | Click OK and Excel adds the add-in.
| | 01:23 | If you didn't do a complete
installation of Excel 2007, then you might see a
| | 01:28 | dialog box that asks you if you
want to configure Excel 2007.
| | 01:33 | Indicate that you do, and you might
need to have your Office 2007 installation
| | 01:37 | disk handy for Excel to pull the files on.
| | 01:40 | But once you get the add-in installed,
then you can use it, and after you make
| | 01:45 | the add-in available, it appears on the
Data tab. And you see it here in the New
[00:01: 51.04]
Analysis Group, and it is called Data Analysis.
| | 01:54 | I won't go to all the details of how to
run it, or its output, but what I will
| | 01:58 | do is show you how to access it.
| | 02:01 | First thing you do is click any cell
in the group of data that you want to
| | 02:04 | analyze--in this case the Excel
table--and then click Data Analysis.
| | 02:09 | In the Data Analysis dialog box, you can
select whatever tool that you want to use.
| | 02:15 | So, for example, if you click
Descriptive Statistics and click OK, Excel
| | 02:20 | displays the Descriptive Statistics dialog box.
| | 02:22 | I will stop here for now, but I just
wanted to show you how to install the
| | 02:27 | Analysis ToolPak Add-In, and also how to run it.
| | 02:30 | The Analysis ToolPak Add-In extends
Excel's capabilities significantly.
| | 02:35 | You'll use its functions several times
in this course, so you should install it
| | 02:38 | before you start performing your analysis.
| | Collapse this transcript |
| Auditing formulas step by step| 00:00 | Excel formulas, particularly those
that combine values from more than two or
| | 00:03 | three cells, can be difficult to
comprehend when you encounter them.
| | 00:07 | If you find a formula you don't fully
understand the first time you see it,
| | 00:11 | you can use the Excel Formula Auditing tools
to step through it and get more information.
| | 00:16 | So let's say that you have a
worksheet that is creating a series of
| | 00:21 | seasonal forecasts.
| | 00:23 | You'll see this worksheet later in the
course, so don't worry about the details
| | 00:26 | for now. But what I want to do is show
you how to analyze an individual formula
| | 00:31 | within this worksheet.
| | 00:32 | For example, let's take a look at
the value and the formula in cell C10.
| | 00:39 | The formula is displayed here in the
formula bar, and it's relatively simple.
| | 00:42 | It multiplies the value in B6 by
0.67 and adds that to the value in cell
| | 00:48 | C6 multiplied by 0.33.
| | 00:50 | But let's say that, for whatever reason,
you're having a hard time getting your
| | 00:53 | head around that formula.
| | 00:54 | There are some things that you can do
Excel to give you more information, and allow
| | 00:59 | you to analyze that formula.
| | 01:01 | To do that, you click the Formulas tab,
and then in the Formula Auditing group,
| | 01:06 | you can click Trace Precedents.
| | 01:10 | Clicking Trace Precedents while you
have a cell with a formula selected draws
| | 01:14 | arrows from the cells that provide
values for that formula; in this case, cells
| | 01:19 | B6 and C6 provide values for
the formula in the cell C10.
| | 01:25 | If you want to remove those arrows,
you can click the Remove Arrows button.
| | 01:30 | Similarly, if you want to find the cells
that use a particular cell in their own
| | 01:35 | formulas, then you could click, say,
cell B6 and then click Trace Dependents.
| | 01:44 | In this case, the value in cell B6 is
only used in the formula in cell C10.
| | 01:50 | And again, you can click the Remove Arrows
button to get rid of the tracer arrows.
| | 01:56 | Now let's click cell C10 again and evaluate
the formula in cell C10 one step at a time.
| | 02:04 | To do that, again, on the Formulas tab,
with these cell with the formula you want
| | 02:08 | to examine selected, click Evaluate Formula.
| | 02:12 | The Evaluate Formula dialog box opens, and
you can see the formula as it is spelled out.
| | 02:18 | Now what's going to happen when we
click the Evaluate button is that Excel is
| | 02:23 | going to evaluate the next term in the equation.
| | 02:26 | In this case, it's underlined,
and that is the value in cell B6.
| | 02:31 | So when we click Evaluate, or when
you press Enter because that button is
| | 02:35 | highlighted, Excel changes the value to
067 multiplied by 61,875. And again, that
| | 02:43 | value came from cell B6.
| | 02:46 | When we click Evaluate again, Excel
will evaluate the underlined term 0.67
| | 02:52 | multiplied by that number.
Click Evaluate and you get the result.
| | 02:57 | Next, it's going to
evaluate C6, which has 52,937.
| | 03:01 | So when we click Evaluate, that number
appears there. And now the formula, we'll
| | 03:07 | multiply those two values together.
And finally, Excel is going to add the
| | 03:12 | values 17,000 and change plus 41,000 and change.
Click Evaluate and you get your final answer.
| | 03:20 | If you want to go through the formula
again step by step, you can click Restart
| | 03:24 | to go back to the beginning.
| | 03:26 | When you're done, click Close.
| | 03:29 | The more time you spend acquainting
yourself with the formulas on the workbook,
| | 03:32 | the sooner you'll be able to
manage that workbook effectively.
| | 03:35 | Excel 2007's Formula Auditing tools can
provide powerful insights into what might
| | 03:40 | seem like impenetrable logic.
| | Collapse this transcript |
|
|
3. Summarizing Data Using Tables and GraphicsCreating frequency distributions for qualitative data| 00:00 | When you think of data you store in
an Excel worksheet, you usually think of
| | 00:03 | numbers such as item costs, the total of
a customer's order, or your electric bill.
| | 00:08 | In some cases though, you will work
with qualitative data, such as the name of a
| | 00:12 | product, a state, or a city.
| | 00:15 | In this movie, I'll show you how to
summarize occurrences of qualitative values
| | 00:19 | using Excel's built-in formulas and functions.
| | 00:23 | So I have an Excel table filled with
data. And in this case we have order numbers,
| | 00:28 | and each individual order was for a
single type of olive oil sold by the Two
| | 00:34 | Trees Olive Oil Company.
| | 00:36 | So I have lemon, garlic,
jalapeno, orange, and so on.
| | 00:39 | If you want to count the number of
times each of those infusions was ordered,
| | 00:43 | you need to do two things:
| | 00:45 | The first thing you need to do is to
identify all of the infusions that were
| | 00:48 | ordered as listed in the Excel table.
| | 00:51 | Second, you need to count the
occurrences of each value within the list.
| | 00:56 | To find all of the different values in
column B, you go to the Data tab and then
| | 01:02 | click Advanced Filter.
| | 01:04 | When you click Advanced Filter the
dialog box of the same name appears.
| | 01:08 | The action that you want to take is to
copy the results of the filter to another
| | 01:12 | location, and we'll fill in our
target value here. And Copy to,
| | 01:17 | let's put it starting in cell D1.
| | 01:19 | Now the List range is
currently listed as A1 through B29--
| | 01:26 | that's the entire table; instead, we
should make it B1 through B29. That way we
| | 01:34 | only get the values in
column B, the Infusion column.
| | 01:38 | If we were to leave it as A1, then
every row would be different, so the filter
| | 01:43 | would have no effect.
| | 01:44 | In other words, because each value in
the Order Number column is different, then
| | 01:48 | every row in the table would, by
necessity, be different, or be unique.
| | 01:53 | Then we want to check the
Unique records only check box.
| | 01:58 | What that will do is it will find each of the
different values in the range B1 through B9.
| | 02:05 | When we click OK, Excel displays the list.
| | 02:09 | Now, Infusion is the header,
| | 02:10 | so I'll just format that as a header by
pressing Ctrl+B to make it bold. And now
| | 02:16 | I can create formulas that will count
the occurrence of lemon, garlic, orange,
| | 02:21 | jalapeno, and so on in my range.
| | 02:25 | To do that, we will use a COUNTIF formula.
| | 02:28 | So I will type = countif(,
and now I identify the range.
| | 02:35 | The range is the area of the Excel table
that includes the data that I want to look at,
| | 02:41 | so that is the Infusion column.
| | 02:44 | So I will move my mouse pointer over
the top of the column header in the table,
| | 02:50 | and you'll see that my mouse pointer
changed to a downward-pointing black arrow.
| | 02:54 | When I click Excel identifies
the table, Table1, and the column.
| | 02:59 | Now I can type a comma
and enter in the criteria.
| | 03:03 | The value that I want to look for is in cell D2.
| | 03:05 | In this case, that's lemon. So I'll type D2.
| | 03:09 | Again, I'm looking the value by
reference instead of by naming it directly.
| | 03:13 | You'll see why that's an advantage in a second.
| | 03:16 | I'll type a right parenthesis to
end the formula. It looks correct.
| | 03:20 | When I add the formula, you
see that we have nine occurrences.
| | 03:24 | Now what I'll do is I will copy the
formula down by clicking the cell and then
| | 03:29 | grabbing the fill handle at the
bottom-right corner, and drag it down.
| | 03:35 | When I do Excel extends the
formula, and it does it in this way.
| | 03:39 | Remember that the formula in the first
cell is COUNTIF, Table1 Infusion, D2.
| | 03:43 | D2 is the criteria.
| | 03:48 | When I copied the formula down, Excel
took advantage of the fact that this
| | 03:52 | reference here, D2, is a relative reference.
| | 03:57 | Copying the formula down one row
caused the cell reference at D2 to change to
| | 04:03 | D3, and then in this cell it's
D4, and in this cell it's D5.
| | 04:09 | Now with this data you
can create a column chart.
| | 04:13 | To do that, select the data that you
want to display in the chart, click Insert,
| | 04:19 | and then click the type of
column chart you want to create.
| | 04:23 | Doing so creates a chart with a visual
representation of the data that you just created.
| | 04:28 | You can use the COUNTIF function to
find and count text values in the dataset.
| | 04:33 | By naming the text you want to find in
the formula's criteria, you can summarize
| | 04:37 | qualitative values using numbers that
you can analyze using traditional methods.
| | Collapse this transcript |
| Summarizing data using bar graphs and pie charts| 00:00 | Running a company generates a lot of
data, often more than you can easily
| | 00:03 | comprehend by looking at the individual values.
| | 00:06 | If you want to summarize your data
visually, you can create a chart.
| | 00:10 | In this movie, I'll show you how to
create three common charts: the column chart,
| | 00:15 | a line chart, and a pie chart.
| | 00:18 | You create a column chart
to summarize category data.
| | 00:21 | So, for example, let's say that
you were selling olive oil through
| | 00:25 | three different vectors, and that would be
the web, in your store, and also by catalog.
| | 00:31 | If you wanted to create a column chart,
you can click any cell in your Excel
| | 00:35 | table or in your data range--
| | 00:38 | it doesn't have to be a table--
| | 00:39 | then on the Insert tab in the Charts group,
click the type of chart that you want to create--
| | 00:45 | in this case, a column chart--
and then click the subtype.
| | 00:49 | We'll just click a regular
Clustered Column, 2-dimensional chart.
| | 00:53 | When you do, Excel creates a chart
with the Amount as the title, and Amount
| | 00:59 | corresponds here to the
column header for the data.
| | 01:03 | You can also see the axis here along
this side, indicating the relative values,
| | 01:08 | and then the three different
channels here: Web, Store, and Catalog.
| | 01:13 | If you have data that's spread out over a
time series, you can create a line chart.
| | 01:18 | So I'll click the Line Chart sheet
tab and here you see I have data for the
| | 01:24 | years 2007 through 2010, and
again we are looking at Sales.
| | 01:29 | To create a line chart, you click any
cell in the data collection Insert > click
| | 01:35 | Line, and we'll just click
the basic 2-D line chart.
| | 01:40 | Now, notice that the chart
doesn't look as we expected.
| | 01:43 | Instead of having the years 2007, 2009,
and 2009, 2010 along the bottom, we
| | 01:49 | have 1, 2, 3, and 4.
| | 01:51 | We have a second line here at the
bottom, this blue line called Year, which
| | 01:55 | corresponds to value 2007 through 2010.
| | 01:59 | So what happened was that Excel
misinterpreted the data in the year column
| | 02:03 | and thought that it should be plotted as a
separate series instead of using it as the column,
| | 02:09 | so you need to adjust the chart.
| | 02:11 | To do that, with the chart selected on the
Design contextual tab, click Select Data.
| | 02:17 | Now you can indicate which data that
you want to work as that horizontal or
| | 02:22 | category values. But first, we can get
rid of Year from the Legend Entries--and
| | 02:28 | those are the series of data
that provide data for the lines.
| | 02:32 | So you can click Year and then click Remove.
| | 02:36 | Now in the Horizontal or Category Axis
Labels group, click the Edit button, and
| | 02:41 | then you need to identify
which values you want to use.
| | 02:45 | So select the range of A2 through A5,
click OK, and you see you get your correct
| | 02:53 | values: 2007 through 2010.
| | 02:57 | Now when you click OK, the
chart looks as expected.
| | 03:00 | Finally, I'm going to show
you how to create a pie chart.
| | 03:04 | A pie chart is used to summarize proportion.
| | 03:07 | So, for example, if you had sales for
four different varieties--lemon, mandarin,
| | 03:12 | rosemary, and jalapeno--you can
create a pie chart to see the relative
| | 03:16 | contribution of each of those infusions.
| | 03:19 | To do that, you click any cell in the
data list or table, and then on the Insert
| | 03:24 | tab click Pie and then click the
type of chart you want to create.
| | 03:28 | In this case, just a regular 2-D pie chart.
| | 03:31 | Doing so creates a chart, and you can
see that lemon, mandarin, rosemary, and
| | 03:37 | jalapeno are all represented in the
body of the chart by their color, and the
| | 03:42 | size of the wedge indicates
their relative contribution.
| | 03:45 | So, for example, because the mandarin
value is 2,500, we would expect that to
| | 03:50 | have the largest pie wedge.
| | 03:52 | Mandarin is red. And sure enough,
it does have the largest wedge.
| | 03:56 | If you hover your mouse pointer over one
of the wedges, it gives you information
| | 03:59 | about the data that it summarizes.
| | 04:02 | The series is called Sales, the data
point is Mandarin, and the value is 2,500,
| | 04:08 | and it also indicates the proportion of the
whole that it represents--in this case, 36%.
| | 04:15 | Yes, what you just saw was not an
optical illusion: after awhile the tooltip that
| | 04:19 | appears does disappear
after about 10 or 15 seconds.
| | 04:24 | If you hover your mouse pointer
over another wedge, then you'll get a
| | 04:27 | similar display for there.
| | 04:29 | Charts help you summarize your data
visually, allowing your audience to grasp
| | 04:33 | your data's high-level trends and patterns.
| | 04:35 | After you summarize your data at
this level, you can break it down into
| | 04:38 | smaller chunks, examining various
aspects of your operation using charts or
| | 04:42 | the raw numbers.
| | Collapse this transcript |
| Creating a frequency distribution using the FREQUENCY function| 00:00 | When your business generates a large
data set, such as the details for thousands
| | 00:04 | of orders, you don't always want
to summarize your data by value.
| | 00:07 | In other words, you probably don't
need to count how many wholesale customers
| | 00:11 | ordered one case of olive oil,
two cases, three cases, and so on.
| | 00:14 | Rather than count the occurrences of
each value, you can summarize your data set
| | 00:19 | by dividing values into bins.
| | 00:21 | For example, you could create bins
for orders of up to 10, between 11 and 20
| | 00:27 | cases, 21 to 30, 31 to 40, and so on.
| | 00:32 | To create your frequency formula in
a column anywhere on the worksheet,
| | 00:36 | for ease of use I'll do it right
next to the Bins. Select one more cell
| | 00:41 | than there are bins.
| | 00:42 | So in other words, we have four bins
listed, so I'm going to select five cells.
| | 00:48 | It's very important that those cells remain
selected while you're typing in the formula.
| | 00:52 | Now with those cells selected, you
can create the frequency formula.
| | 00:56 | To do that,
type =FREQUENCY and then your data array.
| | 01:01 | In this worksheet, my Excel
table is called a CaseCount.
| | 01:06 | So I will type in Ca, and when
CaseCount is highlighted in the AutoComplete
| | 01:11 | list, I will press Tab.
| | 01:13 | I want to use the values in the Cases column.
| | 01:16 | So I'll just type C for Cases. It appears.
| | 01:19 | Press Tab and then a right square
bracket to close the table reference.
| | 01:24 | Now I'll type a Tab to move onto the bins
array and the bins occur in cells D2 through D5.
| | 01:30 | The formula looks correct.
| | 01:34 | I will type a right parenthesis.
| | 01:36 | Now, instead of pressing Enter, I'm
going to hold down the Ctrl and Shift
| | 01:41 | keys and press Enter.
| | 01:43 | This is called an array formula.
| | 01:46 | Another Excel commentator
calls these formulas CSE formulas.
| | 01:51 | The reason he does that is because you
hold down the Ctrl key, the Shift key,
| | 01:54 | and then press Enter.
| | 01:56 | Excel creates an array of formulas.
| | 01:59 | You can tell they are in an array, because
in the formula bar you can see that the
| | 02:03 | formula is outlined in curly brackets.
| | 02:07 | So that's how you know that
you created an array formula.
| | 02:10 | If you click an individual cell, such as here,
you'll see the same formula, here, and so on.
| | 02:16 | What the array formula is doing is
referencing the bin in the same relative
| | 02:21 | position in its column as the
cell that contains the formula.
| | 02:25 | So, for example, here for 36, it is
counting the number of orders of cases up to 10.
| | 02:32 | Then 32 is the count of orders up to 20.
| | 02:34 | 37 is the count of orders up to 30.
| | 02:37 | 35 is the count of orders up to 40.
| | 02:41 | Then 60 is the number of
orders of more than 40 cases.
| | 02:45 | Now one column to the right, I'm going
to show you what happens if you create
| | 02:49 | the formula incorrectly just by pressing Enter,
| | 02:51 | so you'll be able to recognize what
happened, just in case you get a wrong.
| | 02:55 | So I will select cells F2 through F6,
=FREQUENCY, and I'll just go through this
| | 03:02 | quickly, CaseCount and then Cases, comma,
and the bins are still in D2 through D5.
| | 03:10 | And instead of pressing Ctrl+
Shift+Enter, I'll just press Enter.
| | 03:15 | When I do, Excel creates a formula,
but it only looks at the first bin.
| | 03:20 | It doesn't create the formula
for the rest of these cells.
| | 03:24 | So there are two things to look
for when you create an array formula:
| | 03:27 | the first is that Excel fills the
entire array of selected cells with the
| | 03:32 | formula and secondly the formula
is surrounded by curly brackets.
| | 03:37 | That's the indicator that you
have created an array formula.
| | 03:41 | Now that you have your data, you can use
it to create a chart or whatever you like.
| | 03:45 | The FREQUENCY function lets you
summarize your data by dividing your values into
| | 03:49 | bins and counting the number
values that fall into each bin.
| | 03:52 | These summaries help make sense of
large data sets without forcing you to create
| | 03:56 | filters or group your data
using more cumbersome procedures.
| | Collapse this transcript |
| Creating a histogram| 00:00 | In the previous movie, I showed you
how to use Excel's FREQUENCY function to
| | 00:04 | divide your data into bins and to count the
number of values that fell into each bin.
| | 00:08 | In this movie, I'll show you how to
create a chart called a histogram, which
| | 00:12 | summarizes that bin information visually.
| | 00:15 | To create a histogram, you need
to use the Analysis ToolPak.
| | 00:18 | If the Data Analysis button doesn't
appear on the Data tab of the Ribbon at the
| | 00:23 | far right side, then you
need to install the add-in.
| | 00:27 | I explained how to install the
add-in in the movie in Chapter01.
| | 00:30 | To begin click of the Data Analysis button--
| | 00:33 | again, that's on the Data tab--and
in the Data Analysis dialog box, click
| | 00:39 | Histogram and then click OK.
| | 00:42 | In the Histogram dialog box, select the Input
Range, which in this case is here in the table.
| | 00:50 | So click the Contract dialog button,
and then in the worksheet, I'll just use my
| | 00:55 | scroll wheel to move up one row.
| | 00:59 | You want the data in the Cases column
of the table, so move your mouse pointer
| | 01:04 | over the column header, which says
Cases. And when the pointer turns into a
| | 01:08 | downward-pointing black arrow,
click, and Excel selects B2 through B201.
| | 01:15 | Click the Expand dialog
button and then click Bin Range.
| | 01:19 | Now something you need to watch out
for is that you don't want to select the
| | 01:24 | column label for Bins here, D2 through D5,
because what you did when you selected
| | 01:31 | this column in the table was you
selected B2 through B201, but Excel did not
| | 01:36 | select D1, which contains the column label.
| | 01:40 | So for the Bin Range, you
want to select D2 through D5.
| | 01:46 | When you select those, Excel gives you
that range, and you don't have to worry
| | 01:50 | about the column label being included.
| | 01:53 | If you did select data and included
column labels in it, then you would
| | 01:57 | check the Labels box.
| | 01:59 | We didn't in this case, so we don't have to.
| | 02:02 | I can also select the Chart Output box.
| | 02:06 | Selecting that box will cause Excel to
create a histogram chart in addition to
| | 02:11 | the bin frequency distribution data.
| | 02:14 | I do want to put the output on a
New Worksheet. Everything looks good,
| | 02:17 | so I'll click OK and have
Excel create my worksheet.
| | 02:21 | Excel summarizes the data by bin.
| | 02:24 | The first one contains orders up to 10 cases.
| | 02:27 | Next one is 11 to 20, 21 to 30,
31 to 40, and finally, more than 40.
| | 02:34 | It summarizes that data in this histogram,
which is a column chart broken down by category.
| | 02:40 | Histograms summarizes your data visually
by creating a chart that depicts how the
| | 02:43 | values fall into data
ranges, or bins, that you define.
| | 02:47 | When the ranges of your values
matter more than the individual values for a
| | 02:50 | particular analysis, use a
histogram to summarize that data.
| | Collapse this transcript |
| Calculating a running total| 00:00 | When you summarize data in an Excel
table, you can find totals, averages, and
| | 00:04 | counts of data in each column.
| | 00:06 | One feature that's not built-in,
however, is the ability to calculate the
| | 00:10 | cumulative percentage that
data contributes to the total.
| | 00:13 | In this movie, I'll show you how to
calculate a running total and also a
| | 00:17 | running contribution.
| | 00:19 | In the table on the screen right now,
the third column contains the total of the
| | 00:23 | values in the current row
and all the rows above it.
| | 00:26 | So, for example, in cell D3, we have
the value 327, and the formula--which I'm
| | 00:32 | not going to show you yet--calculates
the total of all the cells in column B in
| | 00:38 | the same row and above.
| | 00:40 | So, for example, the first cell
contains the value 199, and the second contains
| | 00:45 | the value 327, which is
199+128 and so on, down the line.
| | 00:52 | The formulas in column D--which
again, I'm not going to show you yet--
| | 00:56 | calculate the percentage that this
running total is of the total of all
| | 01:01 | values in column B.
| | 01:02 | Now I am going to delete these
formulas and show you how to create them.
| | 01:10 | First thing you do is click cell C2--
that is the first cell in the Running Total
| | 01:15 | column--and you type this formula:
"=sum(B2", and then press F4. Pressing F4
| | 01:28 | changes the relative cell reference, B2,
to the absolute reference of B2. And what
| | 01:34 | that means is that--well first,
functionally it adds Dollar signs in front of the
| | 01:38 | B and in front of the 2.
| | 01:40 | The B is the column reference, the 2 is
the row reference, and the Dollar sign
| | 01:45 | means that that reference will not
change when you copy it to another cell.
| | 01:50 | And I'm going to do that in a moment.
| | 01:52 | So we'll just see how it works, because
the sum is of the range from B2 to B2.
| | 02:01 | And then I'll type a right parenthesis.
| | 02:02 | Now what's going to happen is this
reference, the first B2, is going to stay the
| | 02:07 | same regardless of where the formula is
copied, and the second B2 will vary.
| | 02:13 | So, in other words, when the formula gets
copied to this cell it will be B3 here,
| | 02:17 | here it will be B4, and so on, because
you're moving down one row at a time, and
| | 02:22 | that means the row number will increase.
| | 02:25 | When I press Enter, Excel fills the
entire table column with formulas.
| | 02:30 | You can see here that the second data column
contains =SUM($B2:$B3), ($B2:$B4), ($B2:$B5), and so on.
| | 02:41 | The error indicator is there
because your formula omits adjacent cells.
| | 02:48 | In other words, what it saying is that
you have numerical values in every cell
| | 02:52 | below B4, so it thinks that you
might want to include them in your total.
| | 02:58 | You don't, and I'll show you how to turn that
error indicator off at the end of the movie.
| | 03:02 | Now let's do the same thing for a
running contribution, in other words the
| | 03:06 | percentage of the total--which is 2870--
that in individual cell value--199, 327, or
| | 03:14 | so on--contributes to that total.
| | 03:18 | To do that, type "=C2"--and this time
we are going to leave the reference
| | 03:24 | relative, so it will change--divided
by the sum of the Amount column's values.
| | 03:32 | So the name of the table is OrderSummary,
so I'll type that and then accept the
| | 03:38 | AutoComplete offering. Left square
bracket. And we want to use the values in the
| | 03:44 | Amount column, so I'll type an A
and then Tab to except the formula
| | 03:49 | AutoComplete. Right square bracket.
Right parenthesis to close. Press Return
| | 03:55 | and we get our cumulative percentage.
| | 03:58 | So 199 is 6.93% of 2,870, 327
is 11.39% of 2,870, and so on.
| | 04:09 | Now I'll show you how to get
rid of the error indicators.
| | 04:11 | There are number of ways to do it, but
in this context the quickest way is to
| | 04:15 | select any cell that displays that
error--and the indicator is the green
| | 04:19 | triangle in the upper left-hand corner--
click the Error Options button that
| | 04:23 | appears, and then click Error Checking Options.
| | 04:28 | On the Formulas tab, you can go down to
the Error Checking Rules area and clear
| | 04:33 | any check box next to an error
that you don't want to display.
| | 04:37 | In this case, we don't want to display an
error for formulas which omit cells in a region,
| | 04:43 | so we'll clear that check box,
| | 04:45 | click OK, and the indicators disappear.
| | 04:48 | Finding a running total and running
percentage help you visualize how your
| | 04:51 | revenue accumulates over time.
| | 04:53 | If you operate a retail business, you
might find that the first 10 months of
| | 04:57 | the year contribute about half of
your revenue, with the remaining half
| | 05:01 | generated during the last two months
of the year, which is the traditional
| | 05:04 | gift giving season.
| | Collapse this transcript |
| Summarizing data using a PivotTable| 00:00 | You can use Excel to organize your data
in many different ways, but the standard
| | 00:04 | Excel worksheet doesn't let you
reorganize your data on the fly.
| | 00:07 | However, Excel does have a built-in
tool that lets you change your data's
| | 00:11 | organization and summary dynamically.
| | 00:13 | That tool is called a PivotTable.
| | 00:16 | To create a PivotTable, first you
need a data list, such as an Excel table.
| | 00:21 | The Excel table I have here has the
data for year, month, sales channels--such
| | 00:25 | as web, catalog, or store--and then the
sales for the particular vector during a month.
| | 00:32 | To create a PivotTable, you select any
cell in your data list, and then on the
| | 00:38 | Insert tab, click the PivotTable button.
| | 00:42 | In the Create PivotTable dialog box,
you make sure that Excel has correctly
| | 00:47 | identified the table or range that you
wanted to use to create the PivotTable--in
| | 00:52 | this case it's Table1, which is outlined
here--and then you choose where you want
| | 00:56 | the PivotTable to be placed.
| | 00:58 | I always place my
PivotTables on a new worksheet.
| | 01:01 | That way the PivotTable isn't
competing with the original data.
| | 01:05 | Now if you had an existing
worksheet that had some charts or some other
| | 01:09 | information that you used as part of
a dashboard, then you could certainly
| | 01:12 | create your PivotTable there.
| | 01:14 | But for the most part, because
PivotTables can get kind of large, you will want
| | 01:18 | to put it on its own worksheet.
| | 01:20 | Everything looks good, so I will click OK, and
Excel creates the PivotTable on a new worksheet.
| | 01:26 | Now, I can add fields to the PivotTable.
| | 01:28 | So let's say, for example, that I just
wanted to summarize sales for each year.
| | 01:32 | I'll put those in the Row Labels area.
| | 01:36 | When I do, Excel updates the PivotTable
| | 01:39 | so that the row headers are all of the
years that are available within the data
| | 01:43 | set, and now I can add that the
sales channel to the Column Labels area.
| | 01:50 | So I have 2008, 2009, 2010;
| | 01:53 | Store, Web, and Catalog. And finally, I can
add the Sales values to the Values area.
| | 02:02 | So in 2008 I had over $61,000 in
catalog sales, $82,000 in in-store sales, and
| | 02:10 | over $143,000 in web sales.
| | 02:14 | I also have a grand total for each of
the years: 287,000, 277, 285, and so on.
| | 02:20 | If I add another layer of organization
within the PivotTable--such as by adding
| | 02:25 | a month, which I'd do by dragging it
from the Field List area to the Row Labels
| | 02:30 | area, underneath Year--you will see
that now 2008 has individual results for
| | 02:36 | January, February, and so on.
| | 02:38 | And the grand total is here at the top,
287,872. That's the Grand Total for 2008;
| | 02:45 | 2009, 277; and so on.
| | 02:50 | And then I also have the total for each month.
| | 02:53 | So in January we had these values for
catalog sales, store and web, and they add
| | 02:58 | up to this total, and the
same for the other 12 months.
| | 03:01 | If I scroll down to the bottom of the
PivotTable, we will see that we have
| | 03:06 | grand totals for all of the columns
Catalog, Store, and Web, and also a grand
| | 03:12 | total of those values.
| | 03:13 | If you want, you can hide the
subtotals and grand totals or move them around.
| | 03:17 | To do that, you click any cell in
the PivotTable, and then on the Design
| | 03:21 | Contextual tab, you use the
tools in the Layout group.
| | 03:25 | So, for example, if you wanted to
hide your subtotals, you can click Do Not
| | 03:29 | Show Subtotals. And I personally
prefer to show subtotals at the bottom of a
| | 03:34 | group, as opposed to the top.
| | 03:35 | So that's one change I always make.
| | 03:37 | I always click Show all
Subtotals at the Bottom of the Group.
| | 03:42 | So now instead of having 2010 and the
subtotals here, I have the 2010 subtotals
| | 03:50 | at the bottom and the grand
totals were in the usual place.
| | 03:54 | If you want, you can turn grand
totals, which are at the end of a column or
| | 03:59 | the end of a row, on or off. Or you can
turn them on for rows or on for columns only.
| | 04:04 | So, for example, if I only want the
totals for the columns, I can click On for
| | 04:09 | Columns Only, and the grand totals at
the right edge--which would summarize each
| | 04:14 | month or each year--disappear. But I
still have my grand totals at the bottom,
| | 04:20 | which are for my three different sales vectors.
| | 04:22 | Now I am going to take the Month field
out of the PivotTable again, by dragging
| | 04:27 | it from the Row Labels area to the
Fields area. And I'll scroll back up.
| | 04:32 | Right now, I'm summarizing the data in
my PivotTable using the SUM function;
| | 04:36 | in other words, I am finding the
arithmetic sum of all the values within each of
| | 04:40 | these categories: catalog sales
for 2008, 2009, 2010, and so on.
| | 04:45 | If I want, I can add a second instance of
the sales values to the Values field by
| | 04:51 | dragging Sales down to the Values area.
And you see I have Sum of Sales1 and Sum
| | 04:58 | of Sales2, and that organization
is reflected here in the Values box.
| | 05:05 | What I'll do is change the summary
operation for Sum of Sales2, so that instead
| | 05:10 | of the arithmetic sum--the total--we're
going to have the percent contribution to
| | 05:15 | a column or to a row.
| | 05:17 | So I will click Sum of Sales2
and click Value Field Settings.
| | 05:24 | In the Value Field Settings dialog box,
I'll click Show values as. And then in
| | 05:28 | the Show values as list, I will click % of row.
| | 05:35 | Now when I click OK, Excel indicates
that the first value is 21% of the total,
| | 05:41 | the second is 28.59% of the total,
and the third is 49.96% of the total.
| | 05:49 | In other words, your web sales--which were
$143,000--contributed almost 50% of the total value.
| | 05:57 | Now those values are little but hard to
read because there are no commas, so I
| | 06:00 | am going to change the Number
format for the Sum of Sales field.
| | 06:04 | To do that, I will go down to Sum of
Sales, click the down arrow, and click
| | 06:09 | Value Field Settings.
| | 06:11 | Now I'll click Number Format, and we
can use the Format Cells dialog box's
| | 06:16 | controls to change the format.
| | 06:18 | In this case, I will change it to
Currency. And I don't have any Decimal places--
| | 06:23 | no sense, in other words--I will change
Decimal places to 0. Click OK. Click OK
| | 06:30 | again, and my values have changed,
and now they are easier to read.
| | 06:34 | PivotTables give you enormous
flexibility when you analyze your data.
| | 06:39 | If you take the time to learn how to
use them effectively, you'll find that you
| | 06:42 | can develop useful insights
into your data very quickly.
| | 06:45 | For more information on analyzing data
with PivotTables, see my course "Excel 2007:
| | 06:51 | Pivot Tables for Data Analysis"
elsewhere in the lynda.com Online
| | 06:55 | Learning Library.
| | Collapse this transcript |
|
|
4. Describing Data Using Numerical MethodsCalculating mean, median, and mode| 00:00 | When you analyze a data set, you can
often gather a great deal of information
| | 00:04 | about the data by finding its central values.
| | 00:07 | In this movie, I'll show you how to
calculate three different types of central
| | 00:10 | values: the mean, the median, and the mode.
| | 00:14 | The common word for the arithmetic
mean of a data set is the average.
| | 00:18 | It's the sum of the values
divided by the number values.
| | 00:22 | So, for example, in this table here,
I have seven days and then the sales
| | 00:27 | totals for those days.
| | 00:29 | If I want find the average in cell B1, I
can type "=average(" and then the name of
| | 00:37 | the table, which is DailySales. So I
will start typing it and then press Tab to
| | 00:43 | accept the formula AutoComplete value.
| | 00:46 | Now I want the Sales column, so I'll
type a left square bracket, start typing
| | 00:51 | Sales, Sales is highlighted,
| | 00:53 | Tab to accept it, right square bracket, and
then a right parenthesis to close the formula.
| | 00:59 | When I press Enter, the value appears:
3,825.9. Averages can be misleading if you
| | 01:07 | have one or two disproportionately
large values in a data set, such as days
| | 01:11 | where customers made unusually big orders.
| | 01:14 | So let's say, for example, that on
March 1, instead of having $5,194 in sales
| | 01:19 | that a customer had ordered
enough to push the total up to $20,000.
| | 01:24 | When you press Enter, you see that the
mean has gone up to almost $6,000, when all
| | 01:30 | of the other values except for that
one 20,000 are still in the $3,000-$4,000
| | 01:36 | range for the most part.
| | 01:37 | So always be careful when you're
examining your data to make sure that you don't
| | 01:43 | include any really large values,
anything that doesn't seem to fit with the rest
| | 01:47 | of the data pattern.
| | 01:48 | Now if those values occur a couple of
times, in your estimation, well it's not
| | 01:52 | unusual to have days where there are
huge orders, then certainly include them;
| | 01:56 | that's part of the business knowledge
that you have about your operation. But if
| | 02:00 | it's just one instance, then don't
worry about it; you should exclude it.
| | 02:04 | So I guess the question is how do you do that?
| | 02:06 | Well you do that by filtering
your value out of the table.
| | 02:11 | So, for example, if you want to
create a filter, you could click the Sales
| | 02:14 | column's Filter arrow, click Number
Filters that are Less Than, and let's say
| | 02:20 | 10,000 as a reasonable upper bound.
| | 02:23 | When we click OK, Excel filters
the table, but you'll notice that the
| | 02:29 | mean didn't change;
| | 02:30 | it still looks at all the values.
| | 02:31 | What we can do, however, is add a total
row at the bottom of the table, and that
| | 02:35 | will display the average
of only the visible values.
| | 02:39 | To do that, you click any cell in the
table, and then on the Design Contextual
| | 02:43 | tab, check the Total Row box.
| | 02:47 | When you do that, it displays a total,
but if you click the cell in the Total
| | 02:52 | Row, click the down arrow, and then
click the AVERAGE function, Excel displays a
| | 02:58 | value of 3,598, which is the
average of only the visible values.
| | 03:02 | If you remove the filter, by clicking
the Filter arrow and then clicking Clear
| | 03:06 | Filter from Sales, then Excel's AVERAGE
function here in the Total Row gives the
| | 03:11 | same answer as the formula in cell B1.
| | 03:14 | Next I would like to
talk about the median value.
| | 03:17 | The median value of a data set is the
value in the middle of the values if they
| | 03:21 | were sorted in ascending order--
that is, from lowest to highest.
| | 03:25 | To discover the median in the data set,
you use the MEDIAN function, so you type
| | 03:32 | "=MEDIAN(" and then the
name of the table, which is CustomerCount,
| | 03:38 | and we are using the data in
the Customers field, or column.
| | 03:44 | Press tab to accept the field name.
Right square bracket. Right parenthesis.
| | 03:49 | Everything looks right.
And we get the median value of 116.
| | 03:54 | Now the values aren't sorted in the
table by customers, so to do that--to
| | 03:58 | make sure that 116 is the proper
value--I will sort these values from
| | 04:03 | smallest to largest. And you see that
the value that appears in the middle is
| | 04:07 | 116, which is the median.
| | 04:10 | The median value is not affected by
extremely large or extremely small values.
| | 04:15 | So, for example, if I were to change
the value 130 to 500, or even 5,000, and
| | 04:23 | press Return, the median doesn't
change because 116 is still in the middle.
| | 04:27 | The reason that happened is because
5,000 is still in the same half, if you were,
| | 04:33 | of the data set. So we have 49, 51, to
74 that are all less than 116, and 123,
| | 04:39 | 126, and 5,000 that are all greater than 116.
| | 04:43 | If I were to change 5,000 to 1 and
press Return, it would move to a different
| | 04:47 | half of the data set, and you'll notice
that 74 is the new median, which we can
| | 04:53 | verify by resorting the data from
smallest to largest. And you see that 74 is
| | 04:59 | now in the median position.
| | 05:02 | Finally, I would like to talk about the mode.
| | 05:04 | The mode of a data
collection is the most common value.
| | 05:08 | In Excel 2007, if multiple values are
tied for the highest frequency, the program
| | 05:12 | picks the lowest value as the mode.
| | 05:15 | So, for example, in the data set
here, if I want to find the most common
| | 05:18 | value in the Customers column, type
"=MODE(" and then the name of the table,
| | 05:26 | which is "HourlyCustomers," and we're
looking in the Customers column, using
| | 05:33 | that data. Right square bracket.
Right parenthesis to close. Return. And the
| | 05:39 | MODE formula returns a value 28.
| | 05:41 | And if we sort this data like we did
last time, sorting from smallest to largest,
| | 05:46 | then we'll see that indeed the
number 28 occurs three separate times.
| | 05:50 | The number 14 occurs twice. And because
it's lower than 28, if I were to change
| | 05:54 | one of the 28s to another value that
doesn't occur--let's say 30, which doesn't
| | 05:59 | appear anywhere else--and hit Return,
then the mode sees that there are two
| | 06:03 | occurrences of 28, two occurrences of
14, and using its own rules, it selects the
| | 06:09 | lower value, 14, as the mode.
| | 06:12 | Calculating modes is only useful for
what's called "discrete data," where the
| | 06:16 | values are whole numbers such as 10, 25, or 50.
| | 06:20 | Continuous data, where the
measurements could have decimal components,
| | 06:24 | either won't have the most common
value or there will be a duplication
| | 06:28 | that occurs entirely by chance.
| | 06:30 | Most of these statistical techniques you
will use in this course rely heavily on
| | 06:34 | the mean, or average, value of a data set.
And unless your data set contains one
| | 06:39 | or two extremely large or extremely
small values that distort the average, it's
| | 06:43 | the most useful value.
| | 06:44 | If you're interested in the value
that would be in the middle of a range if
| | 06:47 | you sorted the values in ascending or
descending order, calculate the median.
| | 06:52 | Calculating the mode, the most common
value in a data set, could point out the
| | 06:55 | most common answer in a survey and
help you discover information about your
| | 06:59 | customers and their
opinions about your business.
| | Collapse this transcript |
| Calculating a moving average| 00:00 | When you track data over time, you might want
to find the average of the last three values.
| | 00:04 | For example, one measure of your
company's recent performance might be its
| | 00:08 | average sales over the past three months.
| | 00:10 | You can calculate a moving average
in a worksheet by creating a series of
| | 00:14 | formulas by hand, or by using a
tool in the Data Analysis ToolPak.
| | 00:18 | To use the tool in the Data Analysis
ToolPak, you select the Data tab on the
| | 00:23 | Ribbon and then click Data Analysis.
| | 00:27 | If Data Analysis doesn't appear, then
you need to install the Analysis ToolPak,
| | 00:33 | which is available as an add-in.
| | 00:35 | I showed you how to install that
add-in in a movie in chapter 1.
| | 00:40 | With it installed, click Data
Analysis, and then in the dialog box that
| | 00:44 | appears, scroll down and click Moving
Average. Click OK and the Moving Average
| | 00:51 | dialog box appears.
| | 00:53 | Now you can select your Input Range.
| | 00:54 | These are the values that Excel will
use to calculate the moving average.
| | 00:58 | In this case, I want to use cells C2
through C13, and I did not select the Sales label,
| | 01:07 | so I don't need to check that box.
| | 01:09 | Next, I need to set the Interval.
| | 01:12 | The interval is the number values that
you want to include in your moving average.
| | 01:17 | In this case, I will click 3.
| | 01:19 | That means that the moving average that
will be calculated for these three cells
| | 01:25 | will appear in cell D4.
| | 01:27 | The moving average for these three
cells, C6 through C4, will be displayed
| | 01:32 | in cell D6, and so on.
| | 01:34 | Now I'll select my Output Range, and
those are the cells here, D2 through D13.
| | 01:44 | I don't want a chart.
| | 01:45 | I don't have to worry about standard errors.
| | 01:47 | We will cover those later in the course.
Now I will click OK, and Excel adds a
| | 01:51 | new column to my table, and it
also displays the moving averages.
| | 01:55 | The first thing I'll do is
change the name of the table, Column.
| | 01:59 | To do that, you click the cell where the
table column header appears, and I will
| | 02:04 | call it "Moving Average," and double-
click the right edge of the column header to
| | 02:13 | expand it, so it displays the
entire contents of the cell.
| | 02:16 | Now we can look at the moving average formulas.
| | 02:18 | The first two cells in the column,
because I had an interval of three, do not
| | 02:23 | have a value. That's because there
aren't three values that you could use to
| | 02:27 | calculate your moving average. But the
value in D4 is a moving average, and what
| | 02:32 | Excel did was it created
the formula =AVERAGE C2 to C4.
| | 02:37 | So it finds the average of these
cells and puts that value here. And it
| | 02:43 | continues on, C3 through C5,
C4 through C6, and so on.
| | 02:49 | Moving averages offer great insights
into your data, but you should adjust the
| | 02:54 | number of values you include based
on your data set's characteristics.
| | 02:57 | For a call center that receives
thousands of calls a day, you might take an
| | 03:01 | average of the last seven days.
For a smaller company that receives five or ten
| | 03:05 | outside orders a day, perhaps a moving
average based on monthly the data would
| | 03:08 | be more appropriate.
| | Collapse this transcript |
| Calculating a seasonal forecast| 00:00 | Many businesses' sales patterns follow
predictable patterns based on the time of year.
| | 00:05 | Garden stores tend to sell the most
plants during the spring, and toy stores
| | 00:08 | tend to sell a huge percentage of their
yearly total during the winter gift-giving season.
| | 00:14 | If you believe your data follows a
seasonal trend, you can create a series of
| | 00:18 | formulas to forecast your
company's sales based on those patterns.
| | 00:21 | In this example, I have quarterly data,
and because it's broken up by quarter,
| | 00:26 | there will possibly be a pattern
that would apply based on each quarter's
| | 00:30 | positioning within the year.
| | 00:32 | We will be comparing Q1 to Q1;
for example, Q1 2006 to Q1 2007,
| | 00:37 | do the same for Q2 and Q2 and
so on. But if your data tracks calls to a
| | 00:43 | call center, you could also compare Monday
to Monday, Tuesday to Tuesday, and so on.
| | 00:47 | There are many ways to estimate the
effect of seasons on corporate revenue:
| | 00:51 | one way to do it is to create a
forecast based on the actual sales and forecast
| | 00:55 | from the last two quarters.
| | 00:57 | There's no precise way to decide what
way you should get to the two factors, but
| | 01:01 | a reasonable way to start is to
multiply the sales from the quarter a year ago
| | 01:06 | by two thirds and the sales from the
same quarter two years ago by one third;
| | 01:10 | that way you're giving more weight
to recent performance, but you're not
| | 01:13 | discounting what has happened earlier.
| | 01:15 | For the first four quarters, which would
be cells C2 through C5, there can be no
| | 01:21 | forecast because there is no previous data.
| | 01:24 | For the second set of quarters, the
quarters for the second year, you only have
| | 01:28 | the previous sales to go on,
so you create that formula.
| | 01:32 | We have equal =B2. And again, this is Q1
2007, and the only applicable data you
| | 01:39 | have us from Q1 2006, so
that's how you create your formula.
| | 01:44 | If you copy that formula down, by
dragging the fill handle, until it covers
| | 01:48 | Q4, then you see that you've repeated the
values from the previous quarters a year ago.
| | 01:55 | Now starting with your third year, you
can begin to use your weighted formula.
| | 02:00 | To do that, you type equal, and then
we're taking the value from two years ago--
| | 02:05 | same quarter--multiplied by one third.
So that would be B2 times 0.33, plus the
| | 02:12 | value from the same quarter one year ago,
and that is the cell B6, outlined in
| | 02:18 | green, and we will multiply that by 0.67.
| | 02:22 | When we press Return, we get a forecast
of 58,925, which is reasonably close to
| | 02:28 | the actual sales of 63,822.
| | 02:32 | Now if we copy the formula down, Excel
extends the formula and uses the values
| | 02:40 | from four rows above and from eight rows above.
| | 02:43 | So, for example, here for Q2 2008,
we're looking at cell B7, which is Q2
| | 02:50 | of 2007, and B3, which is Q2 of 2006, and
the value 61,000--reasonably close. Not great.
| | 03:00 | Then we have 58,000 and 70,000 as
forecasted, and again, there will be variations in
| | 03:05 | business cycles, and so on.
| | 03:07 | What you're looking for over time is
a formula that best approximates your
| | 03:12 | business's performance, and there's no real
way to tell you what it's going to be the start;
| | 03:16 | you will just have to find it on your own.
| | 03:18 | The data in this table presents a
challenge because the quarters for the first
| | 03:21 | three years show little growth.
| | 03:23 | Seasonal forecasts start working well
when you have at least five years of data
| | 03:27 | to draw on, but keep adjusting the
formula's weights as times goes by to see if you
| | 03:32 | can find a good fit.
| | Collapse this transcript |
| Calculating percentiles and quartiles| 00:00 | Large data sets often contain a wide
range of values, so it can be useful to
| | 00:04 | find the top and bottom percentages of the data.
| | 00:07 | For example, if you track sales data,
you might want to discover the value that
| | 00:11 | delineates the top 20% of your
sales days from the bottom 80%;
| | 00:15 | alternatively, you might want to find the
value at each quarter, 25%, 50%, and 75%.
| | 00:22 | These values are called quartiles because
they divide the data set into four quarters.
| | 00:28 | To calculate a percentile in Excel
2007, you use the PERCENTILE function.
| | 00:34 | So, for example, if you wanted to find
the 50th percentile--that is the value
| | 00:38 | where half the data set is above the
value and half the data set is below--you
| | 00:43 | would create the formula =PERCENTILE,
and then you need to name the array. And
| | 00:50 | the array in this case is called Table1--
it's the Excel table at the left side of
| | 00:56 | the screen. And we are looking
at the data in the Sales column,
| | 01:00 | so I will type S and accept Sales
from the formula AutoComplete list.
| | 01:04 | Type a right square bracket and then a comma.
| | 01:07 | The second argument in the PERCENTILE
function is K, and K is simply the percentile.
| | 01:13 | So I'm looking for 50%. Type in 50%.
Right parenthesis to close the formula, and
| | 01:21 | you see that the value 1,185 is the value
where half the data set as above and half
| | 01:27 | the data set is below.
| | 01:29 | If you wanted to calculate the 90th
percentile, you would create a similar
| | 01:32 | formula: =PERCENTILE, Table1, Sales,
90%, right parenthesis. Everything looks
| | 01:44 | good. Hit Return and that value is 2,167.
| | 01:49 | So 10% of the values in the data
set will be 2,167 or above. 90% will be
| | 01:55 | less than that amount.
| | 01:56 | If you want to see the exact values,
that represent, say the top 10% of a data
| | 02:02 | set, you can create
what's called a top 10 filter.
| | 02:05 | To do that, you go into your Excel
table and click the Filter arrow of the
| | 02:10 | column that you want to filter--in
this case it's Sales data--point to Number
| | 02:15 | Filters, and click Top 10.
| | 02:18 | In this case, we are looking for the top
10%, so top is correct, 10 is correct,
| | 02:23 | but we need to change items 2%.
| | 02:27 | When we do, click OK, and Excel
displays the values that represent the top 10%
| | 02:32 | of values from this data set.
| | 02:34 | You can click the Filter
arrow again and clear the Filter.
| | 02:39 | As I mentioned before, quartiles
represent the 25th, 50th, and 75th percentile,
| | 02:45 | plus the lowest and highest values.
| | 02:48 | So let's say that you wanted to find
the 25% level, which is the first quartile.
| | 02:53 | To do that, you create a formula using
the QUARTILE function, so it's =q. QUARTILE
| | 02:57 | is the first option. Press Tab to
accept it. And it's Table1, Sales, comma, and
| | 03:08 | quart is short for quartile.
| | 03:11 | You can have quartiles from 0 to 4.
And in this case we are looking for 25%,
| | 03:16 | so that is the first quartile. Right
parenthesis to close the formula. And the
| | 03:21 | value is 405.5. Even though the data set
doesn't include any fractional values,
| | 03:28 | the quartile value says that 25% of
the data will be below that value, which
| | 03:33 | means that if there's a 405 it will be in
the bottom 25%, or in the first quartile.
| | 03:38 | Let's do the same thing for 75%.
| | 03:41 | Type =QUARTILE, Table1, Sales, comma.
And we are looking for the third quartile.
| | 03:51 | You can calculate the quartile by
dividing the percentage by 25. So in this case,
| | 03:56 | we have 75% divided by 25 is 3,
so we are looking for the third quartile.
| | 04:01 | Everything looks good. And that value is 1,835.5.
| | 04:05 | So a quarter of the values will be
above that level, and 75% will be below.
| | 04:11 | If you were to set the quartile to 0,
that would return the lowest value, and
| | 04:15 | it's equivalent to the MIN function.
| | 04:18 | So in other words, if I were to type
QUARTILE, Table1, Sales, and then 0, that
| | 04:26 | would return the lowest value, which is 159.
| | 04:29 | If I were to edit that formula and
change the quartile to 4, that would be the
| | 04:36 | highest value, and would be
equivalent to the max formula.
| | 04:39 | So if I hit Return, we get
2,347, which is the largest value.
| | 04:46 | The last topic I'd like to
mention in this movie is called the
| | 04:49 | interquartile range.
| | 04:51 | You can find two adjacent quartile
values and subtract the lower from the higher
| | 04:54 | value to let you know how much
spread you have between your quartiles.
| | 04:57 | For example, in this case, I will
calculate an interquartile range for the third
| | 05:03 | and the second quartiles.
| | 05:05 | To do that, type =QUARTILE.
| | 05:08 | The array is Table1, Sales, quartile 3, and
then right parenthesis, minus, quartile,
| | 05:21 | Table1, Sales, quartile 2. Right
parenthesis. Everything looks good. Hit Return
| | 05:29 | and you see that the difference between
those two quartile markers is 650.5. An
| | 05:35 | interquartile range gives you an idea
of how spread out your data is, and like
| | 05:39 | the median--which I discussed in a
previous movie--is relatively unaffected by
| | 05:43 | unusually high or low values.
| | 05:46 | In the next movie, I'll show you how
to calculate a data set's variance and
| | 05:49 | standard deviation, which give you even
more powerful insights into your data.
| | Collapse this transcript |
| Calculating variance and standard deviation| 00:00 | In previous movies, I've shown you how
to summarize your data using a variety of
| | 00:03 | techniques: from finding the mean,
| | 00:06 | also called the average; the median,
which is the middle value when you sort
| | 00:09 | your data into an ascending or descending order;
and the mode, which is the most common value.
| | 00:15 | In this movie, I'll show you how to
use a data set's mean to derive two
| | 00:18 | very useful statistics:
| | 00:20 | variance and the standard deviation.
| | 00:23 | As the name implies, variance is a
measure of how much the data in a set
| | 00:27 | diverges from the average.
| | 00:29 | You calculate variance by subtracting
each individual value from the average,
| | 00:32 | squaring the result, adding the
square values together, and dividing by the
| | 00:36 | number values in the data
set. That's a mouthful.
| | 00:39 | It's a lot to keep in your head at
one time, so I'll show you an example.
| | 00:43 | The data set that I have contains five
values: 5, 10, 15, 20, and 25. To calculate
| | 00:48 | the variance for these cells,
| | 00:50 | I will type the formula, equal, and then a left parenthesis
because I'm going to be creating a two-step formula.
| | 00:57 | and that means that I need to use
parenthesis to group what I'm doing.
| | 01:01 | So you'll see how it works out in a second.
| | 01:03 | So it's =AVERAGE(, and then we're going
to be finding the average of the values
| | 01:11 | in the range A1 through A5. So it's A1.
| | 01:15 | I am going to copy this formula to
other cells, but I don't want this reference
| | 01:19 | to change when I do.
| | 01:20 | So instead of using a relative
reference like I have now, I'm going to press F4
| | 01:25 | to make it an absolute
reference which won't change.
| | 01:28 | Now I will type :A5, and again F4 to make it
an absolute reference, and a right parenthesis.
| | 01:37 | Now I'm going to subtract the value in
A1 from the average and square the result.
| | 01:43 | So I subtract A1, and I do want this cell
reference to change when I copy the formula,
| | 01:50 | so I'll leave it as a relative reference.
Type a right parenthesis. And now I'm
| | 01:55 | going to square the result.
| | 01:57 | So that is a caret, or capital 6,
if you think of it that way 2.
| | 02:02 | So this formula finds the average of
the range A1 through A5, subtracts the
| | 02:07 | value in A1 from that average,
and then squares the result.
| | 02:11 | When I press Enter, Excel displays 100.
| | 02:14 | Now when I copy that formula down,
we get 100, 25, 0, 25, and 100.
| | 02:24 | And let's take a look at the value in B1,
the result of the formula, to make sure
| | 02:27 | that it makes sense.
| | 02:29 | So we have the values 5, 10, 15, 20, and
25 in A1 through A5, so we have 15, 30,
| | 02:35 | 50, 75 divided by 5, so the average is 15.
| | 02:40 | If you subtract 5 from 15, you get 10,
and if you square the result, you get 100.
| | 02:45 | So the formula is working properly.
| | 02:47 | Now to calculate the variance, we
create the formula =SUM,(B1:B5),
| | 02:56 | and I'm not going to copy the
formula anywhere so it doesn't matter that
| | 02:59 | it's a relative reference.
| | 03:01 | And then we divide it by the number of
values in our range, so that would be
| | 03:05 | COUNT(A1:A5).
| | 03:09 | So we are finding the sum of the
variances squared and dividing by the
| | 03:17 | count of A1 through A5.
| | 03:20 | When I press Enter, we get a variance of 50.
| | 03:24 | There is an easier way to do that.
| | 03:25 | I just wanted to walk
you through it to show you.
| | 03:28 | You can use the variance or variance for the
population function to get the same result.
| | 03:34 | So to do that, you type "=VARP"--which
calculates the variance based on an entire
| | 03:39 | population--and then values in A1
through A5. Right parenthesis. Press Return
| | 03:48 | and we get the value of 50.
| | 03:49 | There are two functions that
you can use to calculate variance:
| | 03:53 | VAR and VARP. The one that I used here,
| | 03:56 | VARP, is what you use when you know
every data point in a population. In other
| | 04:01 | words I know that this is my entire
data population, so I can use VARP.
| | 04:07 | If, on the other hand, I'm working from a
sample, and I don't know every value in
| | 04:10 | a data population, then I
need to use the VAR formula.
| | 04:14 | So if I were to type "=VAR(A1:A5)" and
press Tab, then I get the value 62.5, which
| | 04:26 | is different from the
VARP formula. Why is that?
| | 04:30 | Well, the reason is that it divides
the sum of the squares by one less than the
| | 04:34 | size of the population.
| | 04:36 | In other words, instead of taking COUNT A1
through A5, it divides by COUNT A1 through A5-1.
| | 04:42 | So let me change that here by
subtracting one from the COUNT: just add a
| | 04:48 | parenthesis, minus one, and then
a right parenthesis to close it.
| | 04:53 | The formula as it reads now produces
the same result as the VAR function.
| | 04:58 | Dividing the sum of squares by one less
than the size of the population makes it
| | 05:03 | a more conservative measure, which
reflects the uncertainty inherent in
| | 05:07 | calculations using a sample of a population.
| | 05:10 | In practice, we use the VAR function
much more often than VARP, because we
| | 05:13 | rarely have data from the entire population.
| | 05:16 | Now variance is interesting, but
it's not used that much in statistics.
| | 05:21 | However, you can use a data set's
variance to calculate its standard deviation.
| | 05:25 | To find a data set's standard deviation,
you take the square root of the variance.
| | 05:30 | So, for example, if the variance for
this data set were 62.5, and we take the
| | 05:37 | square root of that value, which is
in B9, and press Tab, we get 7.9, so on.
| | 05:47 | So is there a separate function
to calculate standard deviation?
| | 05:50 | Well, of course there is. There are two.
| | 05:52 | And in the same way that you can
calculate variance based on either a sample or
| | 05:57 | the entire population and use the VAR
or VARP function, standard deviation has
| | 06:03 | the STDEV and STDEVP functions that you can use.
| | 06:07 | And I will show you both of them.
| | 06:09 | So, for example, if you wanted to
find the standard deviation of the values
| | 06:12 | in A1 through A5, you can type
"STDEV(A1:A5)", Return, and you get the same
| | 06:23 | value that we had here.
| | 06:25 | If you know all the values, you can use
STDEVP, which is =STDEVP(A1:A5), Return,
| | 06:38 | and you get 7.07, which, as it
turns out, is the square root of 50.
| | 06:44 | So, everything matches up.
| | 06:46 | Standard deviation provides extremely
useful information about a data set.
| | 06:50 | As I'll show you throughout the rest of
the course, knowing a data set's standard
| | 06:53 | deviation lets you make predictions
about the entire data set, even if you only
| | 06:57 | know a portion of the values.
| | Collapse this transcript |
| Summarizing data using descriptive statistics| 00:00 | Excel has quite a few statistics
functions you can use to summarize your data.
| | 00:04 | Rather than make you create individual
formulas for each measure, Excel groups
| | 00:08 | the most commonly used formulas into a
single descriptive statistics tool, which
| | 00:13 | is available through the Analysis ToolPak.
| | 00:15 | To use the Analysis ToolPak, you
click the Data tab. And then if you installed
| | 00:21 | the Analysis ToolPak add-in, you
will see it here under the Data Analysis
| | 00:26 | button, in the Analysis group.
| | 00:28 | If you don't see that button on the
Data tab, then you need to install the
| | 00:32 | Analysis ToolPak add-in.
| | 00:33 | I explained how to do that in
a movie earlier in this course.
| | 00:37 | To begin your analysis, click the Data
tab and then click Data Analysis. And in
| | 00:43 | the Data Analysis dialog box, click
Descriptive Statistics and click OK.
| | 00:49 | When you do, the Descriptive
Statistics dialog box opens.
| | 00:53 | You can now select your Input Range. And
the data that we want to analyze is in
| | 00:57 | this column of our table, so we will
click the Collapse dialog button at the
| | 01:02 | right edge of the Input Range box,
and then select this table column.
| | 01:08 | To do that, you move your mouse pointer
over the header until the mouse pointer
| | 01:13 | changes to downward-pointing black arrow,
and when you click, Excel select cells
| | 01:19 | B2 through B32, skipping
the header cell here in B1.
| | 01:24 | So our data is in B2 through B32,
so everything looks correct.
| | 01:29 | The data is grouped by columns. And we
don't have a label in the first row, so
| | 01:33 | we don't need to check that box.
And I do want to put the data on a new
| | 01:37 | worksheet, so I will
leave all of those the same.
| | 01:40 | If I did wanted to put my descriptive
statistics on the same worksheet, then I
| | 01:45 | could select the Output Range option,
and select another cell, such as D2. But in
| | 01:51 | this case, I do want to put it on
the new worksheet, and I will click the
| | 01:55 | Summary Statistics option.
| | 01:58 | Everything looks good. Click OK
and Excel creates my new worksheet.
| | 02:03 | I am going to format it a little bit,
just so it makes it easier to read.
| | 02:07 | So I will double-click the right
edge of the column A column header.
| | 02:11 | When I do, Excel expands the column
to display the widest value, Standard
| | 02:17 | Deviation in column A, and it also
displays the values for each of these
| | 02:22 | measures in column B. And I'll zoom in just
a little bit to make the data easier to read.
| | 02:29 | The Descriptive Statistics tool
calculates many different measures.
| | 02:33 | So, for example, you have the mean,
which is the average, standard error, which is
| | 02:37 | the Standard Deviation divided by
the square root of the sample size.
| | 02:41 | You have the media and the mode--
in this case there was no value that
| | 02:43 | occurred more than once--standard
deviation, variance--which, again, is the
| | 02:48 | square of the standard deviation.
| | 02:50 | Now these two values we are not going
to be using in this course, but I'm going
| | 02:54 | to explain them briefly.
| | 02:55 | Kurtosis represents how peaked a
curve is; in other words, if you look at
| | 03:00 | the data on a curve, how many values
occur toward the center versus how many
| | 03:05 | occur toward the side?
| | 03:07 | If you have a lot more values toward
the center, then kurtosis is higher.
| | 03:10 | In this case, the data set is
fairly well spread out or fairly even, so
| | 03:14 | kurtosis is quite low.
| | 03:16 | Skewness tells you whether the tail of
the curve is to the left of the average,
| | 03:19 | which is negative skew--which is what
we have here, but just barely--or to the
| | 03:23 | right, which is positive skew.
| | 03:25 | And finally, range is the spread between
the minimum value and the maximum value.
| | 03:31 | If you want to take a general look at
your data set before performing more
| | 03:35 | detailed analyses, running the
Descriptive Analysis tool provides a host of
| | 03:39 | commonly used analyses in a few steps.
| | 03:41 | Once you've looked over these values,
you can continue your analysis using
| | 03:45 | other methodologies.
| | Collapse this transcript |
| Summarizing data using z-scores| 00:00 | Statisticians divide data sets into
two general categories based on size.
| | 00:05 | Large data sets have 30 or more elements,
and small data sets have less than 30.
| | 00:10 | In this movie, I'll show you how to
analyze data sets using z-scores, which
| | 00:14 | are used with large data sets.
| | 00:17 | A z-score and also a T score--which we
will see in the next movie--represent the
| | 00:21 | number of standard deviations
a value is from the mean.
| | 00:24 | To look at some values we have here, I
have a data set over here on the left
| | 00:29 | that are daily sales for the month of
March. And I happen to know, because I
| | 00:35 | created the data, that the
mean, or average, of sales is 1,151.
| | 00:39 | So what I am going to do is use
Excel's ZTEST function to calculate the
| | 00:45 | probability that this mean of 1,200 is
greater than the population's actual mean
| | 00:51 | based on the mean from the sample.
| | 00:53 | So this sample here is not the entire
data set; it's a sample of it. And the mean
| | 00:59 | of the sample is 1151.
| | 01:01 | So the question is what is the
probability that the mean of 1200 is greater than
| | 01:06 | the population's actual
mean based on the data we have?
| | 01:09 | So in cell D5, I will create the
formula =ZTEST, and the data we are going to
| | 01:16 | use is in the Sales column
of the Daily Sales table.
| | 01:20 | So I'll move my mouse pointer over the
header of the Sales column, and when the
| | 01:26 | mouse pointer changes into a downward-
pointing black arrow, click, and Excel
| | 01:30 | fills in that table reference in the cell.
| | 01:34 | Now type a comma. And then x
is the mean that we want to test.
| | 01:39 | That value is in cell D8, so we will
click there. Right parenthesis. Formula
| | 01:45 | looks good. Type Return
and we get the value of .6479.
| | 01:50 | So that means that it's about 65%
likely that this mean of 1,200 is greater than
| | 01:56 | the mean of the data population.
| | 01:59 | Now how confident are we in our 64.7%, or 64.8?
| | 02:05 | Well, for that we can use the
z-scores for various confidence levels.
| | 02:09 | A confidence level indicates the
number of values that will occur within a
| | 02:14 | specific number of standard
deviations of the mean, plus or minus.
| | 02:18 | So here we can see that 50% of the
values in a data set will occur within .674
| | 02:27 | standard deviations of the mean, plus or minus.
| | 02:30 | And we are assuming that our data set
conforms to the normal data distribution.
| | 02:35 | That's the most common
distribution, so we use it.
| | 02:37 | But if you for any reason suspect
your data might not conform to the normal
| | 02:42 | distribution, then you won't be able to
use z-scores. But most of the time it will.
| | 02:46 | Within one standard deviation of the
mean plus or minus, you will find 68.3% of
| | 02:51 | your values; within 1.645 standard
deviations, 90%; and the most commonly used z
| | 02:58 | score of 1.96 standard deviations plus
or minus, you will find 95% of the values
| | 03:04 | in your data set within 1.96
standard deviations of the mean.
| | 03:09 | Calculating the number standard
deviations a value falls from the mean helps
| | 03:13 | you analyze your data in many different ways.
| | 03:15 | Z scores are used with large data sets,
and they help you decide how confident
| | 03:19 | you are that a value falls
within a certain range on the curve.
| | Collapse this transcript |
| Summarizing data using t-scores| 00:00 | In the previous movie, I discussed
z-scores, which you use to analyze data sets
| | 00:04 | with 30 or more elements.
| | 00:06 | In this movie, I'll discuss t-scores,
which you use with small data sets--those
| | 00:11 | with fewer than 30 elements.
| | 00:13 | Like a z-score, a t-score measures the
number of standard deviations from a mean
| | 00:17 | that a value is, but the formula takes
the smaller sample size into account.
| | 00:22 | In Excel, you calculate a target
t-score by using the TINVERSE function, which
| | 00:27 | takes two arguments of
probability and degrees of freedom.
| | 00:30 | So the first thing we need to do is
to calculate the probability, and the
| | 00:35 | probability we are calculating is
the probability that the difference
| | 00:39 | between two sample averages--in this case,
income and average order--is due to random chance.
| | 00:47 | To find that probability, we are
going to create a t-test formula.
| | 00:51 | So in cell E1 type
=TTEST(, and then in the Levels
| | 00:58 | table, which is here on the left, we
are going to use the Income column.
| | 01:03 | So I move my mouse pointer until it
is a downward-pointing black arrow over
| | 01:08 | the Income column header, and when I click,
Excel fills in that reference in the formula.
| | 01:14 | Now, I type a comma and do the same
thing for the Average Order column.
| | 01:20 | So I have my black arrow.
| | 01:21 | It's over the column header. Click there
and I have my Average Order. And type a comma.
| | 01:28 | The next argument is tails.
| | 01:30 | So in other words, if I were to put
in the value 2, I would be looking at
| | 01:35 | whether the value is either higher or lower--
| | 01:37 | in other words, if it's in the extremes
of the curve, either above or below the
| | 01:42 | average. In this case, I do want that
because I am checking to see whether the
| | 01:46 | two differ--and I don't care in which
direction--so I'll type a 2 for the number of tails.
| | 01:51 | If I only wanted to know whether the
value was greater than what we find, then I
| | 01:55 | would have typed 1. That would
look to the right edge of the curve.
| | 01:59 | Finally, we need to tell Excel what
type of t-test we want to perform.
| | 02:03 | There are three types, and the only
type we are going to deal with in this
| | 02:07 | course is where we are comparing the
means of two separate columns of data, and
| | 02:12 | that is test number 1, the most common.
| | 02:16 | If you go into the Excel Help file,
you'll see the descriptions of the other
| | 02:19 | two, but they're only used
in special circumstances.
| | 02:22 | So we have our formula, and I'll close
it with a right parenthesis. Everything
| | 02:26 | looks good, and we have our
probability of a very small number--that is six
| | 02:31 | zeroes, followed by a 6, percentage.
| | 02:34 | So in other words, it is a very low
probability that the differences between the
| | 02:39 | averages of income and
average order are due to chance.
| | 02:44 | With that data, we can calculate our t-score.
| | 02:48 | To calculate the t-score, we
use the TINVERSE function.
| | 02:51 | So we have =TINV(. The first argument asks
for our probability, and that is in cell E1.
| | 03:00 | We just calculated that. Comma.
| | 03:03 | Then we have our degrees of freedom.
| | 03:06 | Degrees of freedom is the number
of measurements in your data set -1.
| | 03:11 | The reason it's called degrees of
freedom is outside the scope of this course,
| | 03:15 | but just a way to remember it is
you have say 20 measures over here,
| | 03:19 | 20 elements, so your degrees of freedom are 19.
| | 03:23 | To enact that into the formula, type
COUNT( and then the
| | 03:29 | Average Order column, right parenthesis.
And because we're calculating degrees
| | 03:36 | of freedom, which is one less than the
number of elements, we take the count
| | 03:40 | and subtract one. Right parenthesis, and
press Return, and we get our value of 10.29.
| | 03:48 | We interpret that t-score using
the number of degrees of freedom and
| | 03:53 | our confidence level.
| | 03:54 | So in other words, if we want to be
95% certain, that the difference in the
| | 03:58 | income average as compared to the
difference of the average order is
| | 04:03 | statistically significant at the 95% level,
| | 04:07 | we need to go to our t lookup table,
which I have on this separate worksheet.
| | 04:12 | So again, our t-score is 10.29.
| | 04:15 | Go in to the other worksheet, we can
scroll down in our Degrees of Freedom
| | 04:19 | table, and with 19 degrees of freedom,
we see at the 95% level that we need to
| | 04:25 | have a t-score of 1.729 or greater.
| | 04:28 | Our t-score is over 10, so it's very
likely that the difference between the
| | 04:34 | averages of income and order
vary together significantly.
| | 04:39 | Calculating the number of standard
deviations a value falls from the mean
| | 04:42 | helps you analyze your
data in many different ways.
| | 04:44 | T-scores, which represent the number of
standard deviations from the mean in a
| | 04:48 | small data set, figures
prominently in these formulas.
| | 04:51 | We'll use it many more
times later on in this course.
| | Collapse this transcript |
| Calculating covariance| 00:00 | Business owners are always looking for
relationships between two sets of data--
| | 00:04 | whether sales discounts results in
higher revenue or whether the colors used in
| | 00:08 | an advertisement make more
customers come to your store.
| | 00:11 | In this movie, I'll show you how to
calculate one measure of change: covariance.
| | 00:15 | In the next movie, I'll show you
how to calculate a related measure
| | 00:19 | called correlation.
| | 00:20 | Covariance is the average of the
products of deviations for each data point in
| | 00:25 | a pair of data sets.
| | 00:27 | In other words, you find the average or
mean for two columns of data, calculate
| | 00:31 | how far each value is from its
column's average, and then multiply the
| | 00:35 | difference for each pair of values.
| | 00:37 | So let me show you what that
means within the worksheet here.
| | 00:41 | The question I'm interested in
answering is whether temperature affects the
| | 00:45 | number of customers that come in to my store.
| | 00:48 | So I have a column of temperature data
and customer data, and these are matched.
| | 00:52 | So in other words, on the day that the
temperature was 50 degrees; 24 customers came
| | 00:57 | into the store; and the day it
was 53, 20 came in; and so on.
| | 01:00 | So what I want to do is subtract 50 from
the average in this column, square that
| | 01:06 | difference, and then add it to the
same calculation of the customers.
| | 01:09 | I want to calculate the amount that 24
customers is below, or different from, the
| | 01:15 | average number of customers that I
have per day, and then add those two
| | 01:19 | differences together.
| | 01:20 | In cell C2, I'll type "=(A2-average(",
and then I'm going to be working with the
| | 01:31 | data in the Temperature column,
| | 01:33 | so I move my mouse
pointer over the column header.
| | 01:36 | When it turns into a downward-
pointing black arrow, I can click, and Excel
| | 01:41 | enters in the Temperature column.
| | 01:43 | And the reason it says Temperature
instead of the entire table reference is
| | 01:47 | because I'm referencing this column
of the table from another column or
| | 01:51 | cell within the table.
| | 01:53 | So it doesn't need to say the table
name; it just says, look in this table in
| | 01:56 | the Temperature column.
| | 01:58 | I type 2, right parenthesis, so that I
have A2, minus the average of that columns
| | 02:04 | values, multiplied by, left parenthesis,
B2, and we are creating the same formula
| | 02:10 | for the Customers column.
| | 02:12 | So it is B2-average, and
we're selecting the Customers column.
| | 02:18 | Now, I can press the right parenthesis
and then another right parenthesis to
| | 02:23 | close my function. Everything looks good.
| | 02:25 | Press Return and you see the values.
| | 02:28 | To give you an idea of what's going on
within this formula, I'm going to leave
| | 02:32 | cell C3 selected, and then I am going
to walk through each individual step of
| | 02:37 | this formula, so that you can
see how it's getting its value.
| | 02:40 | To do that, I will click on the
Formulas tab on the Ribbon and click the
| | 02:44 | Evaluate Formula button.
| | 02:47 | So, Excel displays the formula, and
it underlines the next element that is
| | 02:52 | going to be calculated.
| | 02:53 | So if I click the Evaluate button,
Excel fills in the value of 53 from cell A3;
| | 02:59 | that was the reference that was
underlined before. Click Evaluate again, and
| | 03:03 | now it's going to look at the
average of A23 through A26. That average is 64.64.
| | 03:09 | Now we're going to subtract this, too.
Evaluate and I'll just step through.
| | 03:12 | We have the result of that previous
calculation, and now we're going to look at
| | 03:17 | the value in B3, which is 20.
Find the average of the customer's value.
| | 03:21 | First, it identifies all
the cells in that column.
| | 03:24 | Then it finds the value 44.08.
It subtracts 44.08 from 20, and then it
| | 03:31 | multiplies -11.64 by -24.08.
| | 03:36 | Click Evaluate and you get
your difference of 280.2912.
| | 03:44 | So that's how those calculations are done.
| | 03:47 | So now, how do I use this list of differences?
| | 03:49 | Well, what I do is I find the average
of all those differences. And at the
| | 03:53 | bottom of my table I have a total row.
| | 03:56 | Right now, the total row is finding
the sum of all of these differences.
| | 04:00 | What I want it to do is find the average.
| | 04:03 | So to do that, I click the total row
cell that I am interested in, click the
| | 04:06 | down arrow, and click Average.
| | 04:09 | You see that I have a covariance of 115.0288.
| | 04:13 | There is a quicker way to do this, and
that is by creating a covariance formula.
| | 04:18 | So in cell F1, I am going to click =COVAR.
| | 04:23 | C-O-V-A-R, that's the name of the function
you use, and then it's array1 and array2.
| | 04:28 | Array1 is Temperature, and a
comma, and then Customers ByTemp--
| | 04:34 | that's the name of the table--and then
the Customers column. Right parenthesis.
| | 04:38 | Everything looks good. Press Return
and I get my same value of 115.0288.
| | 04:44 | A positive covariance means the
variable seem to be positively related,
| | 04:48 | meaning that they increase
or decrease at the same time.
| | 04:51 | If you multiply two negative values
together, you get a positive result.
| | 04:55 | So the more times both deviations are
negative or both are positive, the larger
| | 04:59 | the covariance will be.
| | Collapse this transcript |
| Calculating correlation| 00:00 | In the previous movie, I showed you one
way to measure how two variables tend to
| | 00:04 | change together by using covariance.
| | 00:06 | In this movie, I'll show you
another very useful way to measure how two
| | 00:10 | variables tend to change
together; that is correlation.
| | 00:13 | correlation is related to
covariance, but it uses a different formula
| | 00:17 | to generate its value.
| | 00:18 | The idea is that a correlation
coefficient can have a value from negative one
| | 00:22 | to one, and this worksheet that I have
here displays those three examples.
| | 00:27 | And these are all perfect examples,
meaning that they'd probably never occur in
| | 00:32 | real life, but they give the example of what
a correlation of 1, -1, and 0 would look like.
| | 00:37 | Correlation of one is what's
called positive correlation, or perfect
| | 00:41 | positive correlation.
| | 00:42 | That means that whenever one value
increases, the other value increases at
| | 00:46 | exactly the same rate.
| | 00:48 | So the data here, we have 1, 2, 3, 4, 5,
and that is the data I used to plot
| | 00:53 | the horizontal axis. And then the vertical
axis also contains the data 1, 2, 3, 4, 5,
| | 01:00 | so it increases at the same rate.
| | 01:03 | Correlation of negative one means that
there's a negative correlation, where one
| | 01:07 | value goes up, the other goes down, and
again, in exactly the same proportion.
| | 01:12 | So we have 1, 2, 3, 4, 5, and Column 2:
| | 01:15 | 5, 4, 3, 2, 1. And you can see how that
correlation is represented in the body of the chart.
| | 01:21 | A correlation of zero means that there
is absolutely no relationship between the
| | 01:25 | two data columns whatsoever.
| | 01:28 | So, for example, if you look at the
first pair, you see that I have 1 and 1,
| | 01:32 | and that's started here.
| | 01:34 | The next row is 2 and 1, so that is here,
3 and 1, and so on. And I have every
| | 01:42 | possible pair of values.
| | 01:43 | I have a 1 and a 1, a 1 and a 2, and 1 and a
3, and I have similar pairs for 2 and 3
| | 01:48 | as my starting values.
| | 01:50 | So that means, as you can see in the
plot, that you cannot use the value on the
| | 01:54 | horizontal axis to predict anything
about the data, or the value, that will occur
| | 01:59 | on the vertical axis.
| | 02:01 | It's completely spread out, so that means
that this example has a correlation of zero.
| | 02:05 | Now let's take a look at a
business case to show you how you can use
| | 02:08 | correlation in your own work.
| | 02:10 | To do that, we'll switch to
the Correlation worksheet.
| | 02:14 | The data in this Excel table shows
how the number of customers varies by
| | 02:19 | day, and the days measure is the number of
days after your company has sent out a mailer.
| | 02:25 | So, in other words, you want to see how
effective the mailer was, but you also are
| | 02:29 | interested in how much that mailer's
effect trailed off over time, and that is
| | 02:33 | what this table measures.
| | 02:35 | If you want to create a correlation
formula for this data, you can do so
| | 02:39 | by clicking equal--and I am in cell E1--so
it's equal and then C-O-R-R-E-L,
| | 02:44 | that's the correlation function. Left
parenthesis. And then you just identify
| | 02:48 | the data you want to use.
| | 02:50 | So the first column is days, and I
selected that column by moving my mouse
| | 02:56 | pointer over the column header. And when
the mouse pointer changed to a downward-
| | 03:00 | pointing black arrow, I clicked.
| | 03:02 | When I did, Excel selected that table's column.
| | 03:06 | Type a comma, and I'll
select the Customers column.
| | 03:10 | So I have Days and Customers, those are
my two arrays. Right parenthesis. Press
| | 03:15 | Return and I get my correlation of -0.9,
and a small decimal value after that.
| | 03:21 | As a general rule, correlation
coefficients with absolute values--
| | 03:25 | that is, the distance from zero--a
three or less are considered to be weak,
| | 03:29 | 0.3 to 0.7 are considered moderate,
and 0.7 or higher are considered strong.
| | 03:35 | So in this case, a correlation
coefficient of -0.9 means that there's a very
| | 03:40 | strong negative correlation between
the number of days after you mailed a
| | 03:44 | brochure to your customers and the number of
customers who come in because of that brochure.
| | 03:49 | Determining whether two sets of
values are correlated is in one sense of
| | 03:52 | mathematical question,
| | 03:54 | but you can't take a correlation
coefficient of 0.95 to mean that some invisible
| | 03:58 | factor makes two unrelated
data sets change at the same rate.
| | 04:02 | You must use your common sense and
business sense to determine whether the
| | 04:06 | correlation result is true, and--most
importantly--discover why or why not.
| | Collapse this transcript |
| Calculating correlations for more than two data sets| 00:00 | In the previous movie, I have showed
you how to calculate covariance and
| | 00:03 | correlation coefficients for two data sets.
| | 00:06 | Because the functions limit you to
comparing two data sets at a time, comparing
| | 00:09 | every column of data in the collection
to every other column can take a long
| | 00:13 | time, particularly for
tables with five or more columns.
| | 00:16 | If you have multiple data sets, you can
use the Analysis ToolPak's Correlation tool
| | 00:21 | to calculate the correlations between
every pair of columns much more quickly
| | 00:25 | than you could do by hand.
| | 00:26 | So, for example, let's say that your
business captured information about your
| | 00:30 | customers with their income level,
their education level--and in this case the
| | 00:35 | education level means the number of years
of school that they attended--and also
| | 00:39 | their satisfaction with your
company, and their order total.
| | 00:42 | So our goal in this movie is to look for
correlations between each pair of columns,
| | 00:47 | in other words, between Income and
Education, Income and Satisfaction, Income
| | 00:51 | and Order Total, and so on.
| | 00:53 | To begin, we will go to the Data tab, and on
the right end, we will click Data Analysis.
| | 01:00 | Data Analysis is part of the Analysis ToolPak.
| | 01:03 | So if you don't see Data Analysis on
the Ribbon then you need to install
| | 01:07 | the Analysis ToolPak.
| | 01:09 | I showed you how to do that in
a movie early on in this course.
| | 01:12 | So we will click Data Analysis.
| | 01:14 | Then in the Data Analysis dialog box,
click Correlation, and click OK.
| | 01:21 | Now we can select or identify our Input Range.
| | 01:25 | The Input Range is cell A1 through D50.
| | 01:30 | I'm typing in--rather than
selecting--the table columns, because I want to
| | 01:34 | include my labels in this case.
| | 01:37 | So I will click Labels in First Row.
| | 01:39 | That will make a difference in the output
that you will see in just a few seconds.
| | 01:43 | I will select my Output Range.
| | 01:45 | I will put my output here on the same worksheet.
| | 01:51 | So F1, identifying the top left
corner of the Output Range, and click OK.
| | 01:57 | When I do, Excel displays the
results of the correlation analysis.
| | 02:01 | I will just open up the columns, so
that you can see the entire values.
| | 02:09 | I do that by double-clicking
the right edge of a column header.
| | 02:12 | That expands the column to
display all of the column's contents.
| | 02:16 | So now I can look at the
correlations between each set of columns.
| | 02:20 | Income is correlated to Income perfectly
because it's the same set of values.
| | 02:24 | Double-click here so we
can see the entire names:
| | 02:27 | Education to Education, Satisfaction to
Satisfaction, Order Total to Order Total.
| | 02:32 | Now we can start looking to see if there
are any strong correlations between any
| | 02:35 | two pairs of columns.
| | 02:37 | So, for example, Income and
Education are correlated weakly at a very low
| | 02:42 | value, .07--not very significant.
| | 02:45 | Same thing for Income and Satisfaction.
| | 02:47 | Income and Order Total are
correlated at a .25 level.
| | 02:52 | But again, anything below .3 is
considered to be a weak correlation, between .3
| | 02:57 | and .7 is moderate, and anything over
.7 is seen to be a strong correlation.
| | 03:04 | In this case, none of our values are
particularly high or particularly low.
| | 03:08 | That is far from 0. So, based on this data, we can
determine that there is no strong correlation
| | 03:15 | between income and education, income
and customer satisfaction, income and
| | 03:19 | order total, and so on.
| | 03:21 | The Analysis ToolPak's Correlation tool
lets you find the correlations between
| | 03:25 | every pair of columns in an Excel table.
| | 03:27 | Even if you just want to find the
correlation of values in two or three pairs
| | 03:30 | of columns, using the Correlation tool helps
you do so more quickly than you could by hand.
| | 03:35 | Be sure to check the coefficients for
every pair of columns: you might find a
| | 03:39 | correlation you didn't expect.
| | Collapse this transcript |
| Calculating and interpreting chi-square values| 00:00 | One important question you can ask
when you analyze your data is whether
| | 00:03 | the observed frequency distribution of a
data set differs from a theoretical distribution.
| | 00:08 | In this example, we're going to be
looking at whether the state a customer is
| | 00:11 | from affects whether they
come to your store or not.
| | 00:14 | You can find these values by
performing what's called a chi-squared test.
| | 00:18 | The chi-squared test is a non-
parametric test, which means that the formulas
| | 00:22 | don't assume the data you're testing
conforms to a known data distribution, such
| | 00:26 | as the normal, or bell curve, distribution.
| | 00:29 | Let's say we want to examine the
number of customers that came in from three
| | 00:32 | States--Connecticut, New Jersey, and
New York--with a goal discovering whether
| | 00:37 | the number of customers that come in
depends upon the state that they're from.
| | 00:42 | You can analyze these values
using the chi-square distribution.
| | 00:45 | To do that, you use the
CHIDIST function in Excel.
| | 00:50 | The data that you need to perform a chi-
squared analysis is the number of actual
| | 00:55 | customers that came in from the three
States--Connecticut, New York, and New
| | 00:58 | Jersey--and the expected number.
| | 01:01 | The expected number is the average of the total.
| | 01:04 | So in other words here,
143+124 is 267+93 is 360.
| | 01:12 | If that were divided evenly, then each
state would have provided 120 customers.
| | 01:17 | So we are going to check, using the chi-
squared test, whether these numbers--143,
| | 01:22 | 124, and 93--vary
significantly from what would be expected.
| | 01:26 | Now, we can create a formula in
cell F1, and that is =CHITEST(.
| | 01:36 | We have our OrderCount table's
Actual column, comma, and our OrderCount
| | 01:41 | Expected column. Type a right parenthesis.
| | 01:47 | We see a chi-squared value of .00495.
| | 01:51 | You interpret the result based
on your desired confidence level.
| | 01:55 | The most commonly used confidence level
is .05, which means that you want to be
| | 01:59 | 95% sure that you can reject the
statement that a customer's State does not affect
| | 02:05 | the number of customers that arrive.
| | 02:07 | Note that the result is much less than .05,
| | 02:11 | so that means that we can reject the
statement that State has no effect.
| | 02:15 | Now let's say that we change the values
a little bit and say we got 125 customers
| | 02:22 | from Connecticut, 120 customers from New
York, and 115 customers from New Jersey.
| | 02:30 | In that case, the chi-squared
value goes all the way up to .8.
| | 02:34 | That means that we reach
the opposite conclusion.
| | 02:37 | Because the value is greater than .05,
we can reject the statement that a
| | 02:41 | customer's State doesn't affect the
number of customers that come in.
| | 02:45 | In other words, in this data set, it
doesn't appear to matter which State a
| | 02:48 | customer comes from;
| | 02:50 | we get customers on an about
equal level from all three States.
| | 02:54 | Finally, if we were to change the
actual values to the expected values of 120,
| | 02:58 | 120, and 120, we would see that the chi-
squared value is one, which indicates that
| | 03:05 | everything is occurring exactly as expected.
| | 03:08 | The chi-squared distribution lets you
determine whether the frequencies in your
| | 03:11 | data set differ from the expected
counts at a statistically significant level.
| | 03:16 | This test is particularly versatile
because, unlike many of the techniques I
| | 03:20 | will show you later in this course, you
don't need to make any assumptions about
| | 03:23 | your data's distribution.
| | Collapse this transcript |
|
|
5. Using Probability DistributionsVisualizing the normal data distribution| 00:00 | Probably the most familiar graph in any
statistics course is that of the normal,
| | 00:05 | or bell curve, data distribution.
| | 00:07 | In this movie, I will describe the
components of the normal curve and then show
| | 00:11 | you how to visualize it in Excel
by creating an X/Y scatter chart.
| | 00:15 | In this case, we are analyzing data from
the Two Trees Olive Oil Company, and we
| | 00:20 | are looking at sales for their retail shop.
| | 00:22 | The average sales per day is 2,000,
and the standard deviation is 500.
| | 00:28 | Earlier on, I created this chart,
which I've now rendered as a graphic,
| | 00:33 | and you can see that most of the values,
which are the daily sales will be around 2,000.
| | 00:38 | As you get closer to 2,500, you will see
that the area under the curve increases.
| | 00:44 | So when you're looking within one
standard deviation, which is 500, from the
| | 00:48 | mean, you'll see that the entire area of the
curve, from 2,500 to 1,500, is represented there.
| | 00:55 | So that means at about 60% of your sales
will fall within one standard deviation,
| | 01:00 | or between 1,500 and 2,500 per day.
| | 01:03 | Days with sales of $3,500 are rare,
as are days with sales of $5,00.
| | 01:09 | You see how far they are
out in the tails of the curve.
| | 01:12 | Now, I am going to delete this graphic
and show you how to create the same thing
| | 01:17 | using an Excel chart.
| | 01:19 | So I will press the Delete key and away it goes.
| | 01:22 | Two things you need to know are the mean
and the standard deviation, which we have here.
| | 01:27 | Now you use an Excel table with three
columns to make the calculations that you need.
| | 01:31 | The first thing you're going to do is
calculate the value that is 4 standard
| | 01:36 | deviations from the mean.
| | 01:38 | In other words, it's going to be the
mean plus four times the standard deviation.
| | 01:44 | The answer will be 4,000,
but let's see how we get there.
| | 01:46 | First, we create the formula, =A2, which
is the number standard deviations, and
| | 01:53 | then we multiply that value by H1.
| | 01:56 | Now we are going to be copying this
formula down to the rest of the rows in the table,
| | 02:02 | so we want to make H1 an absolute
reference so it doesn't change.
| | 02:05 | So I will press F4.
| | 02:08 | And now we need to add the
mean, which is in cell F1.
| | 02:11 | So that's +F1, and I press F4 again to make
an absolute reference so it doesn't change.
| | 02:20 | When I press Enter, Excel copies the
formula all the way down through the table.
| | 02:25 | If I scroll down using my scroll wheel,
you'll see that we get down to what is
| | 02:30 | essentially 0, all the way
down at -4 standard deviations.
| | 02:34 | It's represented that way as
-5E-12, which is a very small number.
| | 02:40 | It's 12 zeros and then a 5.
| | 02:42 | It's represented that way
because of rounding errors in Excel.
| | 02:45 | The answer is effectively 0.
| | 02:46 | I am going to scroll back up
to the top using my scroll wheel.
| | 02:50 | Now we need to determine the
probability that a specific value would occur.
| | 02:55 | So in other words, what is the
probability, given a mean of 2,000 and a standard
| | 03:00 | deviation of 500, that we
will have daily sales of 4,000?
| | 03:02 | Well, to do that, we use this formula.
| | 03:06 | It's =NORMDIST, which stands for normal
distribution, a left parenthesis, and then B2.
| | 03:16 | We are looking for the X, in other words,
the value that we are going to compare
| | 03:20 | to the distribution.
| | 03:21 | That is value in B2,
in this case 4,000.
| | 03:25 | Then we have the mean, which is
2,000, and that is in cell F1.
| | 03:30 | We don't want that reference changing,
| | 03:32 | so I'll press F4, then a comma, and
the standard deviation--that's in H1.
| | 03:38 | Again, I don't want that changing, so
I will press F4 to make it an absolute
| | 03:42 | reference and type a comma.
| | 03:44 | Now we need to decide whether we want to
make the calculation cumulative or not.
| | 03:49 | Cumulative, if we set this argument to
true, would ask the question, what is
| | 03:53 | the likelihood that the value
is greater than the number in B2?
| | 03:59 | So, any value to the right
in the curve of cell B2.
| | 04:03 | If we make the value false, then what
is the probability that the exact value
| | 04:09 | of 4,000 will occur?
| | 04:11 | That's what we want.
| | 04:12 | So I will type in "false".
| | 04:13 | Type a right parenthesis to finish the formula,
press Return, and there we have our values.
| | 04:20 | So now we need to select the data that
we want to put into the chart, and those
| | 04:25 | are the Value and Probability columns.
| | 04:28 | So I will select those two table
columns, and then on the Insert tab, click
| | 04:35 | Scatter. And we are creating a
scatter chart, not a line chart,
| | 04:39 | but we are selecting the subtype of
scatter chart called Scatter with Smooth
| | 04:43 | Lines. That compares pairs of values on
an X/Y grid, and it draws lines between
| | 04:47 | them to smooth them out.
| | 04:49 | So I will click that button, and
you see that we have our chart.
| | 04:54 | The left axis is the individual probability.
| | 04:57 | In other words, what is the
probability of the exact value, 2,000, occurring in
| | 05:02 | the stated distribution?
| | 05:03 | The answer is pretty low, as you can
see from the values in the vertical axis.
| | 05:09 | But what you can see is that you
have 2,000, 2,001, 2,002, and so on,
| | 05:13 | so the probability that the daily sales
are somewhere around 2,000 is quite high
| | 05:19 | when you add all of the probabilities together.
| | 05:21 | Now, let's see what happens when I
change the mean or the standard deviation.
| | 05:25 | I will start by changing the standard deviation.
| | 05:27 | Let's say that we have a much narrower
curve with a standard deviation of 250.
| | 05:34 | When Excel updates the formula,
you'll see that the values between 1,500 and
| | 05:38 | 2,500 are much, much more likely, where values
out toward the tails are increasingly unlikely.
| | 05:44 | If I create a chart with a mean of 5,000
and the standard deviation of 2,000, you
| | 05:51 | will see that the curve is broader.
| | 05:56 | You might encounter the term "standard
normal distribution" or "standard normal curve."
| | 06:01 | The standard normal curve refers to a
normal distribution with a mean of 0 and a
| | 06:06 | standard deviation of 1.
| | 06:08 | The normal curve has many useful
properties you can use to summarize your data.
| | 06:12 | Once you know your dataset's mean and
standard deviation, you can graph the
| | 06:16 | curve to get a better feel for
how your data is distributed.
| | Collapse this transcript |
| Generating random values| 00:00 | Most of the data you use in Excel comes
from measurements you've taken from your
| | 00:04 | company's operations:
| | 00:05 | sales, salaries, delivery times, and so on.
| | 00:08 | However, suppose you want to generate
random values to test new assumptions
| | 00:12 | about your business.
| | 00:14 | For example, if you work for Two Trees,
suppose that you have found that your
| | 00:18 | daily sales averaged $2,000, and that
you have a standard deviation of $500.
| | 00:24 | You can use the techniques I'll show you
in this movie to generate random values
| | 00:28 | under that normal data distribution.
| | 00:31 | I'm going to start up though, by showing
you how to generate random numbers using
| | 00:35 | the RAND and RANDBETWEEN functions.
| | 00:38 | RAND generates a number somewhere
between zero and one, and the formula is
| | 00:44 | very straightforward.
| | 00:45 | You type in =RAND and then
a pair of empty parentheses.
| | 00:51 | The only reason you have to have the
parentheses there is because you need to
| | 00:55 | have them so Excel knows it's a formula.
| | 00:58 | When you press Enter, Excel generates a
random decimal value between zero and one.
| | 01:03 | Now RAND, and this is also a true of
RANDBETWEEN, which we'll get to in a
| | 01:07 | moment, is what's called a volatile function.
| | 01:10 | Every time Excel recalculates your
workbook, such as when you edit a value,
| | 01:14 | enter a new value into a cell, and so on, it
recalculates the value of the RAND formula.
| | 01:20 | So if I click on the Formulas tab and
go to the Calculation group and click
| | 01:26 | Calculate Now, you'll see that Excel
recalculates the worksheet and displays a new RAND value.
| | 01:31 | If I do it again, we'll get a
new value, and finally, a third time.
| | 01:35 | The shortcut key for Calculate Now is F9.
| | 01:38 | So if you'd rather, you can press
that and force Excel to recalculate
| | 01:42 | your entire worksheet.
| | 01:43 | The next function is called
RANDBETWEEN, and this is new in Excel 2007.
| | 01:48 | The RANDBETWEEN function gives
you a discrete data distribution--
| | 01:52 | that is, whole numbers--between two
values that you input yourself.
| | 01:56 | So, for example, if you type
=RANDBETWEEN(, then you
| | 02:03 | can type the bottom and the top values.
| | 02:05 | So let's say that we want a
random value between one and 10.
| | 02:09 | It's just 1, 10, right parenthesis,
Return, and we get the value of 7.
| | 02:16 | And you'll notice that Excel
updated the RAND function to a new value.
| | 02:21 | Now let's talk about generating random
values under a normal data distribution.
| | 02:26 | This data distribution, which I have
displayed here, has an average or mean of
| | 02:31 | 2,000 and a standard deviation of 500.
| | 02:34 | So I can enter those values, the mean
of 2,000 and the standard deviation of
| | 02:41 | 500 in cells B7 and B8.
| | 02:46 | Now to generate random values using
those parameters, I can type equal, and then
| | 02:51 | we use the norm inverse, which is the
normal cumulative distribution inverted--
| | 02:56 | we use that to generate probabilities--
and then the probability, and this is
| | 03:01 | where we use the RAND function again.
| | 03:03 | Probability refers to a
number between zero and one.
| | 03:06 | So we use RAND, and the parentheses,
left and right, with nothing in between them,
| | 03:12 | to generate the random number.
Type our mean, which is in cell B7, and the
| | 03:19 | standard deviation, which is
in cell B8. So it's B7, B8,
| | 03:24 | and then a right parentheses. And when I
press Return, Excel generates a value of
| | 03:30 | 1,794.85, and that value occurs
here in the thicker part of the curve.
| | 03:38 | If I click Calculate Now again to
recalculate the worksheet, Excel generates
| | 03:42 | 1,542, which is about here, again
under the thicker part of the curve.
| | 03:48 | Calculate Now, 1,487. We're getting smaller,
but we're still in the thick part of the curve.
| | 03:53 | And I will just do one or two more.
| | 03:54 | 1,100, that's getting much further down.
And one last one, 2,200, which is again
| | 04:01 | within one standard deviation,
and about here on the curve.
| | 04:04 | So you can see that values beyond a couple
of standard deviations are extremely rare.
| | 04:10 | It's possible but somewhat cumbersome to
model business processes in Excel using
| | 04:14 | just worksheet data.
| | 04:16 | If you want to create more useful models,
you should learn how to implement them
| | 04:19 | using Visual Basic for Applications, or VBA.
| | 04:22 | You can learn more about VBA by
visiting in my course "Excel: VBA in Depth," which
| | 04:27 | is available in the Online Training library.
| | Collapse this transcript |
| Calculating binomial probabilities| 00:00 | When you calculate probabilities in
Excel, you need to know exactly what type
| | 00:03 | of results to expect.
| | 00:05 | In the previous movie, I showed you
how to calculate random values under the
| | 00:09 | normal data distribution.
| | 00:11 | In this movie, I'm going to show
you how to analyze trials that have
| | 00:15 | two possible outcomes: true or false.
| | 00:18 | In other words, a customer either will
or will not sign up for your mailing list.
| | 00:22 | In this movie, I'll show you how to
calculate the probability for those outcomes,
| | 00:26 | which are described by
the binomial distribution.
| | 00:30 | To calculate the binomial
probability, you need to know four things:
| | 00:34 | the probability of success, which we
have here in cell B1, the number of trials--
| | 00:40 | the number of times you attempt the
task--in cell B2, the number of successes,
| | 00:45 | which I have in this table in the left
column--so I have 0 through 20--and then
| | 00:50 | you need to know whether you want the
specific probability of an individual
| | 00:55 | outcome happening, or the
probability of an outcome or less occurring.
| | 01:00 | So I'll show you how that works
once we get into the formulas.
| | 01:03 | So I'll click cell B5, and what I want
to know is the percentage of a particular
| | 01:09 | value or more occurring.
| | 01:12 | To create the formula, I type equal in cell
B5, and then type binomialdist or "binomdist";
| | 01:21 | this is short for the
Binomial Distribution function.
| | 01:24 | Type a left parenthesis, and then we
need to know the number of successes
| | 01:27 | we're interested in.
| | 01:29 | That is in cell A5,
so I'll type the reference for A5.
| | 01:33 | It's outlined in blue to
indicate I got it right.
| | 01:35 | Then the number of trials are in cell B2,
and I don't want that cell reference
| | 01:40 | changing when Excel copies this formula
down to the remaining rows of the table,
| | 01:45 | so I will press F4 to make it an
absolute reference. Comma. And now we need to
| | 01:50 | know the probability of an individual success.
| | 01:53 | That is 35% in cell B1.
| | 01:55 | So I'll type that reference and again
make it an absolute reference. Type a comma.
| | 02:01 | Now, I need to decide if I want it
to be cumulative or what's called the
| | 02:04 | Probability Density function or the point
probability of an individual value occurring.
| | 02:10 | In this case, I want the individual value,
| | 02:12 | so I will type false. Type a right
parenthesis to close the formula. Everything
| | 02:17 | looks right. Press Return and
there we have our percentages.
| | 02:20 | So as you can see, the percentage of 0
successes is quite low, the probability
| | 02:27 | of 1 is also below 1%, as is 2, and
after that, the probability of a certain
| | 02:32 | number of successes starts increasing
all the way to 7, which is our most common
| | 02:37 | or expected value. Because we have 20
trials and a 35% probability of success,
| | 02:43 | when you multiply those two
values together, you get 7,
| | 02:46 | so it's no surprise that 7 is the
most common value. And the percentages
| | 02:50 | gradually decrease as you go
past 7 to 8, through 10, and so on.
| | 02:55 | Now, let's see what the Cumulative
Probability function would look like.
| | 02:59 | I will edit the formula in cell
B5, and I will make two changes:
| | 03:05 | first, I will change cumulative from
false to true, and then I'm going to
| | 03:10 | subtract this value from 1.
| | 03:12 | In so doing, I will calculate the
probability that a given number or more of
| | 03:17 | successes will occur.
| | 03:18 | So if I type 1 minus binomial
distribution of the arguments, everything looks good.
| | 03:24 | Press Enter and I get my results.
| | 03:26 | Now, due to a rounding error, this
should be 100%, but it's one of the vagaries
| | 03:31 | of floating-point
arithmetic inside of a computer.
| | 03:34 | You can see that the probability of
one or more is extremely high--well over
| | 03:39 | 99%--two or more is well over 98%, and so on.
| | 03:44 | So how do you use this
information once you've generated it?
| | 03:47 | Well, if one of your employees
consistently gets more than 7 of your customers
| | 03:52 | to sign up per hour on average for your
mailing list, then you can ask them what
| | 03:55 | they're doing, and use their
response as a best practice, and give that
| | 03:59 | information to your other employees.
| | Collapse this transcript |
| Calculating probabilities using the exponential distribution| 00:00 | In the previous movie, I showed you
how to calculate probabilities in the
| | 00:03 | binomial distribution, where
processes have two potential outcomes:
| | 00:07 | either true or false,
failure/success, and so on.
| | 00:11 | In this movie, I am going to talk about
the exponential data distribution, which
| | 00:15 | describes the amount of time that will
elapse between two independent events
| | 00:19 | that occur according to
a known average interval.
| | 00:22 | For example, your company might
receive an order via your web site every
| | 00:26 | five minutes on average.
| | 00:27 | You can use that information to
discover how likely it is that an order will
| | 00:32 | occur after a given time.
| | 00:34 | To do that, you use the
Exponential Distribution function, and I will
| | 00:38 | create that in cell A4.
| | 00:41 | So to find the probability of a
value occurring under the exponential
| | 00:46 | distribution, we type equal and use the
Exponential Distribution function, which
| | 00:50 | is EXPONDIST, and then a left parenthesis.
| | 00:55 | Now, we need to insert three values.
| | 00:57 | The first one is X, and X is the
number of minutes that we want to examine.
| | 01:02 | In this case, that value is in cell B4.
| | 01:06 | Type comma. And now we have lambda.
| | 01:09 | Lambda in this formula is the inverse
of the average, which we find in cell E3.
| | 01:16 | So we type "1/E3", and we want the E3
cell reference to be an absolute reference
| | 01:25 | that doesn't change,
| | 01:26 | so we'll press F4, and that
makes the reference absolute.
| | 01:30 | Type a comma. And now we need to know whether
we want a cumulative value or a point value--
| | 01:35 | in other words, the probability of
say six minutes or less occurring, or the
| | 01:40 | probability of exactly six minutes occurring.
| | 01:43 | In this case, we want six minutes or
less, so we can type "true". Type a
| | 01:49 | right parenthesis. Everything looks
right. Press Return and Excel gives
| | 01:54 | us our probabilities.
| | 01:56 | So the probability of one minute or less is about
18%, the probability of two or less is 32%, and so on.
| | 02:04 | So now how do we use this information,
and why in the world did I structure my
| | 02:08 | table this way, because normally you
would expect to see minutes on the left, and
| | 02:12 | the probabilities on the right.
| | 02:13 | Well, I did that, because what I want
to do is use the exponential distribution
| | 02:18 | to create a series of random arrival times.
| | 02:20 | In other words, I want to use random
number functions to determine the arrival
| | 02:25 | times for the next 10 orders.
| | 02:28 | To do that, you use a formula that
includes VLOOKUP, which is a lookup formula,
| | 02:35 | and I'll walk you through it as I go along.
| | 02:37 | We have an equal, and
then we have to use IFERROR.
| | 02:43 | The IFERROR function returns a value that
you specify if a formula returns an error.
| | 02:48 | The reason we have to use it in this
case is that if the VLOOKUP formula tries
| | 02:53 | to look up a value of less than 0.18,
then it will generate an error. And we
| | 02:58 | want data, not errors, in our worksheet,
| | 03:00 | so I need to tell Excel what to
do by using the IFERROR Function.
| | 03:04 | So I have iferror to start. Type a
left parenthesis, and now we can create
| | 03:09 | the VLOOKUP formula.
| | 03:10 | So, VLOOKUP, which is a vertical Lookup,
and it will use a value in the first
| | 03:16 | column of our data table to find a value
in the second column of the data table.
| | 03:21 | So we have (VLOOKUP(RAND and open
and close parentheses, which is going to
| | 03:27 | generate a random number, comma,
TimeTable, which is the name of the table that
| | 03:34 | contains the probability and the
minutes, column 2, which is where we want the
| | 03:41 | value to be returned from.
| | 03:42 | In other words, Excel using the
VLOOKUP formula always looks up the values in
| | 03:46 | the left column, and it returns
values in the column that you specify--
| | 03:51 | in this case, we want it to be the
second column. And then we need to tell Excel
| | 03:55 | whether we want an exact
match or approximate match.
| | 03:58 | Because we are looking at values that
are either less than or greater than other
| | 04:01 | values, we need to use an
approximate match, and that is true.
| | 04:07 | Now, right parenthesis. But I'm not
quite done because I need to tell Excel
| | 04:10 | what to do if it encounters an error--
| | 04:13 | in other words, if the RAND function
within the VLOOKUP formula generates the
| | 04:17 | value of less than 0.18.
| | 04:18 | Well, if that's the case, we want the value
to be 1 because that is the number of minutes.
| | 04:23 | So type a comma, 1, a right parenthesis to
close the formula. Press Enter and we get 1.
| | 04:30 | Now, I'm going to copy this formula
down, so that it covers those values, and
| | 04:36 | you see that we have our
arrival time: 7, 7, 1, 2, 4, and so on.
| | 04:41 | If I recalculate the worksheet by going
to Formulas and clicking Calculate Now,
| | 04:46 | we get a different set of arrival times.
| | 04:49 | Another way you can analyze your data
using the exponential distribution is
| | 04:53 | to see what is the probability of an order
happening within the next say four to seven minutes?
| | 04:59 | So we have our probabilities here, and
we know the probabilities for four minutes
| | 05:03 | and the probability for seven minutes or less.
| | 05:06 | So that means we can subtract the
probability of four minutes from the
| | 05:09 | probability of seven minutes to find out
the probability that an order will occur
| | 05:13 | within four to seven minutes.
| | 05:15 | So to do that, type equal. And then the
first value we want is the probability of
| | 05:20 | an order occurring in seven minutes or less.
| | 05:23 | That's in A10. And we subtract the
probability of an order in four minutes or
| | 05:30 | less. That is in A7.
| | 05:31 | You can see the two cells
highlighted here in blue and green.
| | 05:36 | When we press Return, Excel give
us a probability of about 20%.
| | 05:41 | The exponential distribution helps you
model the times between events, such as
| | 05:45 | customers placing orders on your web site.
| | 05:48 | You should track these
important statistics consistently.
| | 05:51 | As your business grows, you'll
probably find that the average time between
| | 05:54 | orders decreases, and you'll
need more help to process them.
| | Collapse this transcript |
| Calculating values using the Poisson distribution| 00:00 | Most companies have web sites, and most of
them allow their customers to place orders.
| | 00:05 | You can use the exponential data
distribution to describe the amount of
| | 00:07 | time between two orders.
| | 00:09 | In this movie, I'll show you how to
use the Poisson distribution, which was
| | 00:13 | discovered by a gentleman name Poisson,
to describe the number of events that
| | 00:17 | will occur within a given
time, such as orders per hour.
| | 00:20 | So in this example, I'm assuming
that my web site receives an average of
| | 00:25 | six orders per hour.
| | 00:27 | So my goal is to create a table of the
probability that I will receive 0 orders,
| | 00:32 | 1 order, 2 orders, and so on every hour.
| | 00:35 | I can then use those values to create a
timeline showing me how many orders that
| | 00:41 | I will receive on a random basis
just over the course of 10 hours.
| | 00:45 | To calculate the Poisson distribution
of 0--in other words the probability of
| | 00:49 | receiving 0 orders per hour--I
create this formula, clicking in cell A2:
| | 00:56 | =P-O-I-S-S-O-N, Poisson, and x.
| | 00:59 | x is the number that I want
to examine, and that is in B2.
| | 01:05 | The mean is the average, and that is
in cell E1. And I want to make this
| | 01:11 | reference absolute, so it doesn't
change when Excel copies the formula down to
| | 01:14 | the rest of the rows in the table.
| | 01:16 | So I will press F4 to make that an
absolute reference. Type a comma. And then
| | 01:21 | the question is whether I want
this to be cumulative or not.
| | 01:24 | In other words, do I want to know the
probability of getting exactly 0, or do I
| | 01:28 | want to know the
probability of getting 0 or less?
| | 01:32 | In this case, I want to know the
probability of 0 or less, so cumulative is true.
| | 01:39 | Close with a right parenthesis, press
Return, and Excel fills in my table.
| | 01:45 | So you see that the probability of
getting 0 orders is quite low. Then I get up
| | 01:49 | to 6% of the time I'll get 4 orders,
15% of the time I'll get 5 or less, 28% of
| | 01:55 | the time 6 or less, and so on.
| | 01:57 | So now the question is, how do I
use these values meaningfully in a
| | 02:01 | business situation?
| | 02:02 | The answer is that you can model your
business processes and get an idea of how
| | 02:07 | many orders you could receive, so
that you can analyze your data and your
| | 02:11 | business operations.
| | 02:12 | The first thing we need to do is create
a series of random values between 0 and
| | 02:17 | 1, and to do that we use the RAND
function. Type =RAND() and press Return.
| | 02:28 | When I do, Excel displays a series of
random values in the second table that I have
| | 02:33 | here in columns D and E.
| | 02:35 | Now I don't want these value changing
every time I recalculate the worksheet,
| | 02:38 | and they would because rand is a
volatile function that updates its value every
| | 02:43 | time the worksheet changes,
| | 02:45 | so I will select the entire
column, and then click Copy.
| | 02:50 | I could also press Ctrl+C. And then on
the Clipboard group of the Home tab of the
| | 02:56 | Ribbon click the Paste button's
down arrow, and click Paste Values.
| | 03:01 | When I do, Excel replaces the
formula with the result of the formula.
| | 03:05 | Now I can create a series of VLOOKUP
formulas that find the corresponding
| | 03:09 | probability--say, for example, 0.16--
over here in the Probability table.
| | 03:16 | So the value 0.16 would return 4
because the value of 0.28 is larger than 0.16,
| | 03:24 | and the value of 0.15 is smaller,
so that means that we will have the
| | 03:28 | probability of four events during that hour.
| | 03:32 | So the formula that I create is
=VLOOKUP(, and now the Lookup value. The value
| | 03:41 | I'm going to be looking up is in cell D4.
| | 03:45 | So type that in. Type a comma.
| | 03:47 | Now we have the table_array, and that's
the name of the table that contains the
| | 03:51 | values. And that is Poissontable.
Just type that in. And then we have the
| | 03:59 | column index number.
| | 04:00 | The column index number is the number
of the column, counting from left to right,
| | 04:04 | that will return the value for this formula.
| | 04:07 | In this case, I want the values
in the second column, column B:
| | 04:11 | the number of events during a given
time period, so I will type in 2. And then
| | 04:16 | we need to put in whether we want an
approximate match or an exact match.
| | 04:20 | In this case, we are allowing an
approximate match because it's extremely
| | 04:24 | unlikely that any of these values will
match up to any of these values, so we
| | 04:28 | are looking for less than or equal to.
| | 04:31 | That means we can use an approximate
match, and I will press Tab to accept TRUE.
| | 04:36 | Everything looks good. Type right
parenthesis to close, and we receive a series
| | 04:42 | of values that indicate the number of
orders that will occur within the next
| | 04:46 | time period--in this case an hour.
| | 04:48 | Like the exponential distribution the
Poisson distribution lets you model your
| | 04:52 | company's operations by creating random
numbers of orders per hour based on the
| | 04:56 | known or assumed average.
| | 04:57 | Also like the exponential
distribution, you can use the Poisson
| | 05:00 | distribution most effectively if
you know how to build programs using
| | 05:03 | Visual Basic for applications.
| | Collapse this transcript |
| Calculating dependent trials using the hypergeometric distribution| 00:00 | If you've ever read anything about
statistics, or math in general, you might have
| | 00:04 | encountered problems that tell you you
have a vase with 15 marbles, 10 of which
| | 00:08 | are red and 5 of which are blue, and
asks you to calculate the odds of selecting
| | 00:12 | 4 marbles, and picking exactly 2 red and 2 blue.
| | 00:16 | Although these situations are somewhat
rare, business analysts do occasionally
| | 00:20 | need to perform this type of
calculation, which is described by the
| | 00:24 | hypergeometric distribution.
| | 00:26 | Now, let's suppose your company wants
to give away cash prizes in a drawing,
| | 00:30 | which is structured as follows.
| | 00:32 | You have 100 envelopes in a barrel.
20 of the envelopes contain $500, and the
| | 00:37 | other 80 contain $20.
| | 00:39 | A lucky customer can draw 10 envelopes,
and keep whatever money they find.
| | 00:42 | What you want to know are the odds of a
customer drawing no $500 envelopes, one
| | 00:47 | $500 envelope, and so on.
| | 00:50 | To calculate the probability of
drawing a specific number of target items--in
| | 00:54 | this case $500 envelopes--from a
collection, you need to know four things:
| | 00:58 | The specific number of items drawn--
that means the items drawn of 10.
| | 01:04 | As I stated earlier, you're allowing the
customer to draw 10 envelopes from the barrel.
| | 01:08 | Second, you need to know the number
of target items in the collection, and
| | 01:12 | that is the number of $500
envelopes, and that is 20.
| | 01:16 | Then you need to know the total number of
envelopes that are available--that is 100.
| | 01:20 | Finally, you need to know
the number of $500 draws.
| | 01:25 | In other words, you want to calculate
the probability that the customer will
| | 01:29 | draw zero $500 envelopes, one
$500 envelope, two, and so on.
| | 01:34 | I'll show you some more things that you can do
with those numbers toward the end of this movie.
| | 01:39 | But for now, let's just find the
probability of your lucky customer drawing
| | 01:43 | zero $500 envelopes.
| | 01:45 | I'll click in cell E2, and type "=hyp".
| | 01:51 | The bottom function after I type
"hyp" is hypergeometric distribution.
| | 01:56 | I won't try to pronounce the abbreviated form.
| | 02:00 | But when that is highlighted, and I
press the Down Arrow key to highlight it,
| | 02:03 | I'll press Tab, and we have the function.
| | 02:06 | Now, I can start adding in the arguments.
| | 02:08 | So we have the number of successes.
| | 02:09 | That's the first argument,
and that value is in cell A2,
| | 02:13 | in other words, the number of $500
envelops they draw. Then a comma.
| | 02:17 | Next is the number of items that are drawn.
| | 02:20 | That's in cell B2. The number of
successes that are available in the population.
| | 02:24 | In this case, that means
the number of $500 envelopes.
| | 02:27 | So, we can put that in C2.
| | 02:30 | Then the number of items in the
entire population, including both successes
| | 02:33 | and what we're determining failures
or $20 envelopes, and that is 100, and
| | 02:37 | that is in cell D2.
| | 02:39 | Press the right parenthesis, press
Return, and Excel fills the formula down to
| | 02:44 | the rest of the cells in this table column.
| | 02:47 | So we can see that the probability of
drawing zero envelopes with $500 in them
| | 02:52 | is about 9.5%, the probability of drawing
one is 26, almost 27%, the probability of
| | 02:58 | drawing two is about 32%, and so on.
| | 03:02 | Now to do further calculation, we
need to determine what each prize
| | 03:05 | drawing would be worth.
| | 03:06 | So, for example, if a customer draws zero
$500 envelopes, how much money would they get?
| | 03:12 | The way that we calculate that is
we type equal, and then the number of
| | 03:16 | $500 draws times 500.
| | 03:19 | So that would be (A2*500)+(10-a2)*20.
| | 03:36 | So in other words, we are taking the
number of $500 draws from cell A2, and
| | 03:40 | multiplying it by 500, and we're adding 10
minus that number, and multiplying it by 20.
| | 03:46 | Press Return, and we see that the value
for each of the prizes goes up by $480.
| | 03:51 | When we replace a $20 envelope with a
$500 envelope, the difference between
| | 03:56 | those two values is 480.
| | 03:57 | So these values make sense.
| | 03:59 | We can see that if you draw one $500
envelope, you get 680, 1160 if you draw
| | 04:05 | two, and so on, all the way up to--if
you get incredibly lucky, run between the
| | 04:09 | raindrops and draw ten $500 envelopes,
you would win $5,000, which is the
| | 04:14 | maximum price available.
| | 04:16 | Now, we can calculate
what's called the expected value.
| | 04:19 | In other words, you multiply the
probability of a particular outcome by the
| | 04:23 | value or prize for that outcome.
| | 04:26 | After I calculate the expected value
for each one of the possibilities, then
| | 04:30 | I can add all those expected values together
to get the total expected value for the drawing.
| | 04:35 | So to do that, I will just
multiply the probability by the prize.
| | 04:39 | So that's =E2*F2. Press Return and
we get our expected value for each of
| | 04:49 | these possibilities.
| | 04:50 | Now, note that the values are small
enough for drawing nine and drawing ten that
| | 04:55 | the expected value is displayed as 0.
| | 04:57 | But if I increase the number of
decimal points for the entire column, you see
| | 05:03 | that if you look far enough that there
is an expected value, but it's about one
| | 05:08 | ten-thousandth of a cent.
| | 05:10 | Now what I can do is add
a total row to the table.
| | 05:13 | To do that, click any cell in the table.
| | 05:15 | Then on the Design contextual tab, in
the Table Style Options group, click the
| | 05:20 | Total Row check box. We get the sum.
| | 05:24 | We see that the expected value is $1,160.
| | 05:28 | The hypergeometric distribution doesn't
come up very often in business, but if
| | 05:32 | you want to structure a prize drawing
of the type I described, you'll know how
| | 05:35 | to figure out the odds and the expected value.
| | 05:38 | Check your local laws, make sure this
sort of thing is legal, and be sure to set
| | 05:41 | aside enough money to pay the maximum
prize in case someone gets really lucky.
| | Collapse this transcript |
| Calculating probabilities, permutations, and combinations| 00:00 | If you studied probability in school,
you've most likely calculated odds, and
| | 00:04 | found the number of ways you can
combine various items into groups.
| | 00:07 | In this movie, I'll show you how to
calculate those probabilities, and when to
| | 00:11 | use which technique.
| | 00:12 | To calculate the probability that
something will happen, you need to divide the
| | 00:16 | number of desired outcomes by the
total number of possible outcomes.
| | 00:20 | For the purposes of this
exercise, we are assuming that all the
| | 00:25 | probabilities are equally likely.
| | 00:27 | For example, if you flip a coin--at
least in theory--it is equally likely to
| | 00:31 | come up heads or tails, 50% either way.
| | 00:34 | But let's see how we calculate that value.
| | 00:35 | Well first, we need to know
the number of possible outcomes.
| | 00:39 | So in cell C4, I will type 2, because we
have heads and tails as possible outcomes.
| | 00:45 | Then the number of desired outcomes,
which is the probability that a coin
| | 00:49 | will land heads up,
| | 00:50 | there is only one way we can succeed, so type 1.
| | 00:54 | Then we can divide the number of
desired outcomes, which is in cell C5, by the
| | 01:00 | total number of possible outcomes,
which is in C4, press Return, and we get a
| | 01:06 | probability of 0.5, which is correct.
| | 01:08 | When you flip a coin, it can land
either heads or tails and the probability
| | 01:12 | of either one is 0.5.
| | 01:13 | Now, let's say that you want to
calculate the probability that a coin will land
| | 01:17 | heads up twice in a row.
| | 01:19 | To do that, you need to multiply the
probability of it happening once by the
| | 01:23 | probability of it happening once again.
| | 01:25 | So, for example, what is the probability
of a coin landing heads up? Well, it's 0.5.
| | 01:30 | We knew that from above.
| | 01:31 | What's the probability that it happens again?
| | 01:34 | These are independent trials.
| | 01:36 | The coin has absolutely no memory
that it landed heads up, so it is 0.5.
| | 01:42 | To calculate the total probability,
you multiply those two values together.
| | 01:46 | So it's =C9*C10. Return. And you get a
total probability of a coin landing heads
| | 01:54 | up twice in a row of 0.25, or 1 out of 4.
| | 01:58 | If you visualize the situation, it makes sense.
| | 02:01 | If you flip two coins, there are
four ways that they can come up:
| | 02:04 | either two heads, which is the success
that we're looking for, or they can come
| | 02:09 | up heads and tails, tails
and heads, or two tails.
| | 02:13 | So you have one success, three
failures, and one out of four is 25%.
| | 02:19 | Now, let's say that you want
something or something else to happen.
| | 02:24 | In other words, if you roll a
standard 6-sided die, you want to know the
| | 02:27 | probability of a 1 or a 2.
| | 02:30 | To calculate that, you need to know
the probability of a 1 coming up, and
| | 02:33 | that is 1 out of 6.
| | 02:35 | So, in cell F4, I'll just type =1/6.
| | 02:40 | Then there is the probability of a 2.
| | 02:42 | That's exactly the same.
| | 02:43 | You only have one 2 on a 6-sided die,
so the probability is =1/6. The total
| | 02:51 | probability of getting one thing or
another thing where you have two possible
| | 02:54 | successes is to add the probabilities together.
| | 02:58 | So you have =F4+F5. Return. And you see
you get a total probability of 0.33 or
| | 03:07 | one-third, which makes sense, because
2 out of the 6 sides of the die will be
| | 03:12 | considered successes,
| | 03:14 | so that means you have 2/6, which is
one-third, which is what we see here.
| | 03:18 | The final example I use when it comes
to analyzing this type of probability is
| | 03:22 | the probability that something won't happen.
| | 03:25 | To do that, you subtract the
probabilities of something happening from one.
| | 03:30 | So, for example, let's say that I
don't want the die to come up 1 or 4.
| | 03:33 | Well, what's the probability of a 1?
| | 03:36 | Well, that's 1/6th, because it's 1 of
6 possible values. So it's =1/6.
| | 03:42 | The probability of a 4
coming up is exactly the same.
| | 03:44 | There is only one 4 available on
a 6-sided die. So that's =1/6.
| | 03:49 | Then the total probability is =1-(F9+F10).
| | 03:59 | So remember, these are the
probabilities that I don't want,
| | 04:03 | so I subtract them from 1.
| | 04:05 | When I press Return, we get 0.666667, which
is equivalent to two-thirds, which make sense.
| | 04:12 | I don't want a 1 or a 4 to come up.
| | 04:14 | So that means that there are four
possible ways to succeed rolling any number
| | 04:18 | other than a 1 or a 4.
| | 04:20 | That means you divide 4 by 6, which
is two-thirds, which is the value that
| | 04:25 | you see in cell F11.
| | 04:26 | Now, let's look at a different type of
probability calculation, and those are on
| | 04:30 | the second worksheet.
| | 04:33 | These are combinations and permutations.
| | 04:36 | In business, suppose you create random grab
bags of items that customers can purchase.
| | 04:40 | If there are 15 different items
divided into 4 item bags, how many different
| | 04:44 | combinations are there?
| | 04:46 | In this case, we're not worried about order.
| | 04:48 | In other words, if you reach into a bag,
and you pull out DVDs, you don't care
| | 04:52 | of the order that they come out.
| | 04:53 | You just want to know what you got.
| | 04:55 | So in this case, we're going to
try to figure out how many different
| | 04:57 | combinations there are of 15
DVDs when there are four items per bag.
| | 05:02 | So the information we need is
15, which I will type in cell C5.
| | 05:08 | That's the total number of
different DVDs that are available.
| | 05:11 | Then we have the number of items per
bag, which is 4, which I type into cell C6.
| | 05:16 | Then we have the total combinations.
| | 05:18 | To calculate the number of
combinations, we type Equal, and then we use the
| | 05:22 | combinations or COMBIN function, left
parenthesis, and then we have the number
| | 05:28 | which is the number of
different DVDs that are available.
| | 05:31 | So type C5, which is that cell reference,
a comma, and then we have the number of
| | 05:36 | items per bag, which is 4,
and that is in cell C6.
| | 05:41 | Type a right parenthesis.
Everything looks good. Return.
| | 05:44 | We see that we get total combinations of 1,365.
| | 05:48 | Permutations are like combinations,
but in this case order matters.
| | 05:53 | So, for example, suppose you have eight
people working in a tradeshow, and you want
| | 05:57 | to assign one person to
each of four shifts in your booth.
| | 06:00 | How many different booth
shift orders are possible?
| | 06:03 | Well, to do that, you need
to use a permutations formula.
| | 06:07 | So first, I need to put in my total
number of individuals, and that is 8, and
| | 06:13 | then the total number of
shifts, which is 4. Press Enter.
| | 06:17 | Now, I can create my permutations formula.
| | 06:19 | That is =PERMUT(, and then we use
the same arguments as we did for the
| | 06:27 | combinations formula.
| | 06:29 | So we have the number, which is the
number of individuals, and that is in cell
| | 06:32 | C11. Type a comma, and then
we have the number chosen.
| | 06:36 | In this case, that is the number of
shifts, and that is in C12. Type a right
| | 06:41 | parenthesis. Everything looks good.
| | 06:44 | We see that we have a total
number of different orders of 1,680.
| | 06:48 | You can notice that even though we had
15 DVDs to choose from and 4 items per
| | 06:53 | bag, the total number of combinations
is less than the number of permutations
| | 06:58 | when you select 4, but only
from a group of 8 individuals.
| | 07:03 | When order matters, the number
of possibilities expands rapidly.
| | 07:07 | Now finally, a word of warning about
probabilities, and I'll switch back to the
| | 07:11 | Probabilities worksheet, so you
can see what I'm referring to.
| | 07:14 | You must avoid what's called the
gambler's fallacy, which is thinking that an
| | 07:18 | outcome is due because it
hasn't happened recently.
| | 07:21 | If you flip a fair coin, and it
comes up heads three times in a row, the
| | 07:24 | probability that it will come up heads the
next time you flip it is still 50%, not 6.67%.
| | 07:32 | The probability that a coin will come
up heads four times in a row before you
| | 07:35 | make the first flip is 6.67%, or 1 out of 16.
| | 07:40 | But after it's come up heads three times,
you're still even money to get a fourth.
| | 07:45 | The coin is not due to come up heads.
| | 07:47 | It's an easy mistake to make, but
remember that independent trials are
| | 07:50 | just that, independent.
| | Collapse this transcript |
|
|
6. Sampling Values from a PopulationSelecting effective data samples| 00:00 | Whenever you want to select members of
a data set as samples for your analysis,
| | 00:04 | you need a selection strategy
that produces usable, unbiased data.
| | 00:08 | In this movie, I'll tell you
how to select values you can use.
| | 00:11 | The short answer, which I'll elaborate
on a bit, is to select your data randomly.
| | 00:16 | The more random you can make your sample,
the more valid your results will be.
| | 00:20 | So, for example, surveying customers
who enter your store during a particular
| | 00:23 | hour could significant bias your data
by focusing on a demographic group that
| | 00:27 | frequents the stored during those times.
| | 00:29 | Another way that you could work with
your customers and select them randomly
| | 00:33 | is by selecting, say, every fifth customer,
which I will do with my customer list here.
| | 00:38 | Now I have a series of 10 rows with 10
customer IDs, and what I want to do is to
| | 00:46 | select every fifth customer. And that
means that I want to find any row with a
| | 00:50 | number that is divisible by five.
| | 00:53 | That's the easiest way to
select every fifth value.
| | 00:56 | So what I'm going to do is create an
IF formula, =IF, and then a left
| | 01:01 | parenthesis and then within the If
formula, I want to find a way to test if the
| | 01:06 | row number that I'm in is evenly divisible by 5.
| | 01:10 | If that's the case, then I will
treat it as every fifth value.
| | 01:14 | And the way that you do that is you use MOD.
| | 01:17 | MOD returns the remainder after a
numbers divided by another number.
| | 01:22 | And what we're going to be testing
for is whether that remainder is zero.
| | 01:27 | So we have MOD, and then
I'm looking for row number.
| | 01:29 | So I'll type a left parenthesis and type in ROW.
| | 01:33 | That returns the row number of a reference.
| | 01:35 | In this case, it will be within another
set of parentheses, and that will be cell A2.
| | 01:42 | And I will leave the cell reference
relative, so that it will change when I
| | 01:46 | copy the formula down.
| | 01:47 | I close the TOW function with a
right parenthesis. Type a comma, and now I
| | 01:52 | can put in my divisor.
| | 01:54 | And again, I'm working in the MOD function
which needs a number, which is the row number.
| | 01:59 | And then I need the divisor, and that is
the number 5, because I want every fifth row.
| | 02:04 | Type a right parenthesis. And I'm checking
to see if the row is evenly divisible by 5--
| | 02:11 | in other words, that the
modulo or modulus is equal to zero.
| | 02:17 | So I type "=0,", and
now I'm back in my IF function so I need
| | 02:23 | to tell Excel what to do if the value is true,
and I'll just have it display the value true.
| | 02:28 | So, I type "True". Type a comma, and
| | 02:33 | now what to do if the value is
false, and I'll just a display false.
| | 02:37 | So I put False in double quotes.
Everything looks good. Type a right parenthesis
| | 02:42 | to close the IF function, press Return,
and Excel correctly returns the result
| | 02:47 | that this row number of row 2
is not evenly divisible by 5.
| | 02:51 | Now, that doesn't mean the formula is
working correctly, but if I copy it down
| | 02:55 | by dragging the fill handle until it
covers cell B11, we'll see that row 5 is
| | 03:02 | True and row 10 is True,
| | 03:03 | so my formula is working correctly.
| | 03:06 | If you want to select every fifth
customer, that's one way you can do it.
| | 03:09 | Now technically, my first
row in this table is a heading.
| | 03:14 | So that means that the first customer I
select will actually be the fourth customer,
| | 03:18 | so you have customer 4, customer 9, and
so on. But the important point is that
| | 03:23 | we are selecting every row
number that is divisible by five.
| | 03:27 | So after the first one, which is the forth,
every fifth customer after that will be selected.
| | 03:32 | Now let's say that you wanted a probability
that an individual customer would be selected.
| | 03:37 | So instead of every fifth customer,
let's say that you want a 25% probability
| | 03:41 | that any individual customer would be selected.
| | 03:44 | You can do that, and I will
replace the formulas here.
| | 03:48 | So I'll type "=IF(",
and I'm going to use a RAND value, and
| | 03:54 | that is a random value between 0
and 1, and it's a decimal value.
| | 03:58 | So Rand(). And I'm going to check to
see if the value is <0.25, because we
| | 04:08 | want a 25% probability.
| | 04:10 | Type a comma. And if the value is true,
I'll have it display True, so I am going to put
| | 04:15 | True in double quotes.
| | 04:18 | And then if is value is false, say
False so we have a double quote, False, close
| | 04:23 | the double quotes. Right parenthesis.
Everything looks good. Press Return and
| | 04:29 | this value happened to come up False this time.
| | 04:31 | If I drag the fill handle from cell B2
down, you'll see that I had three values
| | 04:38 | out of 10 that are true.
| | 04:40 | If I recalculate the worksheet by
going to the Formulas tab and clicking
| | 04:44 | Calculate Now, then you'll see
that the random value has changed.
| | 04:50 | If I want to keep the values that I
have right now, I can select them, and then
| | 04:56 | on the Home tab of the Ribbon, copy--I
can also press Ctrl+C to copy--click the
| | 05:01 | Paste button's down
arrow and click Paste Values.
| | 05:04 | Now, if I were to recalculate the
worksheet--and I'll press Escape to undo the
| | 05:09 | selection here--click Formulas,
Calculate Now nothing changes, because these are
| | 05:14 | no longer formulas they are actually the values.
| | 05:17 | That's a useful technique you can use
to select random values from a relatively
| | 05:21 | short list, and by short in this case,
I mean maybe several hundred values.
| | 05:25 | When you get lists beyond that, which are
very common in business, then it can get
| | 05:30 | a bit unwieldy to use this technique.
| | 05:32 | So instead what you can do is have
what's called a trigger value, which looks for
| | 05:37 | a particular instances, such as a
customer who orders at least six bottles of
| | 05:42 | jalapeno-infused olive oil. And then you
select the next order. Perhaps you would
| | 05:47 | give that customer survey,
| | 05:48 | you would offer that customer a coupon,
something like that, but the idea that
| | 05:52 | I want to get across is that you can use a
trigger event to select the next order that occurs.
| | 05:58 | You never know. There might be a specific group of
customers that orders six bottles of jalapeno
| | 06:02 | olive oil at the same time, and you
don't want to focus on customers who have
| | 06:06 | something in common.
| | 06:07 | You want to select randomly.
| | 06:09 | So the formula we're going to
create is a little more complicated.
| | 06:12 | It is a nested IF function.
| | 06:14 | That means we're going to check for a
condition in the IF function, and then if
| | 06:18 | it isn't true, we're going to
create another IF function within it.
| | 06:22 | So I'll type "=IF(", and now I need to
set up my logical test. The cell that I'm
| | 06:29 | going to check is cell E1, and I will
explain why it's E1 and not E2 in a second.
| | 06:35 | And because I'm creating a nested IF
function, I need to test for the opposite
| | 06:38 | condition of what I actually want.
| | 06:40 | So in other words, I want to find rows
that contain jalapeno, but because the
| | 06:45 | second IF function needs to go in the
value_if_false argument area, I need to
| | 06:50 | say that E1 is not equal to jalapeno.
| | 06:54 | So to type not equal, you hold down the
Shift key and type a less than sign and
| | 06:59 | then a greater than sign.
| | 07:01 | Now, I can indicate the value jalapeno
by typing a double quote, because it's a
| | 07:06 | text string J-A-L-A-P-E-N-O and close
the double quote. And now, I can type a
| | 07:13 | comma. And the reason that I am using
cell E1 instead of cell E2 is because I
| | 07:19 | want my formula to display true for
rows where the order previous to it was for
| | 07:25 | at least six bottles of
jalapeno-infused olive oil.
| | 07:28 | So in other words, I want this row with
the garlic 4 values, because that is the
| | 07:35 | row directly below jalapeno with 6.
| | 07:38 | So to do that, if I want this cell to
be true, I need to check cells E8 and F8.
| | 07:43 | So I just start doing that here at the top.
| | 07:45 | So I have jalapeno, and now
what to do if the value is true.
| | 07:50 | In other words, if it's not equal to
jalapeno, well, that's not what I want, so I
| | 07:54 | will have it display False.
| | 07:55 | To do that, I type a double quote the
word False and then close the double
| | 07:59 | quotes, now a comma. Now I'm in the value
with false argument of the IF function.
| | 08:04 | So I can type another IF, left parenthesis,
and I'm checking if the value in cell
| | 08:10 | F1 is greater than, or equal to, 6. Type a comma.
| | 08:15 | What do I do if the value is true?
| | 08:17 | Well, have it display true.
| | 08:18 | So I type a double quote the word
"True" double quote again. Type a comma.
| | 08:23 | And if the value is false, in other
words if it is jalapeno, but it's not
| | 08:28 | greater than or equal to six
bottles, then I have it display False.
| | 08:31 | So I type a double quote False double quote.
| | 08:35 | So I have my second IF
function with my right parenthesis.
| | 08:38 | I need another right parenthesis to
close the first one. Press Return and I
| | 08:43 | get my result of False.
| | 08:44 | Now if I copy the
formula down to cover cell G11,
| | 08:48 | so I'll grab the fill handle at the
bottom-right of cell G2, copy the formula
| | 08:53 | down, and you'll see that row 9--
this order here garlic 4--is correctly
| | 09:01 | identified because it is the order
immediately following an order for jalapeno
| | 09:06 | with at least 6 bottles.
| | 09:08 | Selecting sample values at random
is harder than it sounds, but be sure
| | 09:12 | that your selection method doesn't
introduce a hidden bias that can skew your results.
| | 09:16 | Remember, random is best.
| | Collapse this transcript |
| Inferring a population mean from a sample mean| 00:00 | Companies often interact with
data sets that are difficult to
| | 00:02 | summarize effectively.
| | 00:04 | For example, the Two Trees Olive Oil
Company might have shipped several hundred
| | 00:07 | wholesale orders over the past few months,
| | 00:09 | but you want to analyze the sample of
the values, and try to infer information
| | 00:14 | about the entire order population.
| | 00:16 | In this movie, I'll show you how to
analyze the average value of a sample, and
| | 00:21 | use that estimate to make judgments
about the average of the entire population.
| | 00:24 | As a note, this technique only works
for data sets with 30 or more values,
| | 00:28 | so you need to have a large sample size.
| | 00:31 | The first step is to find the mean, or
the average, of the selected sample values.
| | 00:36 | To do that, in cell D3,
type the formula: =average--
| | 00:42 | and average and mean are the same
thing. Average and then the name of the
| | 00:46 | table. On the left side of the screen
is Orders, so I type in Ord and then Tab
| | 00:52 | to accept the formula AutoComplete
reference, then a left square bracket, and
| | 00:57 | the table column we want is Total,
| | 00:59 | so I just type a T and then press
Tab to accept the AutoComplete.
| | 01:03 | Right square bracket and then a right
parenthesis to close the formula, and press
| | 01:08 | Enter, and we see that the
average of the sample values is 4,899.
| | 01:13 | Next, we need to calculate the
standard deviation of the sample.
| | 01:17 | We sampled 50 values from a
list of several thousand orders,
| | 01:20 | so we will use this formula. It's =stdev,
| | 01:25 | which is standard deviation for a sample,
left parenthesis, and then we want the
| | 01:30 | standard deviation of the values in this table.
| | 01:34 | So I'll click the Totals header
when my mouse pointer is a downward-
| | 01:37 | pointing black arrow.
| | 01:38 | That will select the entire column.
And right parenthesis, that ends that
| | 01:43 | portion the equation.
| | 01:44 | Now divide it by the
square root, and that is sqrt.
| | 01:48 | That's the square root function in
Excel. Left parenthesis. And then what we
| | 01:53 | want is the count of the orders in this column.
| | 01:57 | So square root, then we use the COUNT
function. Another left parenthesis. And we
| | 02:02 | are counting the number of entries in
the Orders table's Totals column, so
| | 02:07 | select the column again, and now a
right parenthesis to close that portion of
| | 02:14 | the equation that was the COUNT function.
| | 02:17 | Now another right parenthesis, and
that closes the square root function.
| | 02:21 | So we have the standard deviation
divided by the square root of the count of
| | 02:24 | the number of orders in that table column,
press Enter, and we get the value of 254.896.
| | 02:30 | Now we can apply the
central limit theorem to our data.
| | 02:33 | The central limit theorem states that
when you select simple random samples of
| | 02:37 | size n from a population, the
sampling distribution of the sample mean can be
| | 02:43 | approximated by a normal probability
distribution as the sample size becomes
| | 02:47 | large--and by large, we mean 30 or more entries.
| | 02:51 | Stated more simply, what that means is
that when you have a sample size of 30
| | 02:55 | or more values, you can assume
that the entire data distribution is
| | 02:59 | distributed normally.
| | 03:01 | That is what makes it
possible to use these techniques.
| | 03:04 | So now we know the mean
and the standard deviation.
| | 03:07 | So we could ask questions about the data.
| | 03:09 | For example, how likely is it that we
got a sample of size n--in this case
| | 03:13 | size 50--with a sample mean between 4,700
and 5,200, if in fact the population mean
| | 03:20 | is actually some other value.
| | 03:23 | To find that out, you can calculate
the probability that a value is less than
| | 03:27 | the upper bound, in this case 5,200,
by creating this formula in cell D6.
| | 03:33 | So you type "=normdist", and this is the
Normal Distribution function, so that's
| | 03:40 | why we need to be able to assume that
this data is distributed normally, value x,
| | 03:45 | which is 5200, the mean, which is in
cell D3, comma, then the standard deviation,
| | 03:54 | which is in cell D4, comma, and then
finally, we need to indicate whether we want
| | 03:59 | it to be cumulative or true.
| | 04:02 | Cumulative would find the probability that
the population mean is any value less than 1,500.
| | 04:08 | If we were to say false, in other words
that we wanted the probability density
| | 04:12 | function or the individual value
function, then that means we would be
| | 04:16 | calculating the probability that
the population mean is exactly 5,200.
| | 04:20 | That's extremely unlikely.
| | 04:22 | It would be a very low probability.
| | 04:23 | So instead, what we want to do is
discover if the population mean is less than
| | 04:28 | 5,200, or the probability that it is.
| | 04:31 | So we can type "True", right
parenthesis to close, press Return, and we get a
| | 04:38 | probability of about .88,
| | 04:41 | so about an 88% probability that
the population mean is below 5,200.
| | 04:45 | Now let's do the same thing
to discover the probability that the population
| | 04:49 | mean is less than 4700.
| | 04:52 | To do that, type a similar formula.
It's =normdist(4700.
| | 05:00 | The mean is in cell D3, comma, standard
deviation, D4, comma and again True. So we
| | 05:06 | get the cumulative function or
probability, in other words, the probability
| | 05:11 | that the mean is less than 4,700.
Right parenthesis. Return. And that
| | 05:15 | probability is about 211/2 %.
| | 05:19 | If we want to find the probability
that the population mean is between 4,700
| | 05:23 | and 5,200, we can subtract the lower
bound, in this case 4,700, from the upper
| | 05:28 | bound, which is 5,200.
| | 05:30 | To do that, we create the formula
=D6-D7. Press Return and we get a
| | 05:38 | probability of about two-thirds,
about 66.4% that the population mean is
| | 05:43 | between 4,700 and 5,200.
| | 05:46 | In this case, it is about two-
thirds likely that the true mean value of
| | 05:50 | wholesale orders is between $4,750 and $5,200.
| | 05:54 | If you were to change the values in
the formulas, you could determine the
| | 05:57 | probability that the mean is
between 4,500 and 5,400, and so on.
| | 06:02 | Once you know those probabilities, you
can use your domain knowledge to assess
| | 06:06 | your information and make
decisions about your business.
| | Collapse this transcript |
| Inferring population distributions based on a proportion of the data set| 00:00 | In the previous movie, I showed you how
to estimate the mean of population data
| | 00:04 | based on the mean of sample
values taken from that population.
| | 00:07 | In this movie, I'll show you how to
infer some other characteristics of
| | 00:10 | the population's data distribution,
such as the population proportion,
| | 00:15 | given a similar sample.
| | 00:17 | Our goal is to find the expected value
and standard deviation of all possible
| | 00:21 | samples of a given size from the population.
| | 00:24 | In this case, I'll use a sample size of 50,
which is taken from a population of 3,000 values.
| | 00:30 | The business case, or business scenario,
is that we know there was 25% chance that
| | 00:36 | any individual visitor to your
web site would be offered a coupon.
| | 00:40 | So the expected value is 0.25.
| | 00:44 | So in other words, if I were to visit
the Two Trees Olive Oil Company's web
| | 00:46 | site, there is 25% chance that I would
be offered that coupon on this visit.
| | 00:53 | I might see it again, but
that's the probability per visit.
| | 00:57 | First thing we need to do is
calculate the standard deviation.
| | 01:02 | To do that, we can use the number of
sample values and the expected value.
| | 01:06 | And we can use the expected value
instead of taking the mean from another data
| | 01:11 | set, because we know that the average
should be 25%, so we can just use that.
| | 01:16 | So the formula for the standard deviation is
| | 01:18 | =sqrt(, the value in B4,
which is the expected value, multiplied by,
| | 01:30 | the left parenthesis, 1-B4, right
parenthesis, divided by the value in B3,
| | 01:37 | which is the number of sample values.
| | 01:39 | Type a right parenthesis to close the
formula, press Enter, and we get the
| | 01:45 | standard deviation of 0.06. And that make sense;
| | 01:49 | our expected value is .25, so it makes
sense that our standard deviation would
| | 01:54 | be a lower value of 0.06.
| | 01:57 | Now we can calculate the probability
that the number of customers who were
| | 02:00 | actually offered the
coupon would be 28% or less.
| | 02:04 | To do that, we click in cell B8.
Equal sign. And because our data, we're assuming is
| | 02:11 | normally distributed, we can use the
NORMDIST function. Left parenthesis.
| | 02:18 | And then the value we are
testing is x, and that is 0.28.
| | 02:23 | The mean is the expected
value. That's in cell B4.
| | 02:27 | The standard deviation is in B6.
| | 02:30 | We just calculated that. And then
we need to decide whether we want the
| | 02:34 | cumulative or the point estimate for
28%. In this case, we want 28% or less, so
| | 02:40 | we are interested in the cumulative probability.
| | 02:43 | So I'll type "True". Right parenthesis
to close the formula, Return, and we see
| | 02:49 | that the probability is 0.68.
| | 02:51 | So in other words, it's about 70%
likely that 28% or fewer customers would have
| | 02:57 | been offered the coupon.
| | 02:58 | Now we can calculate the same
probability for 22% or less.
| | 03:03 | To do that, we create a similar formula.
It's =NORMDIST, and the value we want to test is .22.
| | 03:12 | The mean is the expected value in cell B4,
standard deviation is in cell B6, and
| | 03:19 | we are interested in the cumulative
probability, so it's "True". Right parenthesis,
| | 03:24 | press Return, and we get
the probability of 0.31.
| | 03:28 | So now we can create a formula in cell
B11, which calculates the probability that
| | 03:34 | between 22% and 28% of our
customers would've been offered the coupon.
| | 03:39 | So to do that, we just subtract the 22%
or less area from the 28% or less area.
| | 03:47 | So its =B8-B9. Press Return
and we get the value of 37%.
| | 03:56 | Based on this analysis, we can see that
it's about 37% likely that between 22%
| | 04:01 | and 28% of our customers will
have been offered the coupon.
| | 04:05 | If we want to take that range higher or
lower, we can change the expected value.
| | Collapse this transcript |
| Estimating the size of a population given numbered samples| 00:00 | Businesses are always trying to get
information on their competitors, whether by
| | 00:04 | shopping their stores or using public
information to make guesses about their
| | 00:07 | performance and intentions.
| | 00:09 | In this movie, I'll show you how to
estimate the size of a population if you
| | 00:12 | have samples of a sequentially
numbered item, such as coupons distributed to
| | 00:17 | a store's customers.
| | 00:18 | So when might you want to
estimate the size of a population?
| | 00:21 | Well, in a competitor
intelligence situation, when you discover that
| | 00:25 | your competitor is giving out coupons--
and you can determine which one in
| | 00:29 | the sequence you get--
| | 00:30 | you can use that information to estimate
the total number of coupons handed out.
| | 00:34 | You could do the same thing for contests,
entries, raffle tickets, and so on.
| | 00:37 | So now let's say that in this situation
your competitor is giving out coupons--
| | 00:42 | and just for the sake of this demonstration,
I will let you know the actual size is 1500--
| | 00:48 | let's say that you have five coupons
that you found, and you've identified a
| | 00:53 | number that you know represents
each coupon within the sequence.
| | 00:57 | In other words, they have
a number from 1 to 1,500.
| | 01:01 | So what you want to do is find a
formula that will somehow estimate the number
| | 01:05 | of coupons in the group.
| | 01:06 | So let's say that you have one row of
data--coupon 1 through coupon 5--and the
| | 01:13 | estimator is the formula that
I'm going to create in a moment.
| | 01:16 | So let's say that each coupon will be
numbered between 1 and 1,500, so I'll type
| | 01:21 | "=RANDBETWEEN" and press Tab to accept
the formal AutoComplete, and then in the
| | 01:28 | formula I will have the lowest number
of 1 and the highest number 1,500, and
| | 01:34 | type a right parenthesis to close the formula.
| | 01:37 | Now when I press Enter, Excel is going
to fill this formula down, throughout the
| | 01:42 | rest of the column, and I do that just
so I can show you the average of all of
| | 01:46 | the estimates at the end of the exercise.
| | 01:49 | We're actually only going to be
concerned, as I'm talking, with the first row.
| | 01:54 | So we have RANDBETWEEN 1 and 1,500.
Press Enter and there we have it, and I
| | 01:58 | will create the same formula =RANDBETWEEN(1,
1500), and the same. And I'm not copying it.
| | 02:10 | In other words, I am not dragging the
fill handle from cell A6 to B6 to C6,
| | 02:15 | because if I did that, then Excel
would not populate the rest of the table
| | 02:19 | column with the formula.
| | 02:21 | You have to type it in to
get it to do it properly.
| | 02:23 | So I'll type "=RANDBETWEEN(1,1500)"
and the final one "=RANDBETWEEN(1,1500)",
| | 02:37 | and there we have it.
| | 02:38 | So these are the values that I will be
working with. And I'm going to select
| | 02:43 | these table columns, and I did that by
moving my mouse pointer over the Coupon1--
| | 02:49 | that's the leftmost table column header--
and then when the mouse pointer changes
| | 02:53 | to a downward-pointing black arrow,
clicking, and then I held down and dragged
| | 02:57 | to select columns 1 through 5. So we are there.
| | 03:01 | Now, on the Home tab, I'll copy the data.
| | 03:03 | You could also press Ctrl+C to copy.
And then on the Paste button list, I'll
| | 03:09 | click Paste Values, and now instead of the
formulas, I have just the individual values.
| | 03:14 | So I'll press Escape to release the
selection, and now I can start thinking
| | 03:19 | about creating the formula.
| | 03:21 | There is no built-in formula in Excel
that you can use to estimate a population
| | 03:26 | given the series of numbered samples.
| | 03:27 | However, there is a formula that you can
use that was developed during World War
| | 03:30 | II that allows you to do it.
| | 03:33 | The first thing you need to know is the
number of coupons. So for that we have 5
| | 03:38 | coupons, 1 through 5, so I will just type in 5.
| | 03:41 | If you didn't know how many coupons
were available to you or how many coupons
| | 03:46 | that you had--at least not in advance--
then you could do a COUNT function or
| | 03:50 | something, but in this case it's
straightforward, so I'll just put in the value.
| | 03:54 | Now that you have that value,
you can create your formula.
| | 03:57 | So I'll go to cell F6 and type "=((" and then "H2"--
| | 04:05 | that's the number of coupons--and when
Excel copies the formula down, I don't
| | 04:09 | want that cell reference changing, so
I'll press F4 to make it an absolute
| | 04:14 | reference, and then "+1".
| | 04:17 | So in other words, I am
taking the number of coupons+1.
| | 04:19 | Then we have the right parenthesis, and I
will divide it by the number of coupons.
| | 04:25 | Again, I don't want that
cell reference changing,
| | 04:27 | so I will press F4 to make
it an absolute reference.
| | 04:31 | So what I've done on this side of the
equation--type in a right parenthesis--is
| | 04:36 | to divide the number of
coupons +1 by the number of coupons.
| | 04:41 | So in this case, it would be 6/5 or 1.2.
| | 04:44 | Now the second thing you do is you multiply--
typing an asterisk and then a left parenthesis--
| | 04:51 | you find the maximum value in your
range, so we have max( and then the range of
| | 04:58 | A6:E6), so that's the maximum value.
And then you subtract one. And over time
| | 05:07 | statisticians have just discovered
that this is the most accurate estimator
| | 05:12 | that they can create.
| | 05:13 | All I need to do now is type a right
parenthesis, and that will allow me to
| | 05:18 | complete the formula. Pressing Enter.
And Excel displays all of the estimations,
| | 05:24 | given these five coupons.
| | 05:26 | Now the first row has a
number of extremely small numbers.
| | 05:30 | So you have got 295, 230, and so on.
| | 05:32 | The highest number, in fact, is 729.
| | 05:36 | So what that means is that your
estimate is going to be low, much lower than
| | 05:39 | 1,500. And in fact, it's 873.
| | 05:42 | If you look at the other rows--just the
ones visible on this sheet--right now,
| | 05:46 | you'll see that many of the estimates
are pretty close: 1,536, 1,545, 1,166--not
| | 05:52 | that close--1,400, 1,744 and so on.
But now let's see what happens over time;
| | 05:57 | in other words, for every value in this
table, what is the average of the estimate?
| | 06:02 | To do that, type "="--and we are in
cell H3--"average(", and then we select the
| | 06:10 | Estimator column of the table, move
the mouse pointer until it's a downward-
| | 06:14 | pointing arrow, over Estimator column
header, click it, and Excel places the
| | 06:19 | reference in the formula. Type a right
parenthesis and hit Return, and you'll
| | 06:25 | see that the average estimate over
time is 1,496 and about three quarters, and
| | 06:31 | that is extremely close to the
actual population size of 1,500.
| | 06:36 | So it's possible that you
get a series of bad samples.
| | 06:39 | For example, if you were working with
the data in this first row, you might
| | 06:42 | think maybe there are only a thousand
coupons and I just found a bunch of the
| | 06:46 | lower numbered ones.
| | 06:47 | However, if you went to the second row
and had that data, then you would have a
| | 06:51 | fairly accurate estimate.
| | 06:52 | You would figure that
1,500 to 1,550 was about right.
| | 06:56 | But again, over time this method of
estimation, as you can see from the average
| | 07:01 | of all the values in the
table, is extremely accurate.
| | 07:04 | Estimating a competitor's customer
base is just one of the ways that you can
| | 07:07 | analyze their performance using statistics.
| | 07:10 | If you want to avoid having this sort
of analysis performed at your expense,
| | 07:13 | don't issue sequentially numbered coupons.
| | Collapse this transcript |
| Calculating a confidence interval for a large sample| 00:00 | If you've collected sample values from
a population, you can make inferences
| | 00:03 | about the population's mean, or
arithmetic average, based on those values.
| | 00:08 | In this movie, I'll show you how to
calculate a confidence interval when your
| | 00:12 | sample set contains 30 or more values.
| | 00:14 | In the next movie, I'll show you how to
calculate a confidence interval when you
| | 00:17 | have a small sample with less than 30 values.
| | 00:20 | A confidence interval is what you get
when you want to say, "I'm 95% sure that--."
| | 00:27 | So in other words, you want to have
a level of certainty based on your
| | 00:31 | statistics that allows you to say I am
95% certain that I'm going to see between
| | 00:36 | 45 and 75 customers in my store tomorrow.
| | 00:39 | So I am going to show you how to
calculate the confidence interval using this
| | 00:44 | Order data table, which has the total of
orders. And my goal is to find the range
| | 00:50 | in which I'm 95% confident that a
particular day's orders will fall.
| | 00:55 | First thing we need to do is find the
number of measurements and for that, I'll
| | 01:00 | just type "=COUNT(",
and now I can select this table column.
| | 01:07 | I move my mouse pointer over the column
header, and when it's a downward-pointing
| | 01:12 | black arrow, I click, and
that selects the entire column.
| | 01:15 | So it's State(New York. Press a right
parenthesis to close the formula, Return,
| | 01:22 | and I see that have 44 measurements.
| | 01:24 | Now to find the average, or the mean, I
can type "=average"--and this formula is
| | 01:29 | going in cell D4--left parenthesis.
And my mouse pointer is over here already,
| | 01:34 | so I will just select the same table
column, press a right parenthesis, press
| | 01:40 | Return, and I see that the average is 135.14.
| | 01:44 | Now to calculate the standard
deviation--in this case I'm just using the
| | 01:48 | standard deviation of the sample--
| | 01:49 | so I'll type "=STDEV"--
| | 01:52 | that's a standard deviation. For a
sample, standard deviation key would be the
| | 01:56 | formula that I'd use if I had an
entire data population, but I don't;
| | 02:00 | I just have the sample. So, left
parenthesis, and select the table column again,
| | 02:05 | right parenthesis to close, and I
have a standard deviation of 65.55.
| | 02:12 | Next I need to specify my confidence
level. And as I stated earlier, I want to
| | 02:17 | be 95% confident in my calculations,
| | 02:20 | so I will type in 0.95;
| | 02:23 | that's equivalent to 95%. Press Return.
| | 02:26 | The significance level is the
confidence level subtracted from 1.
| | 02:31 | So the formula to calculate that is =1-D7.
| | 02:38 | Press Return and we get 0.05.
| | 02:41 | Next, we need to calculate the
z-score for our particular confidence Level.
| | 02:45 | The z-score is the number of standard
deviations from the mean that a value occurs.
| | 02:51 | So in this case I am looking for the 95%
confidence Level, so I want to find the
| | 02:56 | z-score within which 95% of the
values will occur, plus or minus the mean.
| | 03:01 | To calculate that, I type
the absolute value of normsinv,
| | 03:09 | which is the inverse normal
distribution probability function, and the
| | 03:14 | significance level is D8,
and I divide that by 2--
| | 03:19 | again, because the values that I want can be
either plus or minus as compared to the mean.
| | 03:24 | Right parenthesis to close the internal
normsinv function, right parenthesis to
| | 03:30 | close the absolute value function,
press Return and I get a value of 1.96.
| | 03:35 | And as I noted in the previous movie, 1.96
is the number of standard deviations from
| | 03:40 | the mean, plus or minus that
95% of your values will occur.
| | 03:44 | Now we can calculate the standard error.
| | 03:46 | Standard error is calculated using this formula.
| | 03:50 | It's equal the standard deviation, which
is in cell D5, divided by the square root of the
| | 04:00 | number of measurements, and that is
in cell D3. Right parenthesis to close,
| | 04:05 | press Return, and we get
a standard error of 9.88.
| | 04:08 | Now we can calculate the margin of error.
| | 04:12 | The margin of error is the z-score
multiplied by the standard error.
| | 04:16 | In other words the standard error
is the amount of error expected per
| | 04:20 | standard deviation.
| | 04:21 | Our z-score is 1.96, so we
don't have one standard deviation.
| | 04:26 | We need to multiply those two values together.
| | 04:29 | So we have =D9*D11, press Return, and
we get our margin of error of 19.37.
| | 04:40 | Now we can add or subtract the margin
of error from the point estimate, which is
| | 04:44 | the sample mean, and we
find the mean in cell D4,
| | 04:48 | so it's =D4. Press Return, 135.14,
and now we can calculate the lower limit,
| | 04:55 | which is =D14 minus the margin of
error, which is in D12--press Return--and the
| | 05:03 | upper limit, which is =D14+D12, Return,
and we see that the range of our values,
| | 05:12 | with 95% confidence, will fall within
the range of 115.77 and 154.5. Within the
| | 05:20 | business case, that means with 95%
confidence that in order from the customer
| | 05:25 | who lives in New York will be
between $115.77 and $154.50.
| | 05:33 | You can use this information to make
judgments about your customers, perhaps
| | 05:36 | offering the special deals, or if this
range is higher than the average amount of
| | 05:40 | orders from customers and other states,
you can pay more attention to them.
| | Collapse this transcript |
| Calculating a confidence interval for a small sample using t-tests| 00:00 | In the previous movie, I showed
you how to calculate a confidence interval
| | 00:03 | for a population when you
collected more than 30 sample values.
| | 00:06 | In this movie, I'll show you how to
perform the same type of calculation for a
| | 00:10 | small sample, which has, by
definition, less than 30 values.
| | 00:14 | Because you're working with a small
sample size, you will use the t distribution
| | 00:17 | to analyze the values.
| | 00:19 | One of the simplifying assumptions
that you make when you work with a small
| | 00:22 | sample is that your data
has a normal distribution.
| | 00:26 | So the first thing we need to do is
calculate the number of measurements, and
| | 00:30 | that's here in this table.
| | 00:31 | So in cell D3, I will type "=COUNT(",
and then I'll click the
| | 00:39 | column header for the table column,
move the mouse pointer over the column
| | 00:43 | header, click it, Orders Totals,
looks good, right parenthesis to close the
| | 00:48 | formula, and Return, and we
see that we have 12 values.
| | 00:51 | Now we can calculate the mean.
| | 00:53 | For that its =AVERAGE, and then select
the same table column, Orders Totals,
| | 01:02 | right parenthesis, Return,
and there is the average.
| | 01:07 | Finally, we can calculate the
standard deviation, and we create exactly the
| | 01:10 | same type of formula, except we use the
STDEV function, standard deviation for a sample.
| | 01:16 | Left parenthesis and once again select
the table column, right parenthesis and
| | 01:22 | Return, and we see our
standard deviation of 65.98.
| | 01:27 | Next, we can type in our confidence Level, and
we want to be 95% certain, so I'll type .95.
| | 01:35 | Our significance level is
one minus confidence level,
| | 01:39 | so we have =1-D7, and Return.
| | 01:44 | And next--and this is the measure
that we didn't have for the large sample,
| | 01:47 | but when we work with t-scores, we
have to work with degrees of freedom.
| | 01:52 | And degrees of freedom is the number of
measurements minus 1. So we have =D3-1.
| | 02:01 | Finally, we can calculate our target t-score
for 95% confidence level with 11 degrees of freedom.
| | 02:08 | To do that, type = and use the t
inverse function, which returns the inverse of
| | 02:14 | the t distribution, left parenthesis,
D8, which is our significance level,
| | 02:18 | divided by 2, and we are dividing by 2
because we can have a value that's either
| | 02:24 | higher or lower than the mean.
| | 02:26 | If it were only higher or only
lower, then we would not divide by 2.
| | 02:29 | But we do, so we need to divide the
significance by 2, comma, and degrees of
| | 02:35 | freedom, which are found in cell D9.
Right parenthesis to close the formula,
| | 02:40 | Return, and we see that
our target t-score is 2.59.
| | 02:45 | Now we can calculate our standard error.
| | 02:47 | And standard error is calculated using
this formula: = standard deviation, which
| | 02:52 | is D5, divided by the square root of
the number of measurements, and that is
| | 03:00 | found in cell D3. Right parenthesis
and press Return to accept the formula.
| | 03:08 | Now that we know these values, we can
calculate our margin of error, which is the
| | 03:11 | t-score, or the number of standard
deviations from the mean that a particular
| | 03:15 | confidence level is,
multiplied by the standard error.
| | 03:19 | So that is =D10*D12. Press Return
and we get our margin of error of 49.39.
| | 03:29 | Now we can use our point estimate,
which is the mean--and that is =D4--to
| | 03:36 | calculate our upper and lower limits.
| | 03:39 | To do that, the lower limit is the
point estimate, which is in D15, minus the
| | 03:44 | margin of error, which is in D13, and the
upper limit, which is the point estimate
| | 03:51 | in D15, plus the margin
of error, which is in D13.
| | 03:58 | Press Tab and you see that we have
a lower limit of 105.78 and 204.56.
| | 04:04 | Because we are working with fewer
than 30 sample values, the estimate of
| | 04:08 | the range we're 95% certain the population
mean falls within must be more conservative.
| | 04:13 | In the large sample case, the spread
between the upper and lower limits was about $39.
| | 04:18 | In the small sample case, which is
based on the same overall data set, the
| | 04:22 | spread is almost $100.
| | Collapse this transcript |
| Calculating a confidence interval given a population proportion| 00:00 | Unless you're a large, well-funded
organization that can afford a staff to
| | 00:03 | conduct surveys for you, you'll often
have to work with the data you and your
| | 00:07 | existing staff can pull together.
| | 00:09 | One way you can analyze your data is
to calculate a data range, or a confidence
| | 00:13 | interval, that a given value falls
within at a given level of certainty.
| | 00:17 | In this movie, I'll show you how to
calculate a confidence interval at the
| | 00:20 | 95% confidence level.
| | 00:22 | The data we are working with
is a series of orders by state.
| | 00:26 | And instead of working with the
numerical values, I've listed the states from
| | 00:29 | which the orders come.
| | 00:30 | We are interested in analyzing the
proportion of results of orders that
| | 00:34 | come from the State of New York,
so that's why we are working with a
| | 00:38 | population proportion.
| | 00:40 | First, we need to find the
number of measurements in our table.
| | 00:44 | To do that, we create a
COUNTA formula. So it's =COUNTA.
| | 00:50 | COUNTA counts the number of
cells in a range that are not empty.
| | 00:52 | If we used COUNT, it would look for
numbers only, and it would return 0.
| | 00:57 | Type a left parenthesis, and then I
will select the column in the table, move
| | 01:02 | the mouse pointer over the column
header, and when it turns to a downward-
| | 01:06 | pointing black arrow, click,
and we get the table reference.
| | 01:11 | I will type a right parenthesis to
accept it, Return, and we see that we have
| | 01:14 | 372 total measurements.
| | 01:16 | Our target state is New York, but what
we want is the number of occurrences of
| | 01:22 | New York within that table.
| | 01:24 | To do that, we use a COUNTIF formula.
| | 01:26 | So it's =COUNTIF( and then the range
we're examining, and that is our table
| | 01:34 | column again, so I will select it.
And next, we need to know the state that we
| | 01:39 | are looking for, and that is
in cell D4. It's New York.
| | 01:44 | Type in a right parenthesis and
press Return, and we see that we have an
| | 01:48 | occurrence of New York of 259, so there
are 259 New Yorks within that table column.
| | 01:54 | Now we can calculate the proportion,
and that is simply the number of
| | 01:59 | occurrences of New York, which are
in D5, divided by the total number
| | 02:04 | of measurements in D3.7. I had
already filled in our confidence level
| | 02:09 | and significance level.
| | 02:10 | The confidence level is 0.95. That means
we want to be 95% confident in our results.
| | 02:16 | The significance level is 0.05, and that is
simply one minus the confidence level.
| | 02:22 | Now we can calculate our z-score, and
the z-score is the number of standard
| | 02:27 | deviations from the mean, plus or minus
that a certain number of values will occur.
| | 02:31 | So in other words, we need to find the
z-score within which 95% of our values will occur.
| | 02:37 | To do that, type equal, absolute value value, because
| | 02:41 | I don't want to work with negatives. left parenthesis,
n-o-r-m-s-i-n-v. This is the normal inverse
| | 02:49 | distribution, (1-D9/2. And we divide it by
2 because we can be plus or minus the mean.
| | 03:01 | So significance can be
greater than or less than.
| | 03:03 | If we were only interested in less
than, we wouldn't divide it by 2.
| | 03:07 | Type a right parenthesis to
close the internal formula--
| | 03:10 | that's the normsinv function--
| | 03:14 | another right parenthesis to close the
absolute value, Return, and we get a z-
| | 03:18 | score of 1.96, which we know from past
experience is the number of standard
| | 03:23 | deviations for the 95% confidence level.
| | 03:26 | Now we can calculate our standard error,
which is the amount of error we would
| | 03:30 | expect from one standard deviation.
| | 03:32 | To do that, we type Equal, and
there we are taking a square root of the
| | 03:36 | number of measurements, which is in cell D3,
divided by the D6--the sample proportion--times (1-D6.
| | 03:43 | So in other words, we are
multiplying the sample proportion by one minus
| | 03:52 | the sample proportion.
| | 03:54 | So 0.7 by 0.3. Now we can add a
right parenthesis to close the formula.
| | 03:59 | Everything looks right.
| | 04:00 | And we have our standard error of 12.74.
| | 04:04 | Now we can calculate our margin of
error, which is the z-score, the number of
| | 04:08 | standard deviations away from the mean
that our data will be if it's within the
| | 04:11 | 95% confidence level.
| | 04:13 | So we multiply =D10*D12, Return,
and we get our margin of error, 24.97.
| | 04:25 | Now we can have our upper and lower
limit based on the number of measurements,
| | 04:28 | in other words the number of orders
from New York, that we would expect from the
| | 04:32 | entire population based on our sample.
| | 04:34 | So first we have our point estimate,
which is the occurrences of New York, and
| | 04:38 | that is in cell D5, and our lower limit
we'll be entering into cell D16, and that
| | 04:46 | is the point estimate minus the margin of error.
| | 04:49 | So it's =D15-D13, Return, and then our
upper limit is the point estimate plus
| | 04:59 | the margin of error.
| | 05:00 | So it's =D15, the point estimate + D13,
the margin of error. Press Tab. And we
| | 05:08 | see that based on our proportion, we can
be 95% certain that we will have between
| | 05:13 | 234 and about 284 orders
from customers in New York.
| | 05:18 | You can use this type of analysis to
make judgments about your customer base.
| | 05:22 | Because about 70% of your orders come
from New York, you can decide to offer
| | 05:26 | more incentives to customers in other
states, or you can try to develop your
| | 05:30 | existing customers by offering deals
to get them to purchase more products.
| | Collapse this transcript |
| Calculating sample size required for a given confidence level| 00:00 | Political polling often seems like a dark art.
| | 00:03 | When a news organization reports that a
poll shows candidate A leads candidate
| | 00:06 | B by 8% with a 3% margin of error--
at the usually unstated 95% confidence
| | 00:13 | level--you might wonder how many
potential voters they need to survey to get
| | 00:16 | that level of accuracy.
| | 00:18 | In this movie, I'll use a business example
to show you how to calculate the sample
| | 00:22 | size required to reach a given confidence level.
| | 00:25 | To calculate the sample size,
you need the following values:
| | 00:28 | the confidence level--and in this
case we want to be 95% confident--
| | 00:32 | the significance level, which is one
minus the confidence level or in this case
| | 00:36 | 5%, your z-score, margin of error,
sample proportion and sample size.
| | 00:43 | To calculate the z-score, which is the
number of standard deviations from the
| | 00:46 | mean, we use the following formula.
| | 00:48 | It's =normsinv, which is the
standard normal curve inverted, so we are
| | 00:54 | calculating a z-score. Instead of using a
z-score to calculate a value, left parenthesis, and it's one
| | 01:01 | minus the significance level, which is in C4, divided 2.
| | 01:06 | And we divide by 2 because we
want to take the value plus or minus.
| | 01:11 | In other words, if it's from the mean,
it can be higher or it can be lower.
| | 01:15 | We don't care which direction.
| | 01:17 | If we only cared about higher or
lower, then we would not divide the
| | 01:21 | significance level by 2.
| | 01:22 | But this is a two-tailed test, so we are
dividing by 2. Type a right parenthesis
| | 01:27 | and press Enter, and we see the z-score
of 1.96, which from previous experience
| | 01:32 | we know is the proper z-score
for a confidence level of 95%.
| | 01:37 | Now we can enter our margin of error.
| | 01:39 | In other movies, we've calculated the
margin of error, but in this case we
| | 01:42 | want to specify it, because we're
going to use that when we calculate the
| | 01:46 | required sample size.
| | 01:48 | So our margin of error will be
0.03, which is equivalent to 3%.
| | 01:54 | Next we are going to
calculate our sample proportion.
| | 01:56 | And this term isn't exactly obvious, so I'm
going to give it a little bit of explanation.
| | 02:02 | In the next formula, for sample size,
we're going to multiply the sample
| | 02:07 | proportion by one minus the sample proportion.
| | 02:10 | So I will just work here in a separate cell.
| | 02:12 | So let's say that we have the
value of 0.7 for a sample proportion.
| | 02:18 | The sample size formula, which I will do
separately, would multiply, so I'll do =E8*1-E8.
| | 02:30 | So that would be 0.21 because you
are multiplying 0.7 by 0.3, which is 1-E8.
| | 02:37 | So the question is how does that
sample size calculation formula use the
| | 02:43 | sample proportion value?
| | 02:44 | Well, first off, it multiplies it by
one minus itself. And if you want to make
| | 02:50 | the most conservative estimate possible--
| | 02:52 | so, for example, in the case of
political candidates, if you believed that
| | 02:56 | 70% of voters would vote for candidate A,
then you would choose a sample proportion of 0.7--
| | 03:04 | if you want to make the most
conservative estimate possible, you should assume
| | 03:08 | that your sample proportion is 0.5,
| | 03:10 | in other words, that your
population is distributed evenly between the
| | 03:15 | two possible options.
| | 03:16 | And I'll show you why that's the case.
| | 03:18 | If we change the sample proportion to
0.5 and press Tab, you'll see that
| | 03:24 | the value in cell F8 is 0.25, and that is
the largest possible value that you can have.
| | 03:31 | So, for example, if I change the
population proportion to 0.52--still very
| | 03:36 | close to half--but when I press Tab to
accept the value, we see that the sample
| | 03:40 | proportion multiplied by one minus the
sample proportion is less than it would
| | 03:45 | be if the sample proportion were still one half.
| | 03:47 | So to make the most conservative
estimate we can, we will make the population
| | 03:51 | proportion one half. So type in .5.
| | 03:56 | Now we can enter our sample size formula.
| | 03:59 | And this formula is =C5, which is the
z-score, squared, so caret or capital6
| | 04:08 | 2*C8, which is the sample
proportion, times, left parenthesis, 1-C8.
| | 04:18 | Again, we are multiplying the sample
proportion by one minus the sample proportion.
| | 04:23 | And to be as conservative as possible, we
need to make that value as large as possible,
| | 04:27 | so we select 0.5. Then we divide this
entire term by the margin of error, which
| | 04:33 | is in cell C7, squared.
| | 04:36 | So caret, or capital 6, 2.
| | 04:39 | With that formula in
place, you can press Return,
| | 04:42 | and we see the sample size
required of 1,067 persons.
| | 04:46 | Now let's see what happens if we change
our confidence levels and significance
| | 04:50 | levels and also the required margin of error.
| | 04:52 | So for the confidence level of 99%, if I'll
just change this value in C3 to 0.99, we
| | 05:00 | see that we need a sample size of 1,843 people.
| | 05:04 | If we want to change our margin of error
to 2%, which is 0.02, then in cell C7 we
| | 05:10 | would type 0.02 and press Return.
| | 05:13 | And you see that the sample
size goes up quite significantly.
| | 05:16 | Now you might be wondering, why
haven't I talked about populations?
| | 05:19 | In other words, you know, maybe 100 million
or 150 million voters in the United States,
| | 05:24 | why isn't that figure being
used in any of these equations?
| | 05:28 | And that's because it doesn't matter.
| | 05:30 | Once your sample size gets large enough,
such as in this case where you have
| | 05:33 | over 4,000 individuals, then with the
certain level of confidence--either 95% or
| | 05:37 | 99%--you will know that your survey
will reflect your population accurately.
| | Collapse this transcript |
|
|
7. Testing HypothesesDefining hypotheses and identifying errors| 00:00 | When you analyze your data using
statistics, you should know what aspect of your
| | 00:04 | data you want to test.
| | 00:05 | The statement that embodies your
test is called your hypothesis.
| | 00:09 | A hypothesis is an educated guess about
the characteristics of a data set and
| | 00:13 | the circumstance it describes.
| | 00:15 | For example, you could create a
hypothesis that says the state a customer lives
| | 00:20 | is related to the amount of olive
oil they order from your company.
| | 00:23 | When you design a statistical analysis,
the first thing you should do is state
| | 00:27 | your null hypothesis, which says that the
fact you're analyzing has no effect on the data.
| | 00:33 | For example, you could say there will
be no difference in the average order
| | 00:36 | amount based on the state the customer lives in.
| | 00:40 | So, if a customer lived in New York,
then they would not on average order more
| | 00:45 | than customers that live in Connecticut.
| | 00:47 | If you're reading about
statistics, you might see the null
| | 00:50 | hypothesis represented as H subzero.
| | 00:53 | In this case, the zero means that the
variable upon which you have formed your
| | 00:57 | hypothesis has zero effect.
| | 01:01 | Next, you create an alternative
hypothesis, which states that there is a
| | 01:05 | relationship between two variables.
| | 01:07 | For example, you could state that the
average order of customers from New York
| | 01:11 | cost more than the average
order of customers from Connecticut.
| | 01:14 | The alternative hypothesis is
represented as H sub A, and in this case, the A
| | 01:20 | stands for alternative.
| | 01:23 | This alternative hypothesis is
what's called a directional alternative
| | 01:27 | hypothesis in that it specifies that
the average values of order from New York
| | 01:31 | customers is greater than the
average value of orders from Connecticut.
| | 01:34 | A non-directional alternative
hypothesis would state that the average order
| | 01:39 | of customers from New York is different from
the average order of customers from Connecticut.
| | 01:44 | That means under that alternative
hypothesis, the New York orders could be
| | 01:48 | greater than or less than orders from
Connecticut, and the non-directional
| | 01:52 | alternative hypothesis would still be true.
| | 01:54 | So how do you create a
good alternative hypothesis?
| | 01:57 | Well, there are number of steps that
you can go through, and if your alternative
| | 02:01 | hypothesis meets all four of these tests,
then you've probably created one that
| | 02:06 | can be tested accurately.
| | 02:08 | First, you need to state explicitly
that there is an expected relationship
| | 02:13 | between two variables.
| | 02:15 | Second, you need to base your
alternative hypothesis on your knowledge of
| | 02:19 | the world, and this is from looking at the
statistics that you have in your possession.
| | 02:24 | So, for example, if you know that the
average order from New York customers is
| | 02:28 | 150 and the average order for
customers from New Jersey is 125, there is the
| | 02:32 | possibility of a difference that
you can investigate a little bit more.
| | 02:37 | Next, you want to express the
hypothesis simply and briefly so it's easily
| | 02:42 | understood, and also make sure
that you are only testing one thing;
| | 02:46 | you don't want to have "and" or
"or" anything like that in your
| | 02:52 | alternative hypothesis.
| | 02:53 | It should be stated very simply
one fact that you are testing for.
| | 02:58 | And finally, make your
alternative hypothesis testable.
| | 03:01 | If you don't have any data that
indicates the amount that New York customer
| | 03:05 | spend or the amount that New Jersey
customer spend, then you can't really
| | 03:09 | test that hypothesis.
| | 03:10 | So look at your data, and from that and
your knowledge of your business and its
| | 03:14 | operations, decide what it
is that you want to test for.
| | 03:18 | There are a couple of errors that you
can have in hypothesis testing, and those
| | 03:22 | two errors are called
TYPE1 errors and TYPE2 errors.
| | 03:27 | A TYPE1 error is a false positive,
and that is when you reject a null
| | 03:32 | hypothesis that is true.
| | 03:34 | So let's assume, for example, the
New York customers actually do spend more
| | 03:37 | than customers from New Jersey when
they purchase items from your store.
| | 03:41 | It's possible that the data that you
collect causes you to reject the null
| | 03:45 | hypothesis that New York customers do not
spend more money per order when they actually do.
| | 03:51 | So that's called a false positive.
| | 03:53 | And remember, anytime you collect data
and perform a statistical analysis using
| | 03:57 | inferential techniques, you have
a confidence level such as 95%.
| | 04:02 | That means even if your experiment is
designed properly, that about 5% of the
| | 04:07 | time, you will generate an error.
| | 04:10 | If the error is a false positive, you
reject a null hypothesis that is true.
| | 04:14 | The second type of error, which is a
false negative, says that you fail to reject
| | 04:19 | a null hypothesis that is false.
| | 04:21 | So in other words, if your null
hypothesis is that New York customers don't
| | 04:25 | spend more per order, and they
actually do, but your examination of your data
| | 04:29 | does not lead you to that conclusion,
then you have failed to reject the null
| | 04:33 | hypothesis, and you have a TYPE2 error.
| | 04:37 | For a more common example and one that
even if you don't work with statistics you
| | 04:40 | might have encountered, think about
a jury trial in the United States.
| | 04:44 | In the U.S., the null hypothesis
is that a defendant is innocent.
| | 04:49 | So a TYPE1 error is that the
innocent person is convicted.
| | 04:53 | A TYPE2 error is that the
guilty person is set free.
| | 04:57 | Following the four rules of creating a
good alternative hypothesis will help cut
| | 05:01 | down on errors, but always remember
that there is a possibility of error when
| | 05:05 | you deal with inferential statistics.
| | 05:07 | Anytime you work with data from less
than an entire population, there is some
| | 05:11 | danger that you will
reach an incorrect conclusion.
| | Collapse this transcript |
| Performing one-tailed tests on a large sample's mean| 00:00 | In this movie, we will use a one-tailed
test to examine a sample's average value.
| | 00:06 | A one-tailed test analyzes data to
see if it deviates significantly from a
| | 00:09 | target value in one direction: either
less than or greater than, but not both.
| | 00:15 | The techniques I am going to show you in
this movie assume that we are working with
| | 00:18 | a large sample, so the data set
must contain at least 30 values.
| | 00:22 | Now the goal is to test the hypothesis
that a discount coupon mailed to 30 sets
| | 00:27 | of 200 randomly selected customers
within a 5 mile radius of the Two Trees store
| | 00:32 | will bring in 18 new customers
on average per mailing group.
| | 00:37 | So my target value, which I
have here in cell D6, is 18.
| | 00:42 | So that's my target value, and
we will use that in a moment.
| | 00:45 | But first, I need to fill in the rest
of the values in the worksheet, so that we
| | 00:49 | can create our formulas.
| | 00:51 | So first we need to calculate our
sample size. That's in cell D2.
| | 00:57 | So I'll type equal, and we are going
to count the number of measures that
| | 01:02 | we have in our table.
| | 01:04 | So I will select the table column by
moving the mouse pointer over the table
| | 01:08 | column header, and when the mouse
pointer is a downward-pointing black
| | 01:11 | arrow, click the left mouse button to select it,
and you see that it's added to the formula.
| | 01:18 | When I type a right parenthesis and Return,
Excel counts my sample size, which is 30.
| | 01:23 | Now I can calculate the mean, or the average.
| | 01:26 | To do that, it's =AVERAGE(, and I select
the table column again, and then a right
| | 01:33 | parenthesis and Return, and that's the mean.
| | 01:37 | Now I can calculate the standard deviation,
and I create exactly the same type of
| | 01:40 | formula: stdev, because I am working
with the sample instead of the entire
| | 01:45 | population. Left parenthesis
and select the table column again.
| | 01:50 | There it is. Right parenthesis, Return,
and there is my standard deviation.
| | 01:55 | I already have my target value of 18,
| | 01:58 | so I can move on to
computing the standard error.
| | 02:02 | In this case, the standard error is the
standard deviation divided by the square
| | 02:07 | root of the sample size.
| | 02:09 | So I have those values in my worksheet already,
| | 02:11 | so it's =D4 divided by the square
root--that's sqrt--left parenthesis, of
| | 02:20 | the sample size, which is in cell D2.
Right parenthesis. Everything looks good,
| | 02:25 | so my standard error is 1.9885.
| | 02:29 | Now, I can calculate my z-score.
| | 02:31 | And when I have this particular set of
statistics, I calculate my z-score by
| | 02:37 | typing =(, and then you have the mean.
| | 02:42 | You subtract the target value, which
is D6, right parenthesis, and you divide that by your
| | 02:50 | standard error, which is in cell D8.
Press Return and we get a z-score of -1.9277.
| | 03:00 | Now we can calculate the
probability using that z-score.
| | 03:03 | Remember, z-score is the number of
standard deviations we are from a mean.
| | 03:07 | So we have the
probability is =1-normsdist--
| | 03:15 | that's the standard normal distribution--
left parenthesis, and then the z-score,
| | 03:20 | which is in cell D9. Right parenthesis,
everything looks good, Return, and we
| | 03:26 | have our probability of 0.9731.
| | 03:30 | Now note that that probability is
extremely high, which indicates that we should
| | 03:34 | not reject the null hypothesis that
the coupon mailing would bring in 18 new
| | 03:39 | customers for mailing group.
| | 03:40 | And you can see, based on the mean of
the number of new customers, that a mean
| | 03:45 | of 14.167 isn't anywhere close to 18,
| | 03:49 | so the probability that we have here
makes sense when it comes to failing to
| | 03:54 | reject the null hypothesis.
| | Collapse this transcript |
| Performing two-tailed tests on a large sample's mean| 00:00 | In the previous movie, I showed you how
to perform a one-tailed, or directional,
| | 00:04 | test on a large sample data set.
| | 00:07 | In this movie, I'll show you the
same technique, using a two-tailed test.
| | 00:12 | You use the two-tailed test to check
if the average of a sample is within a
| | 00:16 | given number of standard deviations
of the target mean in either direction.
| | 00:20 | For example, if you sell 12-ounce
bottles of olive oil, you want the average
| | 00:24 | amount of olive oil to be very
close to the advertised volume.
| | 00:28 | As with one-tailed tests, you use the
sample size, mean, and standard deviation,
| | 00:33 | and target value to
compute your other statistics.
| | 00:36 | So let's get those into the worksheet now.
| | 00:39 | For the sample size, we need to count
the number of measures in our table.
| | 00:43 | So I will type "=COUNT", and this is in
cell D4, left parenthesis, and then the
| | 00:51 | values. And I'm going to select the
table column by hovering the mouse pointer
| | 00:56 | over the column header.
| | 00:58 | When the mouse pointer changes to a
downward-pointing black arrow, click
| | 01:01 | for left mouse button, and I get the table
name of Fill, and the column name of Ounces.
| | 01:07 | Right parenthesis to close the formula,
Return, and we got a sample size of 30.
| | 01:13 | Now we calculate the mean, or the average.
| | 01:15 | So for that it's =average( and select the
table column again, Fill, Ounces, right
| | 01:23 | parenthesis, Return, and we
see an average of 12.1367.
| | 01:29 | Now the standard deviation, same type of
formula stdev, and we're working with the sample,
| | 01:35 | so we'll use this one as
opposed to the population version.
| | 01:39 | Left parenthesis, and select the table
column again, right parenthesis, Return,
| | 01:44 | and we have a standard deviation of 0.3.
| | 01:48 | I filled in our target value of 12
ounces earlier, so now I can go on to
| | 01:52 | computing the standard error.
| | 01:56 | To calculate my standard error,
I divide the standard deviation,
| | 02:00 | so =D6, and divide that by the square
root of the sample size, which is in cell D4.
| | 02:08 | So it's SQRT, the square root function,
left parenthesis, D4, right parenthesis
| | 02:15 | to finish the formula, Return,
and there is my standard error.
| | 02:20 | Now to calculate the z-score, I use
this formula, and that is equal =(D5, which
| | 02:26 | is the mean, minus D8, which is the
target value, right parenthesis. And we
| | 02:34 | divide that result by the standard
error, which is D10. Press Return and we
| | 02:40 | get a z-score of 2.462.
| | 02:41 | Now that's a fairly large z-score,
| | 02:44 | so that tells us that it's likely that
this mean is significantly higher than
| | 02:49 | it should be if our target value is 12 ounces.
| | 02:53 | To continue our analysis, let's
take a look at our lower tail.
| | 02:57 | The lower tail will give us the
percentage of values under the normal curve that
| | 03:01 | will be less than this z-score,
| | 03:03 | the percentage of the curve
that is to the left of the score.
| | 03:07 | So we type "=normsdist", which returns a
standard normal distribution, and that
| | 03:14 | takes one argument, our z-score,
which is in D11. Right parenthesis and Return.
| | 03:20 | We see that 99% of the values under the
standard normal curve are to the left of our z-score.
| | 03:27 | Now to calculate the upper tail, we
subtract the lower tail value from 1.
| | 03:32 | So that is =1-D13. And that value corresponds;
| | 03:40 | it's a less than 1% of the values
above that z-score under the normal curve.
| | 03:45 | Now we multiply the smaller of those
two values by 2 to find the probability
| | 03:49 | that the value we calculated
is in the tails of the curve,
| | 03:53 | in other words the rejection area
outside of where we want the mean to be.
| | 03:58 | So, to find the lower of the two
values, we type "=MIN(D13:D14)", and then we
| | 04:10 | multiply that value by 2.
| | 04:12 | Press Return, and we get
our answer, so about 1.3%.
| | 04:18 | So, how do we interpret these results?
| | 04:19 | Well, the Two-Tail value is
less than .05, so we can't,
| | 04:23 | with 95% confidence, reject the null
hypothesis that the mean fill level for the
| | 04:28 | bottles average is 12 ounces.
| | 04:31 | What we can do is use the values in
the worksheet to create a confidence
| | 04:34 | interval, which will verify that the
target mean of 12 ounces is outside the
| | 04:38 | range we calculated.
| | 04:40 | The formula we use multiplies the
z-score by the standard error, and then adds
| | 04:45 | and subtracts those values from the mean.
| | 04:47 | We're going to use the z-score that
corresponds to the 95% confidence level, and
| | 04:52 | that number is 1.96.
| | 04:53 | So, to create the upper bound, we add
the mean, which is in cell D5, +1.96 times
| | 05:05 | the value in cell D10, which is
the standard error, and press Return.
| | 05:09 | So our upper bound is 12.2456.
| | 05:12 | Now we can do the same thing to
calculate the lower bound, except that we've
| | 05:16 | subtract the z-score times the
standard error from the mean.
| | 05:21 | So for this, we type "=D5-1.96",
which is the z-score for 95% confidence
| | 05:30 | multiplied by the standard
error, which is cell D10.
| | 05:34 | Press Tab and we get a lower bound of 12.0279.
| | 05:39 | What this means, with 95% confidence, is
that we are filling our bottles too full,
| | 05:43 | and we need to adjust their processes.
| | Collapse this transcript |
| Testing a small sample's mean| 00:00 | When you need to analyze data based on a
small data sample--which means one with
| | 00:04 | less than 30 data points--you need to
use the t distribution to test whether the
| | 00:08 | sample data provides convincing evidence
to support your alternative hypothesis.
| | 00:13 | When you test the small sample's mean
for significance, you're comparing the
| | 00:17 | measured value against a target value.
| | 00:19 | In this example, 14 customers
filled out satisfaction surveys with 10
| | 00:23 | questions, and assigned values to 10
aspects of the two trees store, and those
| | 00:28 | values were from one to 10.
| | 00:30 | The answers are translated to a
score from 0% to 100%, and they are
| | 00:34 | recorded here in this table.
| | 00:36 | The first thing we need to do is to
count the number of measurements, find their
| | 00:40 | average, and the standard deviation.
| | 00:42 | For that, we use these formulas.
| | 00:44 | For the measurements, it's =COUNT(, and
then I'm going to select this table column.
| | 00:53 | To do that, I move the mouse pointer
over the table column header, and when the
| | 00:56 | mouse pointer is a downward-pointing
black arrow, I click the left mouse button,
| | 01:02 | and that fills in the reference, which
is Ratings table, Satisfaction column.
| | 01:07 | Right parenthesis to close, Return,
and I see that I have 14 measurements.
| | 01:12 | I'll do the same thing for the average;
| | 01:14 | =average, select the table column again,
right parenthesis to close, and the
| | 01:20 | mean is 0.85, or about 85%.
| | 01:23 | Then the standard deviation, =stdev,
and again, we're working with the sample,
| | 01:29 | not an entire population.
| | 01:31 | Left parenthesis and select the table
column again. Ratings. Satisfaction. Looks
| | 01:36 | good. Right parenthesis to close,
and we have a standard deviation of 0.08.
| | 01:42 | I've already entered the target value,
| | 01:44 | so now I can calculate the standard error,
the t-score and the degrees of freedom.
| | 01:50 | To calculate standard error, we
divide the mean by the square root of the
| | 01:54 | number of measurements.
| | 01:56 | So it's =D5, which is the mean,
divided by the square root--
| | 02:01 | sqrt is that function--left
parenthesis, of the count of the number of
| | 02:06 | measurements, and that is in cell D4.
Right parenthesis to close, and that looks
| | 02:11 | good. Return and we have
a standard error of 0.228.
| | 02:15 | So now with those values, we can
calculate our t-score, which is the number of
| | 02:20 | standard deviations from the
mean that our value occurs.
| | 02:22 | So I'll type "=(", type in the mean which
is in cell D, subtract the target value,
| | 02:31 | which is in D8, right parenthesis, and
then divide those values by the standard
| | 02:37 | error, which is in cell D10.
| | 02:40 | Press Return and we get a
t-score of 0.23776, a very low t-score.
| | 02:46 | Now we need to add in our degrees of freedom.
| | 02:49 | The degrees of freedom are the
number of measurements minus one.
| | 02:53 | So that is =D4-1, and we
have 13 degrees of freedom.
| | 03:00 | You can use the t-score along with
the degrees of freedom to calculate the
| | 03:04 | probability that the mean is
equal to the target value of 80.
| | 03:08 | To do that, I'll type the formula
=tdist(, T11, which is the t-score, the
| | 03:16 | degrees of freedom, which are in cell D12,
comma, and then the number of tails.
| | 03:22 | I want to do a one-tail test,
| | 03:24 | so I'll type 1, Return, and we
have a lower tail of 0.40789.
| | 03:32 | What that means is about 40.7% of the
values under the standard normal curve
| | 03:37 | will be less than, or to the
left of, the value that we computed.
| | 03:42 | The probability of the upper tail, in
other words the probability that the
| | 03:46 | mean is greater than what we observed,
we calculate by subtracting the value
| | 03:51 | we just calculated from 1.
| | 03:52 | So =1-D14. The value is 0.59.
| | 04:00 | In this case, customer
satisfaction is quite high.
| | 04:04 | The two-tail probability is large
enough so that we can reject the null
| | 04:07 | hypothesis that a customer's
satisfaction averages the value of 80.
| | 04:12 | Fortunately, that's because the
value is higher than the target.
| | Collapse this transcript |
| Testing the mean of a population proportion| 00:00 | When you analyze your business's
operations, you'll often concentrate on one
| | 00:03 | value, such as the state
your customers come from.
| | 00:07 | Now let's say that you want 64% of your orders
to come from customers who live in New York.
| | 00:12 | You can use population proportion analysis
techniques to determine if you've met your goals.
| | 00:18 | I've set up a sample worksheet with the
data that you need to use to find this
| | 00:21 | information, and we can start
filling it in with formulas.
| | 00:25 | First, we need to know the total
number of measurements, in other words the
| | 00:29 | number of orders by state,
here in the State column.
| | 00:33 | To do that, we use =COUNTA, and we use
COUNTA to see if a cell has any value in it.
| | 00:42 | If we were to use COUNT, Excel would
only look for numbers, and you wouldn't get
| | 00:45 | the answer you're looking for. It would display 0.
| | 00:47 | Left parenthesis. And then we need to
identify this table column, so I will move
| | 00:53 | the mouse pointer over the table
column header, and when it is a downward-
| | 00:57 | pointing black arrow, click, and Excel
fills in that table column reference.
| | 01:03 | Type a right parenthesis. Everything looks good,
and we get a total number of measurements of 372.
| | 01:08 | I already filled in our
target state, which is New York.
| | 01:13 | Now we need to find the number of
times that New York occurs in that column.
| | 01:17 | To do that, we use a COUNTIF formula.
| | 01:19 | So its =COUNTIF(, and then we are
looking at the range which is this
| | 01:26 | table column here, so I'll just click it
again using the same technique I used earlier.
| | 01:32 | States, State, and then we need to
type a comma and indicate what we we're
| | 01:36 | looking for, and that is the value in D4,
which is the State of New York.
| | 01:40 | Type a right parenthesis, press Return,
and we see that we have 259 occurrences.
| | 01:46 | The sample proportion is the percentage
of sample values that match the target.
| | 01:51 | To find that percentage, we divide
the occurrences of New York by the total
| | 01:54 | number of measurements.
| | 01:55 | So that's =D5/D3. Press Return
and you get a proportion of 0.7.
| | 02:03 | Now as I stated earlier,
our target value is .64.
| | 02:07 | In other words, our goal was to see
whether we can get at least 64% of our
| | 02:12 | customers come from New York State.
| | 02:14 | Now we need to check to see whether our
sample proportion of .7 is statistically
| | 02:19 | significant and greater than .64.
| | 02:22 | So we need to make a few more calculations.
| | 02:24 | The first is we need to
calculate the standard error.
| | 02:28 | We calculate the standard error for
this type of data by using this formula.
| | 02:32 | It's equal, square root, left
parenthesis, of D6, which is the sample proportion,
| | 02:40 | multiplied by (1-D6. So we are
multiplying the sample proportion by one minus
| | 02:49 | the sample proportion, so .7 times .3,
and then we are dividing that value by
| | 02:55 | the total number of
measurements, which is found in cell D3.
| | 02:59 | Type a right parenthesis,
so everything looks right.
| | 03:02 | Press Return and we get
our standard error of .02.
| | 03:06 | Now we can calculate our z-score, and
to do that for this type of data, we use
| | 03:10 | this formula. It's =(D6,
| | 03:14 | which is the sample proportion,
minus D8, which is the target value. Right
| | 03:19 | parenthesis. Divide that by the value
in D10, which is the standard error, and
| | 03:25 | that will tell us our z-score.
Press Return and we get a value of 2.36.
| | 03:31 | Now we can use the normal distribution
functions to see where that z-score falls
| | 03:35 | within the normal curve.
| | 03:37 | We are going to use three separate
measurements, so depending upon the type of
| | 03:41 | analysis you want to perform,
you will see those values here.
| | 03:44 | So for looking for the
percentage of the curve that is below our
| | 03:48 | particular z-score, which is 2.36 standard
deviations above the mean, we use this formula.
| | 03:54 | It's =normasdist(D11). Press Return
and we get a probability of .99.
| | 04:07 | So in other words, that tells us
that this value is quite high.
| | 04:11 | We can do the same thing to calculate
our upper tail percentage, or probability,
| | 04:15 | and that is =1-normsdist(, and our
z-score is in D11. Right parenthesis to close.
| | 04:26 | We get a probability of .01.
| | 04:29 | And finally, we can calculate the
two-tailed probability, which tells us whether
| | 04:32 | the value is higher or lower than the mean.
| | 04:35 | So the probability of that is
=min(D13:D14). So in other words, we are finding the
| | 04:49 | smaller of these two values,
and multiplying that value by 2.
| | 04:53 | Press Tab and we get a probability of .02.
| | 04:57 | These three probabilities let you
apply the test you are interested in,
| | 05:00 | whether the sample proportion is
significantly smaller than the target
| | 05:03 | proportion, larger than the target
proportion or if it is greater than or less
| | 05:08 | than the target proportion.
| | 05:10 | In this case, the sample proportion of
.7 appears to be significantly larger
| | 05:14 | than the target value of .64, so your
efforts to get more orders from New York
| | 05:18 | customers appear to be paying off.
| | Collapse this transcript |
| Estimating the difference between two populations' means| 00:00 | Many businesses have two or more locations,
| | 00:03 | so you can gain a lot of information
about your organization by comparing the stores' results.
| | 00:07 | In this example, we are assuming that
the Two Trees Olive Oil Company has one
| | 00:11 | location in New York and another New Jersey.
| | 00:14 | We have more than 30 sample
sales values for each store,
| | 00:17 | so we can use large sample
techniques to analyze our data.
| | 00:20 | We're going to examine the
difference between the values of the store's
| | 00:23 | average order and to determine the
interval the actual average will fall into
| | 00:28 | with 95% confidence.
| | 00:30 | The first thing we need to do is
discover the number of measurements that we
| | 00:34 | have for New York and New Jersey,
| | 00:36 | in other words the number of sales that are
recorded in this table here on the left side.
| | 00:41 | To do that, we use a COUNT formula.
| | 00:43 | So it's =COUNT(, and we are
looking for New York in this cell.
| | 00:50 | So I will go over to the table and
hover the mouse pointer over the New York
| | 00:53 | column header, and when the mouse
pointer changes to downward-pointing black
| | 00:57 | arrow, left-click, and Excel adds that
table reference to my formula. Type a
| | 01:03 | right parenthesis to close it,
press Return, and we get 44.
| | 01:07 | Now we do the same thing for New Jersey.
| | 01:09 | So it's equal =COUNT( and then select
the New Jersey table column using the same
| | 01:16 | technique I used for New York.
The reference looks correct.
| | 01:19 | Type a right parenthesis, press Return,
and we get the same number of measurements.
| | 01:25 | Now the numbers don't have to be the same;
| | 01:27 | they just both have to be larger than 30.
| | 01:29 | Now we need to calculate the mean and
to do that we use the AVERAGE formula.
| | 01:34 | So it's =AVERAGE(, and again we are
looking for New York, so select the table
| | 01:40 | column, right parenthesis, and press Tab.
| | 01:44 | We get that average.
| | 01:45 | Now we will do the same thing for New Jersey.
| | 01:46 | So it's =AVERAGE(, select the
New Jersey column, right parenthesis, and
| | 01:53 | Return, and now we do the standard
deviation, and that is the STDEV function.
| | 01:59 | And we are working with a
sample instead of population,
| | 02:02 | so we will use STDEV instead of STDEVP.
Left parenthesis, select the New York
| | 02:07 | column, right parenthesis, and Tab,
and we do the same thing for New Jersey:
| | 02:12 | stdev, left parenthesis, and select the
New Jersey column, right parenthesis, and
| | 02:19 | Return, and there we have the values.
| | 02:22 | I mentioned earlier that we are going
to be using a 95% confidence level, so I
| | 02:27 | entered that in here in cell E7, and that
means we have a significance level of .05, or 5%.
| | 02:35 | Your significance level is
1 minus your confidence level.
| | 02:38 | Now we can calculate our target
z-score for the 95% confidence level. And to
| | 02:44 | do that, we use the formula =normsinv and
our probability, which is 1-E8, divided by 2.
| | 02:57 | We divide the significance level by 2
because we are interested in whether our
| | 03:00 | measurements are off plus or minus;
| | 03:03 | in other words, it's a two-tailed test.
| | 03:05 | If we were only interested in whether
the value for one were above the other value
| | 03:10 | or below the other value, but we didn't
want to look in both directions, then we
| | 03:13 | would not divide by 2.
| | 03:15 | But in this case, we are looking at both
above and below, or greater than or less
| | 03:18 | than, so we divide by 2.
| | 03:19 | Type a right parenthesis, Return,
and we get the z-score of 1.95.
| | 03:25 | It's 1.96, which we know from previous
experience to be the 95% confidence level z-score.
| | 03:32 | Now we can calculate our standard
error, and this is the amount of expected
| | 03:37 | error that we would see for each
standard deviation. And the formula we use to
| | 03:41 | calculate that when we are comparing
the differences of two means is equal,
| | 03:48 | square root, left parenthesis, E5,
squared--and to do that we type a caret or
| | 03:55 | capital 6 and a 2--divided by E3, plus F5--
which is the standard deviation for the New
| | 04:06 | Jersey measurements--squared, so that's ^2,
divided by F3 which is the number of measurements.
| | 04:14 | Type a right parenthesis.
Everything looks good.
| | 04:17 | Press Return and we get
our standard error of 12.79.
| | 04:22 | We get the margin of error by
multiplying the standard error by the z-score, and
| | 04:26 | that is =E11*E9. Return.
| | 04:32 | So our margin of error is 25.08.
| | 04:36 | Now we can calculate our point estimate,
which is the difference between the two
| | 04:39 | means here of New York and New
Jersey, which are in cells E4 and F4.
| | 04:45 | So our point estimate is =E4-F4,
Return, and we get that value of 38.045.
| | 04:55 | Finally, we can calculate the lower
limit and the upper limit for the difference
| | 04:59 | between those two averages.
| | 05:01 | The lower limit is the point
estimate minus the margin of error.
| | 05:04 | So that is =E14-E12, Return, and the
upper limit is the point estimate plus the
| | 05:15 | margin of error, so that is =E14+E12.
Press Tab. So we see, with 95% confidence, that
| | 05:26 | the actual difference between the two
means of the average order between New
| | 05:30 | York and New Jersey is
somewhere between $12.96 and $63.12.
| | Collapse this transcript |
| Testing hypotheses related to population means| 00:00 | As the owner of a business, you're always
looking for ways to improve your processes.
| | 00:04 | When you try something new, you should
analyze your results to determine whether
| | 00:07 | the new process is better, the same,
or worse than your current plan.
| | 00:11 | In this movie, I'll show you how to
compare the results of two olive oil
| | 00:14 | purification processes used by Two Trees.
| | 00:18 | Our data is in a table in the
first two columns of the worksheet.
| | 00:21 | These are the measures of
purity for the two processes.
| | 00:24 | Process1 is the existing process,
and Process2 is the new process that the
| | 00:29 | company is testing.
| | 00:31 | Because we're testing a hypothesis,
we need to state a null hypothesis.
| | 00:35 | The null hypothesis says that the
average purity levels for Process1
| | 00:39 | and Process2 are equal.
| | 00:40 | In other words, there is no difference between
the average purity as a result of either process.
| | 00:47 | The data analysis tool we'll use
requires us to calculate the variances of the
| | 00:51 | two data sets, and input the values manually.
| | 00:54 | So we'll create VAR
formulas to calculate those values.
| | 00:58 | So in cell E3, I'll type "=VAR"--
| | 01:03 | and this is variance for a sample,
not an entire population, so we use
| | 01:07 | VAR instead of VARP.
| | 01:08 | Left parenthesis and then I
will select the Process1 column.
| | 01:13 | To do that, I move the mouse pointer
over the column header, and when the mouse
| | 01:17 | pointer changes to a downward-pointing
black arrow, click, and the reference
| | 01:22 | appears within the formula.
| | 01:24 | Type a right parenthesis, press
Return, and we see the variance.
| | 01:27 | Now, I'll do the same thing for Process2:
| | 01:29 | =VAR, left parenthesis, Process2, table
reference is correct, right parenthesis,
| | 01:38 | Return, and we have our values.
| | 01:41 | Now we can use the data
analysis tool that is called z-Test:
| | 01:45 | Two Sample for means.
| | 01:47 | To run that, we go to the Data tab of
the Ribbon and then in the Analysis group
| | 01:52 | click the Data Analysis button.
| | 01:54 | If you don't see the Data Analysis
button or the Analysis group on your Ribbon,
| | 01:58 | then you need to install the Analysis
ToolPak, and I show you how to do that
| | 02:02 | in the movie near the beginning of the course.
| | 02:04 | But now, let's assume it's installed.
Click Data Analysis, and then, at the
| | 02:10 | bottom of the list, click z-Test:
| | 02:13 | Two Sample for means, and click OK to run it.
| | 02:17 | In the dialog box, I can select our
variables. And the variables for Range 1 are
| | 02:23 | in the table column for Process1.
| | 02:25 | So I click the Collapse dialog button,
and select the column, same way I did
| | 02:30 | when I was creating the variance formulas.
| | 02:32 | Click. And cell range appears there,
so I can click the Expand Dialog button.
| | 02:37 | Do the same thing for the variable 2
Range. Select that, the references look good,
| | 02:44 | and expand the dialog.
| | 02:46 | Now I need to enter my hypothesized
mean difference, and if I move the dialog
| | 02:50 | box out of the way of the null
hypothesis, you'll see that I'm hypothesizing
| | 02:55 | that the average purity levels
for Process1 and Process2 are equal.
| | 02:58 | That means the mean difference that
I'm hypothesizing is 0. So I'll type 0.
| | 03:05 | Now we need to enter in the
variance for Variable 1 and Variable 2.
| | 03:10 | You can't use a cell reference for this.
| | 03:12 | I'm not sure why, but the
tool just doesn't allow it.
| | 03:14 | But that's why we created the variance
formulas inside of our worksheet, and we
| | 03:18 | can type in the values.
| | 03:20 | So the Variable 1 variance is 37.4989247.
| | 03:27 | Then press Tab to move to the
Variable 2 box, and that is 35.4494624.
| | 03:35 | We don't have any labels in our selections.
| | 03:37 | We selected A4 through A34, and A3
is where the labels are, so we won't
| | 03:42 | have them this time.
| | 03:44 | Our alpha value is the significance
level that we're looking for, and it is
| | 03:48 | set by default to .05.
| | 03:50 | We can change it, but an alpha, or
significance level, of .05 is equivalent to a
| | 03:55 | confidence level of .95 or 95%.
| | 03:59 | That's the standard measure
used in most statistical analyses.
| | 04:03 | Finally, we can select our output range.
| | 04:06 | I want to put the data on the same worksheet,
| | 04:08 | so I will leave the Output Range
option button selected, click the Collapse
| | 04:13 | Dialog button, and click cell D6.
| | 04:19 | Click the Expand Dialog button.
All my values look correct,
| | 04:22 | so I can click OK, and Excel
adds the data into the worksheet.
| | 04:26 | I will expand column D, so that
you can see all of the values.
| | 04:32 | So as part of this tool, Excel
calculates a number of values.
| | 04:35 | We have the mean for Variable1 and
Variable2, which are Process1 and Process2.
| | 04:40 | We also have the number of
observations, and then down at the bottom we
| | 04:44 | have our statistics.
| | 04:46 | The first statistic is the z-score,
which indicates how far the difference in
| | 04:51 | these two variances are from the mean.
| | 04:54 | In other words, it's asking how
probable is it that the difference between the
| | 04:58 | means is due to chance?
| | 05:00 | We have the z-score of 2.712, which
means that the probability of the difference
| | 05:06 | of the means being completely
due to chance is less than 1%.
| | 05:11 | It's 0.003, which is equal to 0.3 of a percent.
| | 05:16 | By comparison, the critical value for
the z-score is 1.644, and because the z-
| | 05:21 | score that we got, 2.7, is significantly higher,
| | 05:25 | then we have a good idea that
Process 1 and Process 2 do have different
| | 05:30 | means. And as we can see, the mean
value for Process 1 is significantly
| | 05:34 | higher than Process 2,
| | 05:35 | so we should stay with Process 1, and
either work on refining Process 2 or find
| | 05:40 | another process entirely.
| | Collapse this transcript |
| Inferring results from matched samples selected from two populations| 00:00 | If you've ever been to an eye doctor,
you're probably familiar with the
| | 00:03 | comparison sequence, "which is better, A or B?"
| | 00:07 | It is much the same for
businesses. In many cases,
| | 00:09 | you'll need to determine which of
two competing processes produces
| | 00:13 | more desirable results.
| | 00:15 | In this movie, I'll show you
how to analyze two data sets that
| | 00:18 | use matched samples.
| | 00:20 | A matched sample means that
for every sample in list one
| | 00:23 | there is a matching B sample in list two.
| | 00:26 | In this scenario, Two Trees Olive Oil
Company is testing two different processes
| | 00:31 | to purify olive oil.
| | 00:33 | The values represent the purity
level after Process 1 and Process 2
| | 00:37 | were used on oil from the same batch.
| | 00:39 | You can perform this type of analysis,
which is a paired t-test analysis,
| | 00:44 | using the Paired t-Test tool that's
part of the Analysis ToolPak, to
| | 00:48 | determine whether the differences
between two sets of matched pair
| | 00:50 | measurements are significant.
| | 00:53 | To use the Analysis ToolPak, you click
on the Data tab of the Ribbon, and then
| | 00:58 | at the far right edge of the
Ribbon, you click Data Analysis.
| | 01:02 | If you don't see the Data Analysis item
on the Ribbon on the Data tab, then you
| | 01:07 | need to install the Analysis ToolPak,
and I showed you how to do that in a
| | 01:11 | movie early on in the course.
| | 01:12 | But for now, we'll assume it's installed.
| | 01:14 | So we can click Data Analysis, and then
in the Data Analysis dialog box we can
| | 01:21 | scroll down and click t-Test:
| | 01:24 | Paired Two Sample for Means. Click OK
and the dialog box of the same name appears.
| | 01:31 | So the first thing we need to do is
select the ranges that contain our
| | 01:35 | variables. So I'll click the
collapse dialog button. And then rather than
| | 01:40 | selecting this entire table column by
moving the mouse pointer until it changes
| | 01:44 | into a downward-pointing black
arrow, I'm going to select the cells
| | 01:47 | individually, so the mouse pointer
should be what's called a Greek cross, or a
| | 01:52 | white four-way cross, and I will drag
down to select cells B1 through B15.
| | 02:00 | And the reason I'm doing that is
because I want to select the column label, so
| | 02:04 | that it will appear in our results
and make them easier to understand.
| | 02:07 | So I'll click Expand Dialog button, and
we see that Variable 1 Range is correct.
| | 02:11 | Now I will click in Variable 2 Range,
click the Collapse Dialog button next to
| | 02:16 | that field, and select again.
| | 02:18 | The mouse pointer should be the white
four-way cross. Drag down to cell C15,
| | 02:26 | click the expand dialog button, and we see
that we have our correct cell ranges selected.
| | 02:33 | Now the dialog box asks us to
enter a hypothesized mean difference.
| | 02:38 | It is an optional argument, but it helps
with the calculations if you can put one in.
| | 02:43 | And in this case, I am going to
hypothesize that the difference between the two
| | 02:46 | values is .01. And if you look at the
values in the table, and you look at the
| | 02:51 | pairs, you'll see that that's
not an unreasonable assumption.
| | 02:54 | I am going to select the Labels
check box, because the labels are in the
| | 03:00 | cells B1 and C1. And the Alpha, which is the
significance level, I am going to leave at .05.
| | 03:07 | If you subtract your significance level
from 1, then you find your confidence level.
| | 03:11 | So 1-.05 is 95%, which is the usual
confidence level that's used in statistics.
| | 03:18 | I do want to put my results on a new
worksheet, so I will leave that option
| | 03:22 | button selected. Click OK and I get my results.
| | 03:27 | I'm going to reconfigure the
worksheet so that we can read everything.
| | 03:30 | I'll double-click the column margin at
the right of the A column, so that we see
| | 03:37 | every value there and then Method 1 and
Method 2. I'll double-click for column
| | 03:42 | edges to expand those columns as well.
And I'll zoom in a little bit, so that
| | 03:48 | the values are easier to read.
| | 03:52 | Now we can examine our statistics.
| | 03:54 | So, we find the t statistic is 1.821,
and we can compare that to our one-tailed
| | 04:01 | t statistic of 1.77, and the
corresponding probability of .045 or 4.5%.
| | 04:11 | The t statistic is larger than the
critical one-tail statistic, so that means
| | 04:15 | that the value is significant.
| | 04:17 | However, it is very close to the .05
value, or 5%, so you will probably want to
| | 04:23 | take a larger sample to
analyze this data more thoroughly.
| | 04:27 | Because this example used only 14
observations, you should probably run
| | 04:30 | another batch of tests, preferably with
larger sample sizes, to see if there is a
| | 04:34 | significant difference
between the two processes.
| | Collapse this transcript |
| Analyzing variance using an F-test| 00:00 | In this course, I've shown you several
tests to analyze the characteristics of
| | 00:04 | two data sets to see if the
differences are statistically significant.
| | 00:08 | In this movie, I'll show you how to
use the F statistic, which helps you
| | 00:12 | evaluate a hypothesis that there is a
difference between two groups based on the sample data.
| | 00:17 | Say you have a data set with customer
satisfaction percentages from customers that
| | 00:22 | are from Connecticut and from New York.
| | 00:24 | You can use an F-test to see if the
variances of these two data sets are significant.
| | 00:29 | To do that, you click the Data tab on
the Ribbon, and then in the Analysis group
| | 00:34 | at the far right edge, click Data Analysis.
| | 00:38 | If you don't see the Analysis group or
the Data Analysis button on the Data tab,
| | 00:42 | then you need to install
the Analysis ToolPak add-in.
| | 00:45 | I showed you how to do
that early on in the course.
| | 00:48 | So now we click Data Analysis, and in
the list of Analysis tools, click F-Test
| | 00:55 | Two-Sample for Variances, and click OK.
| | 00:59 | When you do, the dialog
box of the same name appears.
| | 01:02 | Now we need to let Excel know which
variable ranges we're going to use,
| | 01:05 | in other words which
worksheet cells contain which values.
| | 01:08 | So I'll click the Collapse Dialog
button next to the Variable 1 Range box and
| | 01:14 | select A3 through A9.
| | 01:17 | I am selecting the label that will
make our output easier to understand.
| | 01:21 | And click the Expand Dialog button.
| | 01:25 | Now I'll do the same thing
for the Variable 2 Range.
| | 01:28 | So, I'll click there, click the
Collapse Dialog button, select B3 through
| | 01:36 | B9 again, including the header,
click the Expand dialog button, and the
| | 01:41 | variable range appears.
| | 01:43 | I will select the Labels check box,
because I did include the labels in my
| | 01:47 | selection, and I'm going to
leave the Alpha level at 0.05.
| | 01:51 | Alpha is the same as the significance level,
and it is one minus the confidence level.
| | 01:57 | So, for example, an alpha level of
0.05 corresponds to a confidence level of
| | 02:03 | 0.95, or 95%, which is the standard
confidence level used throughout most
| | 02:08 | statistical analyses.
| | 02:09 | I am going to put my
outputs on the same worksheet.
| | 02:13 | So I will click Output Range and
then click the Collapse Dialog button
| | 02:17 | and select cell D3.
| | 02:20 | So that will be the top-left
corner of the table that Excel creates.
| | 02:24 | Click the Expand Dialog button.
Everything looks good, so I'll click OK.
| | 02:29 | When I do, my results appear.
| | 02:31 | Now, I'm going to change the size of
the first column, and click here, so that
| | 02:38 | you can read the values more easily.
| | 02:40 | So you can see the tool calculated the
mean, the variance, and the number of
| | 02:45 | observations, and degrees of
freedom for each of these two columns.
| | 02:49 | At the bottom, we have the F statistic.
| | 02:51 | The F statistic is 1.295.
| | 02:55 | The critical one-tail F statistic--
in other words the value that the F
| | 03:00 | statistic would need to exceed to be
statistically significant for 5 degrees
| | 03:05 | of freedom--is 5.05.
| | 03:08 | So that means that the probability
that the variance or the difference in
| | 03:12 | variance between the two
samples is significant is only 39%.
| | 03:18 | That's well below the threshold of
what would be considered significant,
| | 03:21 | so we can say that based on this small
sample, the difference of the variances
| | 03:26 | between customer satisfaction for
customers from Connecticut and New York is not
| | 03:30 | statistically significant.
| | 03:32 | Calculating an F statistic
helps evaluate whether there is a
| | 03:35 | significant difference between
two groups, but the F-test is very
| | 03:39 | sensitive to non-normality.
| | 03:41 | In other words, if you suspect your
data isn't described by the normal
| | 03:44 | distribution, you should not use the F-test.
| | Collapse this transcript |
| Analyzing variance using ANOVA| 00:00 | ANOVA, or Analysis of Variance, uses the F
statistic I demonstrated in the last movie.
| | 00:06 | Unlike the F statistic by itself, you
can use ANOVA to compare the averages of
| | 00:11 | more than two groups.
| | 00:12 | In this movie, I'll show you how to use
Excel's ANOVA tool to analyze your data.
| | 00:16 | ANOVA analysis checks for the
significance of differences in values based on
| | 00:21 | factors such as age, state of
residence, or education level.
| | 00:25 | As an example, as you can see in this
worksheet, let's suppose that you perform a
| | 00:29 | series of customer satisfaction surveys
and group the results by customer type:
| | 00:34 | Restaurant, Reseller, or a Private individual.
| | 00:37 | You can perform an ANOVA test to
determine whether customer satisfaction levels
| | 00:41 | vary significantly among those categories.
| | 00:44 | When you analyze this data, you set a
null hypothesis, which states that the type
| | 00:49 | of customer does not
significantly influence satisfaction.
| | 00:52 | To perform the analysis itself, you go
to the Data tab of the Ribbon and then in
| | 00:58 | the Analysis Group, click Data Analysis.
| | 01:01 | If you don't see the Analysis Group or
the Data Analysis button in the Analysis
| | 01:06 | Group on the Data tab, then you need
to install the Analysis ToolPak add-in.
| | 01:10 | I showed you how to do that in
a movie early on in this course.
| | 01:13 | But now let's assume it's
installed, and click Data Analysis.
| | 01:18 | Then in the dialog box
that appears, click Anova:
| | 01:22 | Single Factor, and click OK.
| | 01:25 | In the dialog box that appears
we can select our Input Range.
| | 01:29 | So I'll click the Collapse dialog
button, and then select cells A3 through C9,
| | 01:36 | and click the Expand dialog button,
and the range we selected appears in the
| | 01:41 | Input Range. And our labels are in the first row,
| | 01:44 | so I will select that check box. And I
selected the labels so that our output
| | 01:49 | will be easier to understand.
| | 01:51 | Now, the Alpha level is 0.05, and alpha
is the same as the significance level,
| | 01:57 | which is the confidence level subtracted from 1.
| | 02:00 | So in other words, an alpha of 0.05 is
the significance level of 0.05, and it's
| | 02:05 | equivalent to a confidence level of 0.95 or 95%.
| | 02:10 | That's the confidence level
used in most statistical analysis.
| | 02:14 | Now, I can select my output options,
and I will use an output range.
| | 02:19 | I'm going to make the
leftmost cell of my output E3.
| | 02:24 | So, I'll select there, click the Expand dialog
box, everything looks good, and I'll click OK.
| | 02:33 | I'll zoom out just a touch, so that you
can see the data, and I'll expand column
| | 02:39 | E, so that you can read the labels.
| | 02:41 | So we have our Anova:
| | 02:42 | Single Factor analysis, and the
factor we're using is customer type.
| | 02:46 | So we have Restaurant, Reseller, and
Private, and you see the different summary
| | 02:49 | statistics for each of those.
| | 02:51 | So, for example, you'll notice that
the reseller's average satisfaction rating
| | 02:56 | is larger than that for
restaurant, or private customers.
| | 02:59 | The only question is is this significant?
| | 03:02 | For that, we can look down at the
ANOVA table and analyze it this way.
| | 03:06 | The bottom section of the results
contains the P-value, which is 0.059.
| | 03:12 | That value is greater than 0.05, so
that means that we can't quite reject
| | 03:17 | the null hypothesis that satisfaction
differs significantly according to customer type.
| | 03:22 | You can also compare the F
statistic to the F critical statistic.
| | 03:27 | So the F statistic is 3.43, and
the critical value for F is 3.68.
| | 03:34 | So the probability of customer type being
significant is very, very close to being true.
| | 03:40 | Based on this analysis and the fact
that you only took six surveys for each
| | 03:45 | customer type, you should probably do
another survey, probably with a larger
| | 03:49 | sample size, in the near future.
| | Collapse this transcript |
|
|
8. Using Linear and Multiple RegressionCreating a linear regression line for an XY (scatter) chart| 00:00 | Anytime you have two data sets that
you think might be related, you can
| | 00:03 | use Excel's Regression tools to analyze
the relationship, and see if it's significant.
| | 00:08 | In this case, for example, let's say
the Two Trees Olive Oil Company wanted
| | 00:12 | to compare the number of articles written about
it in the popular press versus its yearly sales.
| | 00:17 | To do that, you can create an XY
scatter chart and then add a trendline to see
| | 00:22 | if there's a significant
relationship between those two sets of values.
| | 00:26 | To create the chart, you select any
cell in your Excel table, and then on the
| | 00:31 | Insert tab, click Scatter, and then click the
first subtype, which is Scatter with only Markers.
| | 00:42 | When you do, the Excel
creates the XY scatter chart.
| | 00:44 | We have the number of articles along
the bottom and sales volume along the top.
| | 00:51 | Now we can add a trendline to see if
there is a significant relationship between
| | 00:55 | these two sets of values.
| | 00:57 | To do that, you go onto the Layout
contextual tab, click Trendline, and
| | 01:03 | click Linear Trendline.
| | 01:06 | Excel displays the trendline, and you see
that it has a fairly steep, upward slope,
| | 01:10 | and that is a good indicator that there
is a relationship between the two variables.
| | 01:14 | We can go step further and add the
correlation coefficient, which I talked about
| | 01:19 | earlier in this course, for this trendline.
| | 01:21 | So click Trendline, click More Trendline
Options, and then click Display Equation
| | 01:28 | on chart and Display R-squared value on chart.
| | 01:33 | When I click close, you can see
that the values have appeared.
| | 01:36 | You use the equation to draw the
trendline by taking the value of x--and x is
| | 01:42 | the number of articles--multiplying it by
6227.3, and then adding a base value of 55,413.
| | 01:52 | The R-squared value is the correlation
coefficient squared, and that is a measure of
| | 01:57 | the line's goodness of fit.
| | 01:59 | Excel creates the trendline based on a
series of computations that find the line
| | 02:03 | that best fits its data.
| | 02:05 | It's a lengthy process and one
that's ideally suited to computers.
| | 02:08 | In this case, the R-squared
coefficient of 0.9497 is extremely high.
| | 02:15 | Any value over 0.8 is considered to be
strong. Any value over 0.9 is considered
| | 02:20 | to be extremely strong.
| | 02:21 | So it appears that there is a very
high correlation between the number of
| | 02:25 | articles written about Two Trees
Olive Oil Company and their yearly sales.
| | Collapse this transcript |
| Analyzing and interpreting data from the Regression tool| 00:00 | When you add a trendline to a chart,
you have the option to display the line's
| | 00:04 | equation and R-squared value.
| | 00:06 | If you'd like more information,
including the data set's variance and standard
| | 00:10 | deviation, you can display those values
in your workbook by using the Regression
| | 00:14 | tool in the Analysis ToolPak.
| | 00:16 | I've already created a chart that has the
equation and the R-squared value for this dataset.
| | 00:22 | So in this case, a y value is equal to
1.7394 times X, which are the values along here,
| | 00:30 | or the values in the left-hand
Impressions column, and you add a value of .7333
| | 00:36 | to generate the y value.
| | 00:38 | If you want more information, you
can use Excel's Regression tool.
| | 00:42 | To do that, you go on to the Data tab
of the Ribbon and then in the Analysis
| | 00:47 | group, click Data Analysis.
| | 00:50 | If you don't see the Data Analysis
button or the Analysis group on the Data
| | 00:53 | tab of the Ribbon, that means that you might
not have installed the Analysis ToolPak add-in.
| | 00:59 | I show you how to do that in a
movie earlier on in this course.
| | 01:02 | But now let's assume that it has been
installed and click Data Analysis, and then
| | 01:07 | in the dialog box that
appears, click Regression.
| | 01:10 | So I have selected Regression, and click OK.
| | 01:14 | Now in the Regression dialog
box, I can select my values.
| | 01:17 | The Y range, which will represent the
vertical axis in the chart, are the Sales.
| | 01:23 | So I'll click the Collapse Dialog
button and then select cells B1 through B11.
| | 01:29 | Now notice that I am not selecting the
entire table column, because if I do that,
| | 01:34 | such as by clicking the column
header when my mouse pointer is a downward-
| | 01:38 | pointing black arrow, I
would only select the data cells;
| | 01:41 | I wouldn't select the label.
| | 01:43 | I do want the label,
| | 01:43 | so I am going to leave the mouse
pointer as a four-way white cross and drag
| | 01:49 | from B1 through B11.
| | 01:50 | And it appears in the dialog box, so
I'll click the Expand Dialog button.
| | 01:56 | Now I can select the values for the X
range by clicking in the X Range box,
| | 02:01 | clicking the Collapse Dialog button,
and selecting A1 through A11. Click the
| | 02:09 | Expand dialog button, and
now I can set my other options.
| | 02:13 | I did select Labels that will
make our data easier to understand.
| | 02:16 | So I will select the Labels check box.
| | 02:18 | I am going to put this data on a
new worksheet ply or a new worksheet,
| | 02:24 | so I will select that option button.
| | 02:26 | I don't need to select anything else,
| | 02:28 | so I will click OK and have
Excel generate the results.
| | 02:32 | Now the results are here on the new
worksheet. But rather than work with them
| | 02:36 | here, I am going to go to a sheet that
I prepared earlier that has the original
| | 02:40 | data, the chart, and the summary output all
on the same worksheet, and that is on Sheet3.
| | 02:44 | So now we can look at all
the data at the same time.
| | 02:49 | The chart displays the R-squared value as
well as the equation used to generate
| | 02:55 | this line. And you can see those same values
over here in the bottom part of the Output.
| | 03:00 | So, for example, the Impressions
coefficient of 1.739, 39 repeating, is
| | 03:07 | equivalent to the first term in
the equation, which is 1.7394 times X.
| | 03:10 | So in other words, we are multiplying
the impressions by this number here and
| | 03:17 | then adding the coefficient of the
intercept, which is .7333, repeating.
| | 03:24 | The intercept is the value
that this line crosses the Y axis.
| | 03:29 | So in other words, we start at .7333,
repeating, and for every value in X, we
| | 03:35 | multiply X by 1.7394 and go up that much.
| | 03:40 | You might recognize this formula as the
slope intercept formula from Algebra 1.
| | 03:44 | Now the R value, which is the
correlation coefficient, appears here.
| | 03:49 | It's called Multiple R. And again,
the correlation coefficient of .99 is
| | 03:54 | extremely high. The maximum value is 1,
so any value of .99 is extremely high.
| | 04:00 | That means that there is a very strong
correlation between the number of add
| | 04:04 | impressions that an individual has seen and
the number of sales that you make to them.
| | 04:08 | The R-squared value is a
more conservative measure.
| | 04:11 | It is this value squared.
| | 04:13 | So, for example, if you square the
value .5, you get .25, which is extremely low.
| | 04:19 | So an R-squared of .98 is extremely large.
| | 04:23 | By the same token, the standard error of
.79 is extremely low in relation to the
| | 04:29 | values in the Sales column.
| | 04:32 | So that means that we have a very good sense
that the number of impressions does affect sales.
| | 04:39 | I cover calculating and interpreting an
F score, which you can find here, using
| | 04:43 | ANOVA techniques in another movie, but
note that the associated probability is
| | 04:47 | also extremely low--that is the
significance. That indicates that there is very
| | 04:52 | little likelihood that the values'
variances are due solely to chance.
| | Collapse this transcript |
| Estimating and predicting data using the regression formula| 00:00 | When you perform linear regression
analysis in Excel, the program generates an
| | 00:03 | equation that best describes the data.
| | 00:06 | Excel uses that equation in numerous ways.
| | 00:09 | In this movie, I'll show you how to
forecast future values without creating a
| | 00:13 | trendline or a chart.
| | 00:15 | If you want to forecast future values
based on existing values, you can use
| | 00:19 | the FORECAST function.
| | 00:20 | So in cell E2, I am going to type
=FORECAST(, and now I need to input my X, and
| | 00:28 | that would be the year.
| | 00:29 | I am working with yearly sales data, so
I want to know, based on current trends,
| | 00:33 | what would a future year's sales be?
| | 00:35 | So that will be in cell E1.
| | 00:37 | That cell is currently blank, but
I'll fill in a value in a moment.
| | 00:40 | Now we have the known Ys.
| | 00:44 | The Ys are the dependent variable,
and in this case those are the Sales.
| | 00:49 | My Year is my independent
variable. Those just happen.
| | 00:52 | There is no variation.
| | 00:53 | It's always going to be one year later
when it's one year later. But Sales can
| | 00:57 | vary, and they vary by year.
| | 00:59 | So we have E1 and our known Ys are in the
second column, the Sales column of the table.
| | 01:06 | I can select that column by moving
the mouse pointer over the table column
| | 01:10 | header, and when it's a downward-
pointing black arrow, clicking, and you can see
| | 01:14 | I get my table reference.
| | 01:17 | And now we have the known Xs, which are
in the Year column, so I can select that.
| | 01:21 | Type a right parenthesis, press Return,
and we get a very small value, and
| | 01:27 | that's because there is no
year, so it's using a zero.
| | 01:29 | Now, let's say that I want to discover
the potential sales for the year 2011.
| | 01:33 | If I type in 2011 and press Return,
then we get a value of 480,285.7, which just
| | 01:41 | on visual inspection is consistent
with the values here in the worksheet.
| | 01:47 | If I change the year to 2012, then we
get a value of 525,000, which again is
| | 01:54 | consistent with our
experience and our existing data.
| | 01:56 | Now there is one other way to use
linear regression in Excel, and that is by
| | 02:02 | extending a data series.
| | 02:03 | You might remember from your other
experiences with Excel that if you type in
| | 02:08 | two values in a worksheet cells, select
them and then grab the fill handle at
| | 02:14 | the bottom and drag down, Excel
extends the series, so 1, 2, 3, 4, 5, 6, 7,
| | 02:18 | 8, 9, 10, and so on.
| | 02:21 | You can do the same thing with
data that doesn't follow a regular
| | 02:24 | progression and Excel will
calculate the Linear Regression formula and
| | 02:29 | extend the data by those amounts.
| | 02:31 | So let me show you how to do that in the table.
| | 02:34 | Let's say that you select cells A2
through B8. And this time I'm not selecting
| | 02:39 | the labels because they're not numerical values.
| | 02:42 | If I select all of these cells--and
it is important that you select the
| | 02:45 | entire data set or your results might not
be correct--and then you grab the fill
| | 02:49 | handle and drag down, you can see the tooltip
indicating the year that you're going to display.
| | 02:55 | So we'll go all the way up to 2015,
release the mouse button, and Excel extends
| | 03:00 | the series using the Linear Regression formula.
| | 03:04 | Each of these techniques lets you
extend the data series using the linear
| | 03:07 | regression shown earlier in the course.
| | 03:09 | I personally use the fill
handles to project future values.
| | 03:12 | Just be sure that you select the
entire data set, so Excel takes every value
| | 03:16 | into account when it
derives its regression equation.
| | Collapse this transcript |
| Developing a multiple regression model| 00:00 | In the previous movies in this chapter,
I showed you how to have Excel use
| | 00:03 | linear regression to derive an
equation used to project future values, and to
| | 00:08 | estimate how much of the value's
variance was due to the data in your samples.
| | 00:12 | In this movie, I'll show you how to
perform a similar analysis for multiple factors.
| | 00:16 | This technique is called multiple regression.
| | 00:19 | When you create your source data table,
you need to have two regressors, and
| | 00:23 | those are the independent variables you
want to use in the Regression formula to
| | 00:27 | derive the dependent variable.
| | 00:28 | Your two regressors--in this case
Impressions, which is the number of times an
| | 00:33 | individual sees one of your company's
ads, and Proximity which is the distance
| | 00:37 | in miles that a person lives from your store--
| | 00:40 | we want to see how these two factors
when working together relate to the revenue
| | 00:46 | that you generate, or in other
words the amount that you sold them.
| | 00:49 | To perform multiple regression analysis,
you use the Data Analysis tool, and
| | 00:55 | that is on the Data tab in the Analysis Group.
| | 00:58 | If you don't see the Analysis Group or
the Data Analysis button on the Data tab
| | 01:02 | of the Ribbon, then you need to
install the Analysis ToolPak add-in.
| | 01:06 | I showed you how to do that in
a movie early on in the course.
| | 01:10 | With the Analysis ToolPak installed, we
can click Data Analysis, and then in the
| | 01:15 | Data Analysis dialog box,
click Regression, and click OK.
| | 01:21 | When we do, the Regression dialog box
appears, and we can select our input ranges.
| | 01:27 | The first input range is the Y range.
| | 01:30 | In regression analysis, the Y values are the
dependent values, and those are the revenue.
| | 01:36 | So if I click the Collapse Dialog
button and click the Revenue column header,
| | 01:42 | when my mouse pointer is a downward-
pointing black arrow, Excel will select all
| | 01:47 | of the data in that column, and I
can click the Expand Dialog button.
| | 01:52 | Now we need to select my input ranges for
the X values, and those are the regressors.
| | 01:56 | So if I click the Collapse Dialog
button and then drag to select columns B
| | 02:05 | and C, and click Expand Dialog button,
B2 through C11 comes in as the input
| | 02:10 | range for the X values.
| | 02:12 | Now, I can set my output range.
| | 02:14 | This is a fairly large table, so
I will put it on a new worksheet.
| | 02:17 | That is selected. Everything looks good,
| | 02:19 | so I can click OK, and
Excel performs the analysis.
| | 02:23 | I'm going to expand the columns,
so it displays the entire value.
| | 02:27 | I'll do that for A, B, C, and F, G, H,
and I. And I'll click so that it's no
| | 02:37 | longer selected, and we
can see all of the values.
| | 02:40 | I will also increase the zoom because
we're going to focus on the data here at
| | 02:45 | the top, as opposed to the data on the
bottom, because you'll be familiar with
| | 02:48 | this data from previous
discussions of regression.
| | 02:50 | So I'll just zoom to about
130%, and now we can look at the
| | 02:57 | regression statistics.
| | 02:58 | Specifically, I want to
focus on the R Square value.
| | 03:03 | The R Square value is also called
the goodness-of-fit measure and what it
| | 03:07 | measures is how closely the equation
that Excel derives explains the data.
| | 03:11 | In other words, how predictive is it.
| | 03:13 | Given the two regressors, which were ad
impressions and proximity to the store,
| | 03:17 | how well do those two factors in
combination describe your data?
| | 03:21 | You can interpret the R-squared value as
a percentage of the final value that is
| | 03:25 | predicted by the two regressors, and
in this case, that percentage is 18.6%.
| | 03:31 | That's an extremely low value, which
means that the two variables together do not
| | 03:35 | do a good job of explaining
the variability in revenue.
| | 03:38 | When you see a multiple regression
result like this, which returns an extremely
| | 03:41 | low R-squared value, you should perform
a simple linear regression analysis on
| | 03:45 | the independent variable separately to
see if one independent variable has a
| | 03:49 | strong relationship to the dependent variable.
| | 03:52 | In this case, you would check Ad
Impressions to see if related, and Proximity
| | 03:57 | to see if it related.
| | Collapse this transcript |
|
|
ConclusionAdditional Resources| 00:00 | Thanks for taking the time to work
through "Excel 2007 Business Statistics".
| | 00:04 | I hope you've learned a lot, and that
you can apply the techniques you have
| | 00:07 | learned to your business.
| | 00:09 | If you want to learn more about statistics,
there are several resources that you can turn to.
| | 00:13 | The first is a book by Professor Neil J.
Salkind from the University of Kansas,
| | 00:18 | and his book is called "Statistics for
People Who (Think They) Hate Statistics".
| | 00:22 | It's a great entry-level introduction,
and in my opinion, it is the best
| | 00:26 | statistics book under $100.
| | 00:29 | The best business statistics book that
you can find over $100, in my opinion, is
| | 00:34 | "Business Statistics: A First Course (5th Edition)"
by Levine, Krehbiel, and Berenson.
| | 00:40 | They go into a lot of detail,
many different techniques,
| | 00:43 | so if you want to add some very advanced
techniques to your toolkit, that's the book to get.
| | 00:49 | And finally, on a related note, we have
"Excel 2007 Financial Analysis", one of my
| | 00:54 | own courses that's available on
the lynda.com Online Training Library.
| | 00:58 | Thanks again for your time, and best of
luck with "Excel 2007 Business Statistics".
| | Collapse this transcript |
|
|