navigate site menu

Start learning with our library of video tutorials taught by experts. Get started

Excel 2007: Business Statistics

Excel 2007: Business Statistics

with Curt Frye

 


Excel 2007: Business Statistics with Curtis Frye shows managers and executives how to assess their organization's data effectively by applying statistical analysis techniques. The course covers important statistical terms and definitions, and then dives into techniques using the tools in Excel: formulas and functions for calculating averages and standard deviations, charts and graphs for summarizing data, and the Analysis ToolPak add-in for even greater insights into data. Exercise files are included with the course.
Topics include:
  • Understanding statistical terms
  • Creating a basic Excel table
  • Auditing formulas
  • Creating frequency distributions for qualitative data
  • Calculating a running total
  • Creating a histogram
  • Using PivotTables
  • Calculating mean, median, mode, and other numerical data
  • Using probability distributions
  • Population sampling
  • Testing hypotheses
  • Developing liner and multiple regression models

show more

author
Curt Frye
subject
Business, Data Analysis, Finance
software
Excel 2007
level
Intermediate
duration
4h 19m
released
Dec 08, 2010

Share this course

Ready to join? get started


Keep up with news, tips, and latest courses.

submit Course details submit clicked more info

Please wait...

Search the closed captioning text for this course by entering the keyword you’d like to search, or browse the closed captioning text by selecting the chapter name below and choosing the video title you’d like to review.



Introduction
Welcome
00:04Hi! I'm Curt Frye. Welcome to Excel 2007 Business Statistics.
00:09In this course, I'll show you powerful techniques you can use to analyze
00:12your business data to help you make better decisions about your company and its operations.
00:17I'll start by showing you how to use Excel's built-in tools to manage your data,
00:21examine your formulas, and install the Analysis ToolPak add-in which greatly
00:26expands your ability to analyze data in Excel.
00:29Next, I'll define important statistical terms, discuss the two different types
00:34of statistical analysis, and provide an overview of how you can use statistics
00:38to help run your business.
00:40With that information in mind, I'll dive into the specifics of statistics in
00:44Excel, showing you how to summarize your data by using charts and graphs,
00:48creating frequency distributions for non-numerical data, and by calculating
00:53averages, standard deviations, and other useful statistics.
00:58The last half of the course presents more advanced techniques, such as analyzing
01:01several distinct data distributions, inferring information about a data set
01:06based on a sample, and testing hypotheses.
01:10I'll end the course with a look at forecasting and linear regression, techniques
01:14that you can use to plan the future of your business.
01:17Excel 2007 provides powerful statistical capabilities that help you gain useful insights.
01:22I'm sure you'll find this course will be time well spent.
Collapse this transcript
Using the exercise files
00:00If you are a premium member of the lynda.com Online Training Library, or if
00:04you're watching this tutorial on a DVD-ROM, you have access to the exercise
00:09files used throughout this title.
00:11The exercise files are organized within a folder named Exercise Files.
00:16If you double-click it, you'll see a folder for each of the chapters.
00:21Just to show you what the files look like within the chapter, we'll open
00:24Chapter03, and you can see that you have the exercise files.
00:28So, if you are watching a movie that's in Chapter 3, all you need to do is look
00:32in your Chapter03 exercise files folder, and you'll be able to follow along
00:36exactly as I'm working.
00:38If you are a monthly subscriber, or annual subscriber to lynda.com, you don't
00:42have access to the exercise files, but you can follow along from scratch
00:46with your own assets.
00:47Let's get started!
Collapse this transcript
1. Introducing Statistics
Introducing statistical analysis
00:00Numbers are everywhere, especially in business where your company's health and
00:03future depend on your ability to make good decisions.
00:06You can use statistical methods to analyze your data to support that work.
00:10Statistics is the practice of collecting, analyzing, and interpreting data.
00:16There are two major branches of statistics, descriptive statistics and
00:21inferential statistics.
00:23You use descriptive statistics when you know all of the values in a dataset.
00:27For example, if you examine the olives on a tree, you can count the number of
00:31olives, in this case 4,000.
00:33Then you can find the number of green, olives 3,000, the number of black olives,
00:37which are 1,000, and then find the proportion.
00:40In this case, the green olives are 75% of the total, and the black olives
00:44are 25% of the total.
00:46For inferential statistics, you take information from part of your population
00:50and extrapolate it to make guesses about your entire population.
00:55So, for example, let's say that you have a grove of 100 olive trees.
00:59You take a sample from five of those trees.
01:02You have 21,000 olives, of which 14,070 are green and 6,930 are black.
01:08You see that the percentages are green is 67%, and black is 33%.
01:14Now that's true for those five trees,
01:16but it might not be in effect--almost certainly will not be--exactly true for the
01:21entire grove of 100 trees.
01:23So what you need to ask is how close am I to the true value, and how confident
01:28am I in that estimate?
01:30When you ask how close, you're asking how close your measured value is to the true value.
01:34So, for example, black olives make up 33% of your sample.
01:38Well, the total might be 33%, but most likely it varies a little bit.
01:43So, for example, it might be 33% plus or minus 2%.
01:47So the total percentage of black olives might be between 35% and 31%. Indicating
01:53how close gives you a range.
01:55Next, you ask, how confident am I in that range?
01:58So you're asking what percentage of random samples taken from this population
02:02would fall within the range of, say 31% to 35%.
02:06The statement of how confident you are is what's called a confidence interval.
02:10I'm going to switch from olives, and use a more familiar example perhaps, when
02:14it deals with political polling.
02:17So if we say that Governor A has an approval rating of 60% with a margin of
02:22error on the survey of 3%,
02:24how confident are we of that?
02:27Most of the time, the confidence level that you will find in statistics is 95%.
02:32So in other words, 95% of the samples that we take from the entire population
02:37should produce an approval rating of between 57% and 63%.
02:41The other 5% would generate values outside of that range, for example, 55%,
02:4554%--or on the high side, 70% or 75%.
02:50But we know from inferential statistics that 95% of the time, the approval
02:54rating for a survey will be within 3% plus or minus of the actual true approval rating.
03:01So the best way to phrase that statement is, we are 95% certain that Governor A
03:07has an approval rating of 60%, plus or minus 3%.
03:11When you read an article online or pick up a newspaper, try to find the story
03:15that doesn't refer to numbers in some way.
03:17You probably won't be able to do it.
03:18Numbers are everywhere.
03:20But once you master the techniques in this course, you'll have an easier time
03:23making sense of your own data and interpreting data that others give you.
Collapse this transcript
Understanding statistical terms
00:00Statistics is a wide-ranging field of study.
00:02Anytime you describe, categorize, or interpret data, you're performing
00:07statistical analysis.
00:09In this movie, I will introduce and define terms we'll encounter throughout the course.
00:13The most basic term in statistics is data.
00:17So, what constitutes data?
00:19I prefer to define the term "data" as values that describe an attribute of
00:23an object or an event.
00:25I've made by definition vague on purpose.
00:27Almost anything can be considered data.
00:30For example, you could record an employee's salary, the number of bottles of
00:34olive oil your company sold last week, or how much your company spent on boxes last year.
00:39Finally, just a quick note on usage:
00:41the term data is technically plural, and singular is datum if you're referring
00:46to a single data point,
00:48but you usually only hear people use the word datum in academic circles,
00:52so data is fine in casual conversation.
00:55Next, you can talk about a data set.
00:56A data set is made up of two components:
01:00observational units and variables.
01:03The observational units are people or things or events on which data is collected.
01:08As an example, suppose you have a data set that contains the name, age, and
01:13salary of every employee at the company.
01:16The employees are the observational units, and their names, ages, and
01:19salaries are the variables.
01:22The term "population" refers to every member of the group of interest.
01:26The data set in the example above is population data, since every employee is in the data set.
01:31You might also hear about a researcher referring to a sample.
01:35A sample is a subset of a population.
01:38You can select a sample using any number of rules,
01:40but the most statistically valid way to do so is to select a random sample.
01:44I used the word "random" in a previous definition,
01:48so let me be explicit about what it means to say that something is random.
01:51A random value is a value selected entirely by chance.
01:55Values generated using an algorithm, such as the one behind Excel 2007's RAND,
02:00function aren't truly random.
02:02They are what are called pseudorandom numbers.
02:05In other words, the numbers look random at first glance,
02:07but if you know the algorithm and the starting point, you could predict the
02:10sequence of numbers that come out.
02:12Pseudorandom numbers are usually good enough for most applications,
02:15but it is possible to generate truly random numbers using the physical systems
02:20by tracking weather phenomena and the like.
02:23Next, I'd like to mention data distributions.
02:26When you examine your data set, you can calculate how often individual values
02:29occur, and plot those frequencies on a chart.
02:32The curve that contains those frequencies is called the distribution.
02:36You might be familiar with the normal distribution, which is more commonly known
02:39as the bell-shaped curve.
02:41In this chart, you see that there are more values near the average and that
02:45values are distributed with decreasing frequency as they move away from the average.
02:48There are other distributions in addition to the normal distribution.
02:52I'll cover them later in this course.
02:53Anytime you describe an event, you could assign either a probability of the
02:58event will occur or the odds that the event will occur.
03:01Probability and odds are related and often confused concepts.
03:05So let's start with probability.
03:06Probability is the likelihood of something happening that is expressed using a
03:10percentage or phrasing, such as X out of Y times.
03:14For example, if you flip a coin, it should come up heads one out of two times.
03:19If you state the probability as a percentage, you would say that a coin is 50%
03:22likely to land with the head side up.
03:25Odds, by contrast, are an expression of the likelihood of an event occurring as
03:29compared to the number of times it won't.
03:31For example, three are six numbers on a casino die,
03:34so the odds of the number four appearing are five to one against.
03:38That is, five ways the number four won't appear versus one where it will.
03:42The final term that I would like to define is correlation, which is the tendency
03:47for changes of one variable to be related to changes in another variable.
03:51For example, a person's height at maturity is strongly correlated with his or
03:56her parent's heights.
03:57Now it's very important that you not confuse correlation with causation.
04:01Just because two values follow the same pattern, doesn't mean the changes in one
04:05causes changes in the other.
04:07Finally, I'd like to show you this comic strip from xkcd.com, which points out
04:11some of the problems with correlation and causation.
04:14In this movie, I've defined some of the terms I'll use throughout the course.
04:18I'll expand on these definitions as I go along,
04:20but in the next movie, I'll show you how to apply statistics in a variety
04:24of situations.
Collapse this transcript
Using statistics
00:00In this movie, I'll show you a few ways you can use statistics to analyze your
00:04company's data to gain valuable information about your operations.
00:07For example, you can use statistics to discover the relative popularity of your products.
00:12For example, if the Two Trees Olive Oil Company sells four different infusions
00:17of olive oil, then you can create a pie chart that indicates the share that each
00:21of those infusions has toward total sales.
00:24So, for example, you could indicate the numbers for lemon, for the mandarin
00:29infusion, and so on.
00:31You can also use descriptive statistics to describe your data.
00:35So let's say, for example, that you have a table that contains a month's
00:38worth of sales data.
00:40You can use the descriptive analysis tools to summarize your data, and you get
00:44quite a few different measures.
00:45You have the mean, which is the average, standard error, the median.
00:50You get an entire range of values, and
00:51I'll show you how to use most of these within the course.
00:55Next, you can analyze how your data is distributed.
00:57So let's say once again that you are looking at daily sales for the Two Trees
01:01Retail Shop, and you find that your average day's sales is 2,000 with a standard
01:08deviation, which I'll explain many times during the course, is 500.
01:12That means that the vast majority of your sales days will have totals between
01:17$1,500 and $2,500, with values around 3,000 or 3,200 being much more rare.
01:25So if you have a number of extremely high or extremely low sales days in a row,
01:29you need to look at what's going on to see if you can make any decisions or
01:33changes about your operation. Or if you have higher sales than you expected, try
01:38to find out what's going right and keep doing it.
01:40Finally, you can use linear regression techniques to determine how your sales
01:45might improve in the future.
01:46For example, if the Two Trees Olive Oil Company has had a number of articles
01:50written about it in the press, you might want to see how closely the number of
01:54articles relates to the number of sales.
01:57To indicate that in the chart, you would draw a trendline or linear regression
02:02line to see how well those two variables move together,
02:05in other words, how strongly they are correlated.
02:08Analyzing data using statistical methods helps you discover information about
02:12your company's operations.
02:14But assessing that information to make decisions is a human skill.
02:17You need to take your business environment, competition, and operations into
02:21account when you decide how to use the information you develop.
Collapse this transcript
2. Learning Useful Excel Techniques
Creating and using an Excel table
00:00Excel 2007 introduced a new tool that helps you manage your worksheet data effectively.
00:05That tool is the Excel table.
00:07You'll be using Excel tables throughout this course, so I wanted to show you how
00:11to create them and also how to refer to them within formulas.
00:16To create an Excel table from existing data, you should have a data list with
00:20column headers, no blank rows, and no extra data around the table data.
00:25So, for example here in the cell range A2 through B15, you'll see that I have
00:30dates 1 January through 13 January, and then also sales data.
00:35I have nothing: no headers, no extraneous data in any cell around the table data.
00:41That way, when I create the table, the only thing that would be included is this
00:44dat. And also, I have no blank rows; everything is taken up.
00:50It's okay if you have a blank cell. Say, for example, if I didn't have a value
00:53for January 9, that cell could be blank.
00:56But both cells could not be blank, because then that would break the table, it
01:00would break the chain, and Excel, when you create the table, would only include
01:04the values in those cells, including the headers.
01:09To create the table, you click any cell in the data list that you want to make a
01:13table, and then on the Home tab of the Ribbon, click Format As Table, and then
01:19select a table style.
01:21When you do, Excel displays the format as Table dialog box.
01:25Verify that the range it identified, A2 through B15, which is correct here--
01:30it's in the moving outline. That is correct.
01:33Then also My table has headers,
01:36that check box should be selected, because Sales and Date are the headers.
01:42Everything looks good, so you can click OK, and Excel creates the table.
01:47A couple of things to notice about the table:
01:48First, you will notice that a new contextual tab appeared, and that is the Design
01:54contextual tab under Table tools.
01:56That allows you to affect the table as a whole.
01:59Also, you'll see that the table headers have filter arrows that you can use to
02:04filter and sort your data within the table.
02:07Another advantage is that your table has a name.
02:11Now, the name, when it starts, is going to be something like table 1, table
02:142, table 12, whatever.
02:16But because you are going to be referring to your table data within formulas,
02:20you should rename the table, so it's more descriptive.
02:22That makes your formulas easier to understand.
02:26To rename a table, on the Design contextual tab, click in the Table Name box, and
02:31then type the new name.
02:31I will call this one "Daily Sales," and press Return.
02:37Now, instead of referring to cells, say B3 through B15, which is the sales data,
02:43you can use the table name.
02:45To do that, you type in equal sign, and in this case I'll just make a SUM
02:49formula to find the total of those values, left parentheses, and then you start
02:54typing in the name of the table, and that's "Daily Sales."
02:59You see that Excel, in the formula auto complete list, displays the name of the table here.
03:05While it's highlighted, I can press the Tab key to accept it. But I'm not quite
03:09there, because I want to refer to the data in the sales column.
03:13To refer to a column, type a left square bracket, and then you'll see a list of
03:19the available columns, and other elements of the table that I won't go into now.
03:24I want to use the sales data,
03:25so I'll type in S. While Sales is highlighted, I'll press Tab to accept it, so I
03:31have the Sales column.
03:33Then I press a right square bracket and a right parenthesis to close the
03:38formula. Everything looks right.
03:39Press Enter and Excel creates the formula.
03:43Now, the formula that I created indicates the name of the table and the column
03:49from the table that I'm using in the formula.
03:52It makes it much easier to understand, especially if you either come back to a
03:55worksheet after you've been away from it for a while, or if you need to share
03:59your workbook with a colleague and they need to find out how the formulas work.
04:03If you add a row of data to your table, then Excel updates the formula to accept it.
04:08So, for example, if I click cell B15 and then press Tab, Excel creates a new table row.
04:14Then I can type the date, 1/14, which is rendered as 14-January.
04:18Then if I type in a value such as 5,000, Excel updates the formula's result, but
04:24notice that the formula has not changed.
04:27In other words, it didn't change to include cell B16 within the formula
04:30explicitly, because that's implied by Daily Sales and the Sales column.
04:36Excel tables let you manage your data easily and refer to your data and formulas
04:40without using cell references, which can be hard to understand.
04:44Be sure to rename your table to reflect the data it contains, which makes your
04:47formulas even more effective.
Collapse this transcript
Installing the Analysis ToolPak
00:00Another element of Excel 2007 that you will use several times during this course
00:05is the Analysis ToolPak.
00:07The Analysis ToolPak is an add-in that comes with Excel 2007, but isn't
00:11necessarily available when you install the program.
00:14In this movie, I'll show you how to install the Analysis ToolPak and to access
00:18its tools from Excel.
00:20An add-in, as the name implies, is a group of capabilities you can add to Excel.
00:25To make the data Analysis ToolPak available, first you click the Office button,
00:30and then click Excel Options.
00:33In Excel Option dialog box, click Add- Ins to display that page of the Excel
00:39Options dialog box, and then you'll see a list of inactive application add-ins.
00:43And right at the top of the list is the Analysis ToolPak, and its type is Excel add-in.
00:50To make the Analysis ToolPak available to install it with an Excel, go down to
00:55the Manage list and if necessary, select Excel Add-ins--again, that's the type of
01:01the Analysis ToolPak--and then click Go.
01:05Then in the Add-Ins dialog box, select the Analysis ToolPak check box.
01:10You don't need the Analysis ToolPak - VBA add-in because that stands for Visual
01:15Basic for Applications, and you won't be doing any programming.
01:19Click OK and Excel adds the add-in.
01:23If you didn't do a complete installation of Excel 2007, then you might see a
01:28dialog box that asks you if you want to configure Excel 2007.
01:33Indicate that you do, and you might need to have your Office 2007 installation
01:37disk handy for Excel to pull the files on.
01:40But once you get the add-in installed, then you can use it, and after you make
01:45the add-in available, it appears on the Data tab. And you see it here in the New [00:01: 51.04] Analysis Group, and it is called Data Analysis.
01:54I won't go to all the details of how to run it, or its output, but what I will
01:58do is show you how to access it.
02:01First thing you do is click any cell in the group of data that you want to
02:04analyze--in this case the Excel table--and then click Data Analysis.
02:09In the Data Analysis dialog box, you can select whatever tool that you want to use.
02:15So, for example, if you click Descriptive Statistics and click OK, Excel
02:20displays the Descriptive Statistics dialog box.
02:22I will stop here for now, but I just wanted to show you how to install the
02:27Analysis ToolPak Add-In, and also how to run it.
02:30The Analysis ToolPak Add-In extends Excel's capabilities significantly.
02:35You'll use its functions several times in this course, so you should install it
02:38before you start performing your analysis.
Collapse this transcript
Auditing formulas step by step
00:00Excel formulas, particularly those that combine values from more than two or
00:03three cells, can be difficult to comprehend when you encounter them.
00:07If you find a formula you don't fully understand the first time you see it,
00:11you can use the Excel Formula Auditing tools to step through it and get more information.
00:16So let's say that you have a worksheet that is creating a series of
00:21seasonal forecasts.
00:23You'll see this worksheet later in the course, so don't worry about the details
00:26for now. But what I want to do is show you how to analyze an individual formula
00:31within this worksheet.
00:32For example, let's take a look at the value and the formula in cell C10.
00:39The formula is displayed here in the formula bar, and it's relatively simple.
00:42It multiplies the value in B6 by 0.67 and adds that to the value in cell
00:48C6 multiplied by 0.33.
00:50But let's say that, for whatever reason, you're having a hard time getting your
00:53head around that formula.
00:54There are some things that you can do Excel to give you more information, and allow
00:59you to analyze that formula.
01:01To do that, you click the Formulas tab, and then in the Formula Auditing group,
01:06you can click Trace Precedents.
01:10Clicking Trace Precedents while you have a cell with a formula selected draws
01:14arrows from the cells that provide values for that formula; in this case, cells
01:19B6 and C6 provide values for the formula in the cell C10.
01:25If you want to remove those arrows, you can click the Remove Arrows button.
01:30Similarly, if you want to find the cells that use a particular cell in their own
01:35formulas, then you could click, say, cell B6 and then click Trace Dependents.
01:44In this case, the value in cell B6 is only used in the formula in cell C10.
01:50And again, you can click the Remove Arrows button to get rid of the tracer arrows.
01:56Now let's click cell C10 again and evaluate the formula in cell C10 one step at a time.
02:04To do that, again, on the Formulas tab, with these cell with the formula you want
02:08to examine selected, click Evaluate Formula.
02:12The Evaluate Formula dialog box opens, and you can see the formula as it is spelled out.
02:18Now what's going to happen when we click the Evaluate button is that Excel is
02:23going to evaluate the next term in the equation.
02:26In this case, it's underlined, and that is the value in cell B6.
02:31So when we click Evaluate, or when you press Enter because that button is
02:35highlighted, Excel changes the value to 067 multiplied by 61,875. And again, that
02:43value came from cell B6.
02:46When we click Evaluate again, Excel will evaluate the underlined term 0.67
02:52multiplied by that number. Click Evaluate and you get the result.
02:57Next, it's going to evaluate C6, which has 52,937.
03:01So when we click Evaluate, that number appears there. And now the formula, we'll
03:07multiply those two values together. And finally, Excel is going to add the
03:12values 17,000 and change plus 41,000 and change. Click Evaluate and you get your final answer.
03:20If you want to go through the formula again step by step, you can click Restart
03:24to go back to the beginning.
03:26When you're done, click Close.
03:29The more time you spend acquainting yourself with the formulas on the workbook,
03:32the sooner you'll be able to manage that workbook effectively.
03:35Excel 2007's Formula Auditing tools can provide powerful insights into what might
03:40seem like impenetrable logic.
Collapse this transcript
3. Summarizing Data Using Tables and Graphics
Creating frequency distributions for qualitative data
00:00When you think of data you store in an Excel worksheet, you usually think of
00:03numbers such as item costs, the total of a customer's order, or your electric bill.
00:08In some cases though, you will work with qualitative data, such as the name of a
00:12product, a state, or a city.
00:15In this movie, I'll show you how to summarize occurrences of qualitative values
00:19using Excel's built-in formulas and functions.
00:23So I have an Excel table filled with data. And in this case we have order numbers,
00:28and each individual order was for a single type of olive oil sold by the Two
00:34Trees Olive Oil Company.
00:36So I have lemon, garlic, jalapeno, orange, and so on.
00:39If you want to count the number of times each of those infusions was ordered,
00:43you need to do two things:
00:45The first thing you need to do is to identify all of the infusions that were
00:48ordered as listed in the Excel table.
00:51Second, you need to count the occurrences of each value within the list.
00:56To find all of the different values in column B, you go to the Data tab and then
01:02click Advanced Filter.
01:04When you click Advanced Filter the dialog box of the same name appears.
01:08The action that you want to take is to copy the results of the filter to another
01:12location, and we'll fill in our target value here. And Copy to,
01:17let's put it starting in cell D1.
01:19Now the List range is currently listed as A1 through B29--
01:26that's the entire table; instead, we should make it B1 through B29. That way we
01:34only get the values in column B, the Infusion column.
01:38If we were to leave it as A1, then every row would be different, so the filter
01:43would have no effect.
01:44In other words, because each value in the Order Number column is different, then
01:48every row in the table would, by necessity, be different, or be unique.
01:53Then we want to check the Unique records only check box.
01:58What that will do is it will find each of the different values in the range B1 through B9.
02:05When we click OK, Excel displays the list.
02:09Now, Infusion is the header,
02:10so I'll just format that as a header by pressing Ctrl+B to make it bold. And now
02:16I can create formulas that will count the occurrence of lemon, garlic, orange,
02:21jalapeno, and so on in my range.
02:25To do that, we will use a COUNTIF formula.
02:28So I will type = countif(, and now I identify the range.
02:35The range is the area of the Excel table that includes the data that I want to look at,
02:41so that is the Infusion column.
02:44So I will move my mouse pointer over the top of the column header in the table,
02:50and you'll see that my mouse pointer changed to a downward-pointing black arrow.
02:54When I click Excel identifies the table, Table1, and the column.
02:59Now I can type a comma and enter in the criteria.
03:03The value that I want to look for is in cell D2.
03:05In this case, that's lemon. So I'll type D2.
03:09Again, I'm looking the value by reference instead of by naming it directly.
03:13You'll see why that's an advantage in a second.
03:16I'll type a right parenthesis to end the formula. It looks correct.
03:20When I add the formula, you see that we have nine occurrences.
03:24Now what I'll do is I will copy the formula down by clicking the cell and then
03:29grabbing the fill handle at the bottom-right corner, and drag it down.
03:35When I do Excel extends the formula, and it does it in this way.
03:39Remember that the formula in the first cell is COUNTIF, Table1 Infusion, D2.
03:43D2 is the criteria.
03:48When I copied the formula down, Excel took advantage of the fact that this
03:52reference here, D2, is a relative reference.
03:57Copying the formula down one row caused the cell reference at D2 to change to
04:03D3, and then in this cell it's D4, and in this cell it's D5.
04:09Now with this data you can create a column chart.
04:13To do that, select the data that you want to display in the chart, click Insert,
04:19and then click the type of column chart you want to create.
04:23Doing so creates a chart with a visual representation of the data that you just created.
04:28You can use the COUNTIF function to find and count text values in the dataset.
04:33By naming the text you want to find in the formula's criteria, you can summarize
04:37qualitative values using numbers that you can analyze using traditional methods.
Collapse this transcript
Summarizing data using bar graphs and pie charts
00:00Running a company generates a lot of data, often more than you can easily
00:03comprehend by looking at the individual values.
00:06If you want to summarize your data visually, you can create a chart.
00:10In this movie, I'll show you how to create three common charts: the column chart,
00:15a line chart, and a pie chart.
00:18You create a column chart to summarize category data.
00:21So, for example, let's say that you were selling olive oil through
00:25three different vectors, and that would be the web, in your store, and also by catalog.
00:31If you wanted to create a column chart, you can click any cell in your Excel
00:35table or in your data range--
00:38it doesn't have to be a table--
00:39then on the Insert tab in the Charts group, click the type of chart that you want to create--
00:45in this case, a column chart-- and then click the subtype.
00:49We'll just click a regular Clustered Column, 2-dimensional chart.
00:53When you do, Excel creates a chart with the Amount as the title, and Amount
00:59corresponds here to the column header for the data.
01:03You can also see the axis here along this side, indicating the relative values,
01:08and then the three different channels here: Web, Store, and Catalog.
01:13If you have data that's spread out over a time series, you can create a line chart.
01:18So I'll click the Line Chart sheet tab and here you see I have data for the
01:24years 2007 through 2010, and again we are looking at Sales.
01:29To create a line chart, you click any cell in the data collection Insert > click
01:35Line, and we'll just click the basic 2-D line chart.
01:40Now, notice that the chart doesn't look as we expected.
01:43Instead of having the years 2007, 2009, and 2009, 2010 along the bottom, we
01:49have 1, 2, 3, and 4.
01:51We have a second line here at the bottom, this blue line called Year, which
01:55corresponds to value 2007 through 2010.
01:59So what happened was that Excel misinterpreted the data in the year column
02:03and thought that it should be plotted as a separate series instead of using it as the column,
02:09so you need to adjust the chart.
02:11To do that, with the chart selected on the Design contextual tab, click Select Data.
02:17Now you can indicate which data that you want to work as that horizontal or
02:22category values. But first, we can get rid of Year from the Legend Entries--and
02:28those are the series of data that provide data for the lines.
02:32So you can click Year and then click Remove.
02:36Now in the Horizontal or Category Axis Labels group, click the Edit button, and
02:41then you need to identify which values you want to use.
02:45So select the range of A2 through A5, click OK, and you see you get your correct
02:53values: 2007 through 2010.
02:57Now when you click OK, the chart looks as expected.
03:00Finally, I'm going to show you how to create a pie chart.
03:04A pie chart is used to summarize proportion.
03:07So, for example, if you had sales for four different varieties--lemon, mandarin,
03:12rosemary, and jalapeno--you can create a pie chart to see the relative
03:16contribution of each of those infusions.
03:19To do that, you click any cell in the data list or table, and then on the Insert
03:24tab click Pie and then click the type of chart you want to create.
03:28In this case, just a regular 2-D pie chart.
03:31Doing so creates a chart, and you can see that lemon, mandarin, rosemary, and
03:37jalapeno are all represented in the body of the chart by their color, and the
03:42size of the wedge indicates their relative contribution.
03:45So, for example, because the mandarin value is 2,500, we would expect that to
03:50have the largest pie wedge.
03:52Mandarin is red. And sure enough, it does have the largest wedge.
03:56If you hover your mouse pointer over one of the wedges, it gives you information
03:59about the data that it summarizes.
04:02The series is called Sales, the data point is Mandarin, and the value is 2,500,
04:08and it also indicates the proportion of the whole that it represents--in this case, 36%.
04:15Yes, what you just saw was not an optical illusion: after awhile the tooltip that
04:19appears does disappear after about 10 or 15 seconds.
04:24If you hover your mouse pointer over another wedge, then you'll get a
04:27similar display for there.
04:29Charts help you summarize your data visually, allowing your audience to grasp
04:33your data's high-level trends and patterns.
04:35After you summarize your data at this level, you can break it down into
04:38smaller chunks, examining various aspects of your operation using charts or
04:42the raw numbers.
Collapse this transcript
Creating a frequency distribution using the FREQUENCY function
00:00When your business generates a large data set, such as the details for thousands
00:04of orders, you don't always want to summarize your data by value.
00:07In other words, you probably don't need to count how many wholesale customers
00:11ordered one case of olive oil, two cases, three cases, and so on.
00:14Rather than count the occurrences of each value, you can summarize your data set
00:19by dividing values into bins.
00:21For example, you could create bins for orders of up to 10, between 11 and 20
00:27cases, 21 to 30, 31 to 40, and so on.
00:32To create your frequency formula in a column anywhere on the worksheet,
00:36for ease of use I'll do it right next to the Bins. Select one more cell
00:41than there are bins.
00:42So in other words, we have four bins listed, so I'm going to select five cells.
00:48It's very important that those cells remain selected while you're typing in the formula.
00:52Now with those cells selected, you can create the frequency formula.
00:56To do that, type =FREQUENCY and then your data array.
01:01In this worksheet, my Excel table is called a CaseCount.
01:06So I will type in Ca, and when CaseCount is highlighted in the AutoComplete
01:11list, I will press Tab.
01:13I want to use the values in the Cases column.
01:16So I'll just type C for Cases. It appears.
01:19Press Tab and then a right square bracket to close the table reference.
01:24Now I'll type a Tab to move onto the bins array and the bins occur in cells D2 through D5.
01:30The formula looks correct.
01:34I will type a right parenthesis.
01:36Now, instead of pressing Enter, I'm going to hold down the Ctrl and Shift
01:41keys and press Enter.
01:43This is called an array formula.
01:46Another Excel commentator calls these formulas CSE formulas.
01:51The reason he does that is because you hold down the Ctrl key, the Shift key,
01:54and then press Enter.
01:56Excel creates an array of formulas.
01:59You can tell they are in an array, because in the formula bar you can see that the
02:03formula is outlined in curly brackets.
02:07So that's how you know that you created an array formula.
02:10If you click an individual cell, such as here, you'll see the same formula, here, and so on.
02:16What the array formula is doing is referencing the bin in the same relative
02:21position in its column as the cell that contains the formula.
02:25So, for example, here for 36, it is counting the number of orders of cases up to 10.
02:32Then 32 is the count of orders up to 20.
02:3437 is the count of orders up to 30.
02:3735 is the count of orders up to 40.
02:41Then 60 is the number of orders of more than 40 cases.
02:45Now one column to the right, I'm going to show you what happens if you create
02:49the formula incorrectly just by pressing Enter,
02:51so you'll be able to recognize what happened, just in case you get a wrong.
02:55So I will select cells F2 through F6, =FREQUENCY, and I'll just go through this
03:02quickly, CaseCount and then Cases, comma, and the bins are still in D2 through D5.
03:10And instead of pressing Ctrl+ Shift+Enter, I'll just press Enter.
03:15When I do, Excel creates a formula, but it only looks at the first bin.
03:20It doesn't create the formula for the rest of these cells.
03:24So there are two things to look for when you create an array formula:
03:27the first is that Excel fills the entire array of selected cells with the
03:32formula and secondly the formula is surrounded by curly brackets.
03:37That's the indicator that you have created an array formula.
03:41Now that you have your data, you can use it to create a chart or whatever you like.
03:45The FREQUENCY function lets you summarize your data by dividing your values into
03:49bins and counting the number values that fall into each bin.
03:52These summaries help make sense of large data sets without forcing you to create
03:56filters or group your data using more cumbersome procedures.
Collapse this transcript
Creating a histogram
00:00In the previous movie, I showed you how to use Excel's FREQUENCY function to
00:04divide your data into bins and to count the number of values that fell into each bin.
00:08In this movie, I'll show you how to create a chart called a histogram, which
00:12summarizes that bin information visually.
00:15To create a histogram, you need to use the Analysis ToolPak.
00:18If the Data Analysis button doesn't appear on the Data tab of the Ribbon at the
00:23far right side, then you need to install the add-in.
00:27I explained how to install the add-in in the movie in Chapter01.
00:30To begin click of the Data Analysis button--
00:33again, that's on the Data tab--and in the Data Analysis dialog box, click
00:39Histogram and then click OK.
00:42In the Histogram dialog box, select the Input Range, which in this case is here in the table.
00:50So click the Contract dialog button, and then in the worksheet, I'll just use my
00:55scroll wheel to move up one row.
00:59You want the data in the Cases column of the table, so move your mouse pointer
01:04over the column header, which says Cases. And when the pointer turns into a
01:08downward-pointing black arrow, click, and Excel selects B2 through B201.
01:15Click the Expand dialog button and then click Bin Range.
01:19Now something you need to watch out for is that you don't want to select the
01:24column label for Bins here, D2 through D5, because what you did when you selected
01:31this column in the table was you selected B2 through B201, but Excel did not
01:36select D1, which contains the column label.
01:40So for the Bin Range, you want to select D2 through D5.
01:46When you select those, Excel gives you that range, and you don't have to worry
01:50about the column label being included.
01:53If you did select data and included column labels in it, then you would
01:57check the Labels box.
01:59We didn't in this case, so we don't have to.
02:02I can also select the Chart Output box.
02:06Selecting that box will cause Excel to create a histogram chart in addition to
02:11the bin frequency distribution data.
02:14I do want to put the output on a New Worksheet. Everything looks good,
02:17so I'll click OK and have Excel create my worksheet.
02:21Excel summarizes the data by bin.
02:24The first one contains orders up to 10 cases.
02:27Next one is 11 to 20, 21 to 30, 31 to 40, and finally, more than 40.
02:34It summarizes that data in this histogram, which is a column chart broken down by category.
02:40Histograms summarizes your data visually by creating a chart that depicts how the
02:43values fall into data ranges, or bins, that you define.
02:47When the ranges of your values matter more than the individual values for a
02:50particular analysis, use a histogram to summarize that data.
Collapse this transcript
Calculating a running total
00:00When you summarize data in an Excel table, you can find totals, averages, and
00:04counts of data in each column.
00:06One feature that's not built-in, however, is the ability to calculate the
00:10cumulative percentage that data contributes to the total.
00:13In this movie, I'll show you how to calculate a running total and also a
00:17running contribution.
00:19In the table on the screen right now, the third column contains the total of the
00:23values in the current row and all the rows above it.
00:26So, for example, in cell D3, we have the value 327, and the formula--which I'm
00:32not going to show you yet--calculates the total of all the cells in column B in
00:38the same row and above.
00:40So, for example, the first cell contains the value 199, and the second contains
00:45the value 327, which is 199+128 and so on, down the line.
00:52The formulas in column D--which again, I'm not going to show you yet--
00:56calculate the percentage that this running total is of the total of all
01:01values in column B.
01:02Now I am going to delete these formulas and show you how to create them.
01:10First thing you do is click cell C2-- that is the first cell in the Running Total
01:15column--and you type this formula: "=sum(B2", and then press F4. Pressing F4
01:28changes the relative cell reference, B2, to the absolute reference of B2. And what
01:34that means is that--well first, functionally it adds Dollar signs in front of the
01:38B and in front of the 2.
01:40The B is the column reference, the 2 is the row reference, and the Dollar sign
01:45means that that reference will not change when you copy it to another cell.
01:50And I'm going to do that in a moment.
01:52So we'll just see how it works, because the sum is of the range from B2 to B2.
02:01And then I'll type a right parenthesis.
02:02Now what's going to happen is this reference, the first B2, is going to stay the
02:07same regardless of where the formula is copied, and the second B2 will vary.
02:13So, in other words, when the formula gets copied to this cell it will be B3 here,
02:17here it will be B4, and so on, because you're moving down one row at a time, and
02:22that means the row number will increase.
02:25When I press Enter, Excel fills the entire table column with formulas.
02:30You can see here that the second data column contains =SUM($B2:$B3), ($B2:$B4), ($B2:$B5), and so on.
02:41The error indicator is there because your formula omits adjacent cells.
02:48In other words, what it saying is that you have numerical values in every cell
02:52below B4, so it thinks that you might want to include them in your total.
02:58You don't, and I'll show you how to turn that error indicator off at the end of the movie.
03:02Now let's do the same thing for a running contribution, in other words the
03:06percentage of the total--which is 2870-- that in individual cell value--199, 327, or
03:14so on--contributes to that total.
03:18To do that, type "=C2"--and this time we are going to leave the reference
03:24relative, so it will change--divided by the sum of the Amount column's values.
03:32So the name of the table is OrderSummary, so I'll type that and then accept the
03:38AutoComplete offering. Left square bracket. And we want to use the values in the
03:44Amount column, so I'll type an A and then Tab to except the formula
03:49AutoComplete. Right square bracket. Right parenthesis to close. Press Return
03:55and we get our cumulative percentage.
03:58So 199 is 6.93% of 2,870, 327 is 11.39% of 2,870, and so on.
04:09Now I'll show you how to get rid of the error indicators.
04:11There are number of ways to do it, but in this context the quickest way is to
04:15select any cell that displays that error--and the indicator is the green
04:19triangle in the upper left-hand corner-- click the Error Options button that
04:23appears, and then click Error Checking Options.
04:28On the Formulas tab, you can go down to the Error Checking Rules area and clear
04:33any check box next to an error that you don't want to display.
04:37In this case, we don't want to display an error for formulas which omit cells in a region,
04:43so we'll clear that check box,
04:45click OK, and the indicators disappear.
04:48Finding a running total and running percentage help you visualize how your
04:51revenue accumulates over time.
04:53If you operate a retail business, you might find that the first 10 months of
04:57the year contribute about half of your revenue, with the remaining half
05:01generated during the last two months of the year, which is the traditional
05:04gift giving season.
Collapse this transcript
Summarizing data using a PivotTable
00:00You can use Excel to organize your data in many different ways, but the standard
00:04Excel worksheet doesn't let you reorganize your data on the fly.
00:07However, Excel does have a built-in tool that lets you change your data's
00:11organization and summary dynamically.
00:13That tool is called a PivotTable.
00:16To create a PivotTable, first you need a data list, such as an Excel table.
00:21The Excel table I have here has the data for year, month, sales channels--such
00:25as web, catalog, or store--and then the sales for the particular vector during a month.
00:32To create a PivotTable, you select any cell in your data list, and then on the
00:38Insert tab, click the PivotTable button.
00:42In the Create PivotTable dialog box, you make sure that Excel has correctly
00:47identified the table or range that you wanted to use to create the PivotTable--in
00:52this case it's Table1, which is outlined here--and then you choose where you want
00:56the PivotTable to be placed.
00:58I always place my PivotTables on a new worksheet.
01:01That way the PivotTable isn't competing with the original data.
01:05Now if you had an existing worksheet that had some charts or some other
01:09information that you used as part of a dashboard, then you could certainly
01:12create your PivotTable there.
01:14But for the most part, because PivotTables can get kind of large, you will want
01:18to put it on its own worksheet.
01:20Everything looks good, so I will click OK, and Excel creates the PivotTable on a new worksheet.
01:26Now, I can add fields to the PivotTable.
01:28So let's say, for example, that I just wanted to summarize sales for each year.
01:32I'll put those in the Row Labels area.
01:36When I do, Excel updates the PivotTable
01:39so that the row headers are all of the years that are available within the data
01:43set, and now I can add that the sales channel to the Column Labels area.
01:50So I have 2008, 2009, 2010;
01:53Store, Web, and Catalog. And finally, I can add the Sales values to the Values area.
02:02So in 2008 I had over $61,000 in catalog sales, $82,000 in in-store sales, and
02:10over $143,000 in web sales.
02:14I also have a grand total for each of the years: 287,000, 277, 285, and so on.
02:20If I add another layer of organization within the PivotTable--such as by adding
02:25a month, which I'd do by dragging it from the Field List area to the Row Labels
02:30area, underneath Year--you will see that now 2008 has individual results for
02:36January, February, and so on.
02:38And the grand total is here at the top, 287,872. That's the Grand Total for 2008;
02:452009, 277; and so on.
02:50And then I also have the total for each month.
02:53So in January we had these values for catalog sales, store and web, and they add
02:58up to this total, and the same for the other 12 months.
03:01If I scroll down to the bottom of the PivotTable, we will see that we have
03:06grand totals for all of the columns Catalog, Store, and Web, and also a grand
03:12total of those values.
03:13If you want, you can hide the subtotals and grand totals or move them around.
03:17To do that, you click any cell in the PivotTable, and then on the Design
03:21Contextual tab, you use the tools in the Layout group.
03:25So, for example, if you wanted to hide your subtotals, you can click Do Not
03:29Show Subtotals. And I personally prefer to show subtotals at the bottom of a
03:34group, as opposed to the top.
03:35So that's one change I always make.
03:37I always click Show all Subtotals at the Bottom of the Group.
03:42So now instead of having 2010 and the subtotals here, I have the 2010 subtotals
03:50at the bottom and the grand totals were in the usual place.
03:54If you want, you can turn grand totals, which are at the end of a column or
03:59the end of a row, on or off. Or you can turn them on for rows or on for columns only.
04:04So, for example, if I only want the totals for the columns, I can click On for
04:09Columns Only, and the grand totals at the right edge--which would summarize each
04:14month or each year--disappear. But I still have my grand totals at the bottom,
04:20which are for my three different sales vectors.
04:22Now I am going to take the Month field out of the PivotTable again, by dragging
04:27it from the Row Labels area to the Fields area. And I'll scroll back up.
04:32Right now, I'm summarizing the data in my PivotTable using the SUM function;
04:36in other words, I am finding the arithmetic sum of all the values within each of
04:40these categories: catalog sales for 2008, 2009, 2010, and so on.
04:45If I want, I can add a second instance of the sales values to the Values field by
04:51dragging Sales down to the Values area. And you see I have Sum of Sales1 and Sum
04:58of Sales2, and that organization is reflected here in the Values box.
05:05What I'll do is change the summary operation for Sum of Sales2, so that instead
05:10of the arithmetic sum--the total--we're going to have the percent contribution to
05:15a column or to a row.
05:17So I will click Sum of Sales2 and click Value Field Settings.
05:24In the Value Field Settings dialog box, I'll click Show values as. And then in
05:28the Show values as list, I will click % of row.
05:35Now when I click OK, Excel indicates that the first value is 21% of the total,
05:41the second is 28.59% of the total, and the third is 49.96% of the total.
05:49In other words, your web sales--which were $143,000--contributed almost 50% of the total value.
05:57Now those values are little but hard to read because there are no commas, so I
06:00am going to change the Number format for the Sum of Sales field.
06:04To do that, I will go down to Sum of Sales, click the down arrow, and click
06:09Value Field Settings.
06:11Now I'll click Number Format, and we can use the Format Cells dialog box's
06:16controls to change the format.
06:18In this case, I will change it to Currency. And I don't have any Decimal places--
06:23no sense, in other words--I will change Decimal places to 0. Click OK. Click OK
06:30again, and my values have changed, and now they are easier to read.
06:34PivotTables give you enormous flexibility when you analyze your data.
06:39If you take the time to learn how to use them effectively, you'll find that you
06:42can develop useful insights into your data very quickly.
06:45For more information on analyzing data with PivotTables, see my course "Excel 2007:
06:51Pivot Tables for Data Analysis" elsewhere in the lynda.com Online
06:55Learning Library.
Collapse this transcript
4. Describing Data Using Numerical Methods
Calculating mean, median, and mode
00:00When you analyze a data set, you can often gather a great deal of information
00:04about the data by finding its central values.
00:07In this movie, I'll show you how to calculate three different types of central
00:10values: the mean, the median, and the mode.
00:14The common word for the arithmetic mean of a data set is the average.
00:18It's the sum of the values divided by the number values.
00:22So, for example, in this table here, I have seven days and then the sales
00:27totals for those days.
00:29If I want find the average in cell B1, I can type "=average(" and then the name of
00:37the table, which is DailySales. So I will start typing it and then press Tab to
00:43accept the formula AutoComplete value.
00:46Now I want the Sales column, so I'll type a left square bracket, start typing
00:51Sales, Sales is highlighted,
00:53Tab to accept it, right square bracket, and then a right parenthesis to close the formula.
00:59When I press Enter, the value appears: 3,825.9. Averages can be misleading if you
01:07have one or two disproportionately large values in a data set, such as days
01:11where customers made unusually big orders.
01:14So let's say, for example, that on March 1, instead of having $5,194 in sales
01:19that a customer had ordered enough to push the total up to $20,000.
01:24When you press Enter, you see that the mean has gone up to almost $6,000, when all
01:30of the other values except for that one 20,000 are still in the $3,000-$4,000
01:36range for the most part.
01:37So always be careful when you're examining your data to make sure that you don't
01:43include any really large values, anything that doesn't seem to fit with the rest
01:47of the data pattern.
01:48Now if those values occur a couple of times, in your estimation, well it's not
01:52unusual to have days where there are huge orders, then certainly include them;
01:56that's part of the business knowledge that you have about your operation. But if
02:00it's just one instance, then don't worry about it; you should exclude it.
02:04So I guess the question is how do you do that?
02:06Well you do that by filtering your value out of the table.
02:11So, for example, if you want to create a filter, you could click the Sales
02:14column's Filter arrow, click Number Filters that are Less Than, and let's say
02:2010,000 as a reasonable upper bound.
02:23When we click OK, Excel filters the table, but you'll notice that the
02:29mean didn't change;
02:30it still looks at all the values.
02:31What we can do, however, is add a total row at the bottom of the table, and that
02:35will display the average of only the visible values.
02:39To do that, you click any cell in the table, and then on the Design Contextual
02:43tab, check the Total Row box.
02:47When you do that, it displays a total, but if you click the cell in the Total
02:52Row, click the down arrow, and then click the AVERAGE function, Excel displays a
02:58value of 3,598, which is the average of only the visible values.
03:02If you remove the filter, by clicking the Filter arrow and then clicking Clear
03:06Filter from Sales, then Excel's AVERAGE function here in the Total Row gives the
03:11same answer as the formula in cell B1.
03:14Next I would like to talk about the median value.
03:17The median value of a data set is the value in the middle of the values if they
03:21were sorted in ascending order-- that is, from lowest to highest.
03:25To discover the median in the data set, you use the MEDIAN function, so you type
03:32"=MEDIAN(" and then the name of the table, which is CustomerCount,
03:38and we are using the data in the Customers field, or column.
03:44Press tab to accept the field name. Right square bracket. Right parenthesis.
03:49Everything looks right. And we get the median value of 116.
03:54Now the values aren't sorted in the table by customers, so to do that--to
03:58make sure that 116 is the proper value--I will sort these values from
04:03smallest to largest. And you see that the value that appears in the middle is
04:07116, which is the median.
04:10The median value is not affected by extremely large or extremely small values.
04:15So, for example, if I were to change the value 130 to 500, or even 5,000, and
04:23press Return, the median doesn't change because 116 is still in the middle.
04:27The reason that happened is because 5,000 is still in the same half, if you were,
04:33of the data set. So we have 49, 51, to 74 that are all less than 116, and 123,
04:39126, and 5,000 that are all greater than 116.
04:43If I were to change 5,000 to 1 and press Return, it would move to a different
04:47half of the data set, and you'll notice that 74 is the new median, which we can
04:53verify by resorting the data from smallest to largest. And you see that 74 is
04:59now in the median position.
05:02Finally, I would like to talk about the mode.
05:04The mode of a data collection is the most common value.
05:08In Excel 2007, if multiple values are tied for the highest frequency, the program
05:12picks the lowest value as the mode.
05:15So, for example, in the data set here, if I want to find the most common
05:18value in the Customers column, type "=MODE(" and then the name of the table,
05:26which is "HourlyCustomers," and we're looking in the Customers column, using
05:33that data. Right square bracket. Right parenthesis to close. Return. And the
05:39MODE formula returns a value 28.
05:41And if we sort this data like we did last time, sorting from smallest to largest,
05:46then we'll see that indeed the number 28 occurs three separate times.
05:50The number 14 occurs twice. And because it's lower than 28, if I were to change
05:54one of the 28s to another value that doesn't occur--let's say 30, which doesn't
05:59appear anywhere else--and hit Return, then the mode sees that there are two
06:03occurrences of 28, two occurrences of 14, and using its own rules, it selects the
06:09lower value, 14, as the mode.
06:12Calculating modes is only useful for what's called "discrete data," where the
06:16values are whole numbers such as 10, 25, or 50.
06:20Continuous data, where the measurements could have decimal components,
06:24either won't have the most common value or there will be a duplication
06:28that occurs entirely by chance.
06:30Most of these statistical techniques you will use in this course rely heavily on
06:34the mean, or average, value of a data set. And unless your data set contains one
06:39or two extremely large or extremely small values that distort the average, it's
06:43the most useful value.
06:44If you're interested in the value that would be in the middle of a range if
06:47you sorted the values in ascending or descending order, calculate the median.
06:52Calculating the mode, the most common value in a data set, could point out the
06:55most common answer in a survey and help you discover information about your
06:59customers and their opinions about your business.
Collapse this transcript
Calculating a moving average
00:00When you track data over time, you might want to find the average of the last three values.
00:04For example, one measure of your company's recent performance might be its
00:08average sales over the past three months.
00:10You can calculate a moving average in a worksheet by creating a series of
00:14formulas by hand, or by using a tool in the Data Analysis ToolPak.
00:18To use the tool in the Data Analysis ToolPak, you select the Data tab on the
00:23Ribbon and then click Data Analysis.
00:27If Data Analysis doesn't appear, then you need to install the Analysis ToolPak,
00:33which is available as an add-in.
00:35I showed you how to install that add-in in a movie in chapter 1.
00:40With it installed, click Data Analysis, and then in the dialog box that
00:44appears, scroll down and click Moving Average. Click OK and the Moving Average
00:51dialog box appears.
00:53Now you can select your Input Range.
00:54These are the values that Excel will use to calculate the moving average.
00:58In this case, I want to use cells C2 through C13, and I did not select the Sales label,
01:07so I don't need to check that box.
01:09Next, I need to set the Interval.
01:12The interval is the number values that you want to include in your moving average.
01:17In this case, I will click 3.
01:19That means that the moving average that will be calculated for these three cells
01:25will appear in cell D4.
01:27The moving average for these three cells, C6 through C4, will be displayed
01:32in cell D6, and so on.
01:34Now I'll select my Output Range, and those are the cells here, D2 through D13.
01:44I don't want a chart.
01:45I don't have to worry about standard errors.
01:47We will cover those later in the course. Now I will click OK, and Excel adds a
01:51new column to my table, and it also displays the moving averages.
01:55The first thing I'll do is change the name of the table, Column.
01:59To do that, you click the cell where the table column header appears, and I will
02:04call it "Moving Average," and double- click the right edge of the column header to
02:13expand it, so it displays the entire contents of the cell.
02:16Now we can look at the moving average formulas.
02:18The first two cells in the column, because I had an interval of three, do not
02:23have a value. That's because there aren't three values that you could use to
02:27calculate your moving average. But the value in D4 is a moving average, and what
02:32Excel did was it created the formula =AVERAGE C2 to C4.
02:37So it finds the average of these cells and puts that value here. And it
02:43continues on, C3 through C5, C4 through C6, and so on.
02:49Moving averages offer great insights into your data, but you should adjust the
02:54number of values you include based on your data set's characteristics.
02:57For a call center that receives thousands of calls a day, you might take an
03:01average of the last seven days. For a smaller company that receives five or ten
03:05outside orders a day, perhaps a moving average based on monthly the data would
03:08be more appropriate.
Collapse this transcript
Calculating a seasonal forecast
00:00Many businesses' sales patterns follow predictable patterns based on the time of year.
00:05Garden stores tend to sell the most plants during the spring, and toy stores
00:08tend to sell a huge percentage of their yearly total during the winter gift-giving season.
00:14If you believe your data follows a seasonal trend, you can create a series of
00:18formulas to forecast your company's sales based on those patterns.
00:21In this example, I have quarterly data, and because it's broken up by quarter,
00:26there will possibly be a pattern that would apply based on each quarter's
00:30positioning within the year.
00:32We will be comparing Q1 to Q1; for example, Q1 2006 to Q1 2007,
00:37do the same for Q2 and Q2 and so on. But if your data tracks calls to a
00:43call center, you could also compare Monday to Monday, Tuesday to Tuesday, and so on.
00:47There are many ways to estimate the effect of seasons on corporate revenue:
00:51one way to do it is to create a forecast based on the actual sales and forecast
00:55from the last two quarters.
00:57There's no precise way to decide what way you should get to the two factors, but
01:01a reasonable way to start is to multiply the sales from the quarter a year ago
01:06by two thirds and the sales from the same quarter two years ago by one third;
01:10that way you're giving more weight to recent performance, but you're not
01:13discounting what has happened earlier.
01:15For the first four quarters, which would be cells C2 through C5, there can be no
01:21forecast because there is no previous data.
01:24For the second set of quarters, the quarters for the second year, you only have
01:28the previous sales to go on, so you create that formula.
01:32We have equal =B2. And again, this is Q1 2007, and the only applicable data you
01:39have us from Q1 2006, so that's how you create your formula.
01:44If you copy that formula down, by dragging the fill handle, until it covers
01:48Q4, then you see that you've repeated the values from the previous quarters a year ago.
01:55Now starting with your third year, you can begin to use your weighted formula.
02:00To do that, you type equal, and then we're taking the value from two years ago--
02:05same quarter--multiplied by one third. So that would be B2 times 0.33, plus the
02:12value from the same quarter one year ago, and that is the cell B6, outlined in
02:18green, and we will multiply that by 0.67.
02:22When we press Return, we get a forecast of 58,925, which is reasonably close to
02:28the actual sales of 63,822.
02:32Now if we copy the formula down, Excel extends the formula and uses the values
02:40from four rows above and from eight rows above.
02:43So, for example, here for Q2 2008, we're looking at cell B7, which is Q2
02:50of 2007, and B3, which is Q2 of 2006, and the value 61,000--reasonably close. Not great.
03:00Then we have 58,000 and 70,000 as forecasted, and again, there will be variations in
03:05business cycles, and so on.
03:07What you're looking for over time is a formula that best approximates your
03:12business's performance, and there's no real way to tell you what it's going to be the start;
03:16you will just have to find it on your own.
03:18The data in this table presents a challenge because the quarters for the first
03:21three years show little growth.
03:23Seasonal forecasts start working well when you have at least five years of data
03:27to draw on, but keep adjusting the formula's weights as times goes by to see if you
03:32can find a good fit.
Collapse this transcript
Calculating percentiles and quartiles
00:00Large data sets often contain a wide range of values, so it can be useful to
00:04find the top and bottom percentages of the data.
00:07For example, if you track sales data, you might want to discover the value that
00:11delineates the top 20% of your sales days from the bottom 80%;
00:15alternatively, you might want to find the value at each quarter, 25%, 50%, and 75%.
00:22These values are called quartiles because they divide the data set into four quarters.
00:28To calculate a percentile in Excel 2007, you use the PERCENTILE function.
00:34So, for example, if you wanted to find the 50th percentile--that is the value
00:38where half the data set is above the value and half the data set is below--you
00:43would create the formula =PERCENTILE, and then you need to name the array. And
00:50the array in this case is called Table1-- it's the Excel table at the left side of
00:56the screen. And we are looking at the data in the Sales column,
01:00so I will type S and accept Sales from the formula AutoComplete list.
01:04Type a right square bracket and then a comma.
01:07The second argument in the PERCENTILE function is K, and K is simply the percentile.
01:13So I'm looking for 50%. Type in 50%. Right parenthesis to close the formula, and
01:21you see that the value 1,185 is the value where half the data set as above and half
01:27the data set is below.
01:29If you wanted to calculate the 90th percentile, you would create a similar
01:32formula: =PERCENTILE, Table1, Sales, 90%, right parenthesis. Everything looks
01:44good. Hit Return and that value is 2,167.
01:49So 10% of the values in the data set will be 2,167 or above. 90% will be
01:55less than that amount.
01:56If you want to see the exact values, that represent, say the top 10% of a data
02:02set, you can create what's called a top 10 filter.
02:05To do that, you go into your Excel table and click the Filter arrow of the
02:10column that you want to filter--in this case it's Sales data--point to Number
02:15Filters, and click Top 10.
02:18In this case, we are looking for the top 10%, so top is correct, 10 is correct,
02:23but we need to change items 2%.
02:27When we do, click OK, and Excel displays the values that represent the top 10%
02:32of values from this data set.
02:34You can click the Filter arrow again and clear the Filter.
02:39As I mentioned before, quartiles represent the 25th, 50th, and 75th percentile,
02:45plus the lowest and highest values.
02:48So let's say that you wanted to find the 25% level, which is the first quartile.
02:53To do that, you create a formula using the QUARTILE function, so it's =q. QUARTILE
02:57is the first option. Press Tab to accept it. And it's Table1, Sales, comma, and
03:08quart is short for quartile.
03:11You can have quartiles from 0 to 4. And in this case we are looking for 25%,
03:16so that is the first quartile. Right parenthesis to close the formula. And the
03:21value is 405.5. Even though the data set doesn't include any fractional values,
03:28the quartile value says that 25% of the data will be below that value, which
03:33means that if there's a 405 it will be in the bottom 25%, or in the first quartile.
03:38Let's do the same thing for 75%.
03:41Type =QUARTILE, Table1, Sales, comma. And we are looking for the third quartile.
03:51You can calculate the quartile by dividing the percentage by 25. So in this case,
03:56we have 75% divided by 25 is 3, so we are looking for the third quartile.
04:01Everything looks good. And that value is 1,835.5.
04:05So a quarter of the values will be above that level, and 75% will be below.
04:11If you were to set the quartile to 0, that would return the lowest value, and
04:15it's equivalent to the MIN function.
04:18So in other words, if I were to type QUARTILE, Table1, Sales, and then 0, that
04:26would return the lowest value, which is 159.
04:29If I were to edit that formula and change the quartile to 4, that would be the
04:36highest value, and would be equivalent to the max formula.
04:39So if I hit Return, we get 2,347, which is the largest value.
04:46The last topic I'd like to mention in this movie is called the
04:49interquartile range.
04:51You can find two adjacent quartile values and subtract the lower from the higher
04:54value to let you know how much spread you have between your quartiles.
04:57For example, in this case, I will calculate an interquartile range for the third
05:03and the second quartiles.
05:05To do that, type =QUARTILE.
05:08The array is Table1, Sales, quartile 3, and then right parenthesis, minus, quartile,
05:21Table1, Sales, quartile 2. Right parenthesis. Everything looks good. Hit Return
05:29and you see that the difference between those two quartile markers is 650.5. An
05:35interquartile range gives you an idea of how spread out your data is, and like
05:39the median--which I discussed in a previous movie--is relatively unaffected by
05:43unusually high or low values.
05:46In the next movie, I'll show you how to calculate a data set's variance and
05:49standard deviation, which give you even more powerful insights into your data.
Collapse this transcript
Calculating variance and standard deviation
00:00In previous movies, I've shown you how to summarize your data using a variety of
00:03techniques: from finding the mean,
00:06also called the average; the median, which is the middle value when you sort
00:09your data into an ascending or descending order; and the mode, which is the most common value.
00:15In this movie, I'll show you how to use a data set's mean to derive two
00:18very useful statistics:
00:20variance and the standard deviation.
00:23As the name implies, variance is a measure of how much the data in a set
00:27diverges from the average.
00:29You calculate variance by subtracting each individual value from the average,
00:32squaring the result, adding the square values together, and dividing by the
00:36number values in the data set. That's a mouthful.
00:39It's a lot to keep in your head at one time, so I'll show you an example.
00:43The data set that I have contains five values: 5, 10, 15, 20, and 25. To calculate
00:48the variance for these cells,
00:50I will type the formula, equal, and then a left parenthesis because I'm going to be creating a two-step formula.
00:57and that means that I need to use parenthesis to group what I'm doing.
01:01So you'll see how it works out in a second.
01:03So it's =AVERAGE(, and then we're going to be finding the average of the values
01:11in the range A1 through A5. So it's A1.
01:15I am going to copy this formula to other cells, but I don't want this reference
01:19to change when I do.
01:20So instead of using a relative reference like I have now, I'm going to press F4
01:25to make it an absolute reference which won't change.
01:28Now I will type :A5, and again F4 to make it an absolute reference, and a right parenthesis.
01:37Now I'm going to subtract the value in A1 from the average and square the result.
01:43So I subtract A1, and I do want this cell reference to change when I copy the formula,
01:50so I'll leave it as a relative reference. Type a right parenthesis. And now I'm
01:55going to square the result.
01:57So that is a caret, or capital 6, if you think of it that way 2.
02:02So this formula finds the average of the range A1 through A5, subtracts the
02:07value in A1 from that average, and then squares the result.
02:11When I press Enter, Excel displays 100.
02:14Now when I copy that formula down, we get 100, 25, 0, 25, and 100.
02:24And let's take a look at the value in B1, the result of the formula, to make sure
02:27that it makes sense.
02:29So we have the values 5, 10, 15, 20, and 25 in A1 through A5, so we have 15, 30,
02:3550, 75 divided by 5, so the average is 15.
02:40If you subtract 5 from 15, you get 10, and if you square the result, you get 100.
02:45So the formula is working properly.
02:47Now to calculate the variance, we create the formula =SUM,(B1:B5),
02:56and I'm not going to copy the formula anywhere so it doesn't matter that
02:59it's a relative reference.
03:01And then we divide it by the number of values in our range, so that would be
03:05COUNT(A1:A5).
03:09So we are finding the sum of the variances squared and dividing by the
03:17count of A1 through A5.
03:20When I press Enter, we get a variance of 50.
03:24There is an easier way to do that.
03:25I just wanted to walk you through it to show you.
03:28You can use the variance or variance for the population function to get the same result.
03:34So to do that, you type "=VARP"--which calculates the variance based on an entire
03:39population--and then values in A1 through A5. Right parenthesis. Press Return
03:48and we get the value of 50.
03:49There are two functions that you can use to calculate variance:
03:53VAR and VARP. The one that I used here,
03:56VARP, is what you use when you know every data point in a population. In other
04:01words I know that this is my entire data population, so I can use VARP.
04:07If, on the other hand, I'm working from a sample, and I don't know every value in
04:10a data population, then I need to use the VAR formula.
04:14So if I were to type "=VAR(A1:A5)" and press Tab, then I get the value 62.5, which
04:26is different from the VARP formula. Why is that?
04:30Well, the reason is that it divides the sum of the squares by one less than the
04:34size of the population.
04:36In other words, instead of taking COUNT A1 through A5, it divides by COUNT A1 through A5-1.
04:42So let me change that here by subtracting one from the COUNT: just add a
04:48parenthesis, minus one, and then a right parenthesis to close it.
04:53The formula as it reads now produces the same result as the VAR function.
04:58Dividing the sum of squares by one less than the size of the population makes it
05:03a more conservative measure, which reflects the uncertainty inherent in
05:07calculations using a sample of a population.
05:10In practice, we use the VAR function much more often than VARP, because we
05:13rarely have data from the entire population.
05:16Now variance is interesting, but it's not used that much in statistics.
05:21However, you can use a data set's variance to calculate its standard deviation.
05:25To find a data set's standard deviation, you take the square root of the variance.
05:30So, for example, if the variance for this data set were 62.5, and we take the
05:37square root of that value, which is in B9, and press Tab, we get 7.9, so on.
05:47So is there a separate function to calculate standard deviation?
05:50Well, of course there is. There are two.
05:52And in the same way that you can calculate variance based on either a sample or
05:57the entire population and use the VAR or VARP function, standard deviation has
06:03the STDEV and STDEVP functions that you can use.
06:07And I will show you both of them.
06:09So, for example, if you wanted to find the standard deviation of the values
06:12in A1 through A5, you can type "STDEV(A1:A5)", Return, and you get the same
06:23value that we had here.
06:25If you know all the values, you can use STDEVP, which is =STDEVP(A1:A5), Return,
06:38and you get 7.07, which, as it turns out, is the square root of 50.
06:44So, everything matches up.
06:46Standard deviation provides extremely useful information about a data set.
06:50As I'll show you throughout the rest of the course, knowing a data set's standard
06:53deviation lets you make predictions about the entire data set, even if you only
06:57know a portion of the values.
Collapse this transcript
Summarizing data using descriptive statistics
00:00Excel has quite a few statistics functions you can use to summarize your data.
00:04Rather than make you create individual formulas for each measure, Excel groups
00:08the most commonly used formulas into a single descriptive statistics tool, which
00:13is available through the Analysis ToolPak.
00:15To use the Analysis ToolPak, you click the Data tab. And then if you installed
00:21the Analysis ToolPak add-in, you will see it here under the Data Analysis
00:26button, in the Analysis group.
00:28If you don't see that button on the Data tab, then you need to install the
00:32Analysis ToolPak add-in.
00:33I explained how to do that in a movie earlier in this course.
00:37To begin your analysis, click the Data tab and then click Data Analysis. And in
00:43the Data Analysis dialog box, click Descriptive Statistics and click OK.
00:49When you do, the Descriptive Statistics dialog box opens.
00:53You can now select your Input Range. And the data that we want to analyze is in
00:57this column of our table, so we will click the Collapse dialog button at the
01:02right edge of the Input Range box, and then select this table column.
01:08To do that, you move your mouse pointer over the header until the mouse pointer
01:13changes to downward-pointing black arrow, and when you click, Excel select cells
01:19B2 through B32, skipping the header cell here in B1.
01:24So our data is in B2 through B32, so everything looks correct.
01:29The data is grouped by columns. And we don't have a label in the first row, so
01:33we don't need to check that box. And I do want to put the data on a new
01:37worksheet, so I will leave all of those the same.
01:40If I did wanted to put my descriptive statistics on the same worksheet, then I
01:45could select the Output Range option, and select another cell, such as D2. But in
01:51this case, I do want to put it on the new worksheet, and I will click the
01:55Summary Statistics option.
01:58Everything looks good. Click OK and Excel creates my new worksheet.
02:03I am going to format it a little bit, just so it makes it easier to read.
02:07So I will double-click the right edge of the column A column header.
02:11When I do, Excel expands the column to display the widest value, Standard
02:17Deviation in column A, and it also displays the values for each of these
02:22measures in column B. And I'll zoom in just a little bit to make the data easier to read.
02:29The Descriptive Statistics tool calculates many different measures.
02:33So, for example, you have the mean, which is the average, standard error, which is
02:37the Standard Deviation divided by the square root of the sample size.
02:41You have the media and the mode-- in this case there was no value that
02:43occurred more than once--standard deviation, variance--which, again, is the
02:48square of the standard deviation.
02:50Now these two values we are not going to be using in this course, but I'm going
02:54to explain them briefly.
02:55Kurtosis represents how peaked a curve is; in other words, if you look at
03:00the data on a curve, how many values occur toward the center versus how many
03:05occur toward the side?
03:07If you have a lot more values toward the center, then kurtosis is higher.
03:10In this case, the data set is fairly well spread out or fairly even, so
03:14kurtosis is quite low.
03:16Skewness tells you whether the tail of the curve is to the left of the average,
03:19which is negative skew--which is what we have here, but just barely--or to the
03:23right, which is positive skew.
03:25And finally, range is the spread between the minimum value and the maximum value.
03:31If you want to take a general look at your data set before performing more
03:35detailed analyses, running the Descriptive Analysis tool provides a host of
03:39commonly used analyses in a few steps.
03:41Once you've looked over these values, you can continue your analysis using
03:45other methodologies.
Collapse this transcript
Summarizing data using z-scores
00:00Statisticians divide data sets into two general categories based on size.
00:05Large data sets have 30 or more elements, and small data sets have less than 30.
00:10In this movie, I'll show you how to analyze data sets using z-scores, which
00:14are used with large data sets.
00:17A z-score and also a T score--which we will see in the next movie--represent the
00:21number of standard deviations a value is from the mean.
00:24To look at some values we have here, I have a data set over here on the left
00:29that are daily sales for the month of March. And I happen to know, because I
00:35created the data, that the mean, or average, of sales is 1,151.
00:39So what I am going to do is use Excel's ZTEST function to calculate the
00:45probability that this mean of 1,200 is greater than the population's actual mean
00:51based on the mean from the sample.
00:53So this sample here is not the entire data set; it's a sample of it. And the mean
00:59of the sample is 1151.
01:01So the question is what is the probability that the mean of 1200 is greater than
01:06the population's actual mean based on the data we have?
01:09So in cell D5, I will create the formula =ZTEST, and the data we are going to
01:16use is in the Sales column of the Daily Sales table.
01:20So I'll move my mouse pointer over the header of the Sales column, and when the
01:26mouse pointer changes into a downward- pointing black arrow, click, and Excel
01:30fills in that table reference in the cell.
01:34Now type a comma. And then x is the mean that we want to test.
01:39That value is in cell D8, so we will click there. Right parenthesis. Formula
01:45looks good. Type Return and we get the value of .6479.
01:50So that means that it's about 65% likely that this mean of 1,200 is greater than
01:56the mean of the data population.
01:59Now how confident are we in our 64.7%, or 64.8?
02:05Well, for that we can use the z-scores for various confidence levels.
02:09A confidence level indicates the number of values that will occur within a
02:14specific number of standard deviations of the mean, plus or minus.
02:18So here we can see that 50% of the values in a data set will occur within .674
02:27standard deviations of the mean, plus or minus.
02:30And we are assuming that our data set conforms to the normal data distribution.
02:35That's the most common distribution, so we use it.
02:37But if you for any reason suspect your data might not conform to the normal
02:42distribution, then you won't be able to use z-scores. But most of the time it will.
02:46Within one standard deviation of the mean plus or minus, you will find 68.3% of
02:51your values; within 1.645 standard deviations, 90%; and the most commonly used z
02:58score of 1.96 standard deviations plus or minus, you will find 95% of the values
03:04in your data set within 1.96 standard deviations of the mean.
03:09Calculating the number standard deviations a value falls from the mean helps
03:13you analyze your data in many different ways.
03:15Z scores are used with large data sets, and they help you decide how confident
03:19you are that a value falls within a certain range on the curve.
Collapse this transcript
Summarizing data using t-scores
00:00In the previous movie, I discussed z-scores, which you use to analyze data sets
00:04with 30 or more elements.
00:06In this movie, I'll discuss t-scores, which you use with small data sets--those
00:11with fewer than 30 elements.
00:13Like a z-score, a t-score measures the number of standard deviations from a mean
00:17that a value is, but the formula takes the smaller sample size into account.
00:22In Excel, you calculate a target t-score by using the TINVERSE function, which
00:27takes two arguments of probability and degrees of freedom.
00:30So the first thing we need to do is to calculate the probability, and the
00:35probability we are calculating is the probability that the difference
00:39between two sample averages--in this case, income and average order--is due to random chance.
00:47To find that probability, we are going to create a t-test formula.
00:51So in cell E1 type =TTEST(, and then in the Levels
00:58table, which is here on the left, we are going to use the Income column.
01:03So I move my mouse pointer until it is a downward-pointing black arrow over
01:08the Income column header, and when I click, Excel fills in that reference in the formula.
01:14Now, I type a comma and do the same thing for the Average Order column.
01:20So I have my black arrow.
01:21It's over the column header. Click there and I have my Average Order. And type a comma.
01:28The next argument is tails.
01:30So in other words, if I were to put in the value 2, I would be looking at
01:35whether the value is either higher or lower--
01:37in other words, if it's in the extremes of the curve, either above or below the
01:42average. In this case, I do want that because I am checking to see whether the
01:46two differ--and I don't care in which direction--so I'll type a 2 for the number of tails.
01:51If I only wanted to know whether the value was greater than what we find, then I
01:55would have typed 1. That would look to the right edge of the curve.
01:59Finally, we need to tell Excel what type of t-test we want to perform.
02:03There are three types, and the only type we are going to deal with in this
02:07course is where we are comparing the means of two separate columns of data, and
02:12that is test number 1, the most common.
02:16If you go into the Excel Help file, you'll see the descriptions of the other
02:19two, but they're only used in special circumstances.
02:22So we have our formula, and I'll close it with a right parenthesis. Everything
02:26looks good, and we have our probability of a very small number--that is six
02:31zeroes, followed by a 6, percentage.
02:34So in other words, it is a very low probability that the differences between the
02:39averages of income and average order are due to chance.
02:44With that data, we can calculate our t-score.
02:48To calculate the t-score, we use the TINVERSE function.
02:51So we have =TINV(. The first argument asks for our probability, and that is in cell E1.
03:00We just calculated that. Comma.
03:03Then we have our degrees of freedom.
03:06Degrees of freedom is the number of measurements in your data set -1.
03:11The reason it's called degrees of freedom is outside the scope of this course,
03:15but just a way to remember it is you have say 20 measures over here,
03:1920 elements, so your degrees of freedom are 19.
03:23To enact that into the formula, type COUNT( and then the
03:29Average Order column, right parenthesis. And because we're calculating degrees
03:36of freedom, which is one less than the number of elements, we take the count
03:40and subtract one. Right parenthesis, and press Return, and we get our value of 10.29.
03:48We interpret that t-score using the number of degrees of freedom and
03:53our confidence level.
03:54So in other words, if we want to be 95% certain, that the difference in the
03:58income average as compared to the difference of the average order is
04:03statistically significant at the 95% level,
04:07we need to go to our t lookup table, which I have on this separate worksheet.
04:12So again, our t-score is 10.29.
04:15Go in to the other worksheet, we can scroll down in our Degrees of Freedom
04:19table, and with 19 degrees of freedom, we see at the 95% level that we need to
04:25have a t-score of 1.729 or greater.
04:28Our t-score is over 10, so it's very likely that the difference between the
04:34averages of income and order vary together significantly.
04:39Calculating the number of standard deviations a value falls from the mean
04:42helps you analyze your data in many different ways.
04:44T-scores, which represent the number of standard deviations from the mean in a
04:48small data set, figures prominently in these formulas.
04:51We'll use it many more times later on in this course.
Collapse this transcript
Calculating covariance
00:00Business owners are always looking for relationships between two sets of data--
00:04whether sales discounts results in higher revenue or whether the colors used in
00:08an advertisement make more customers come to your store.
00:11In this movie, I'll show you how to calculate one measure of change: covariance.
00:15In the next movie, I'll show you how to calculate a related measure
00:19called correlation.
00:20Covariance is the average of the products of deviations for each data point in
00:25a pair of data sets.
00:27In other words, you find the average or mean for two columns of data, calculate
00:31how far each value is from its column's average, and then multiply the
00:35difference for each pair of values.
00:37So let me show you what that means within the worksheet here.
00:41The question I'm interested in answering is whether temperature affects the
00:45number of customers that come in to my store.
00:48So I have a column of temperature data and customer data, and these are matched.
00:52So in other words, on the day that the temperature was 50 degrees; 24 customers came
00:57into the store; and the day it was 53, 20 came in; and so on.
01:00So what I want to do is subtract 50 from the average in this column, square that
01:06difference, and then add it to the same calculation of the customers.
01:09I want to calculate the amount that 24 customers is below, or different from, the
01:15average number of customers that I have per day, and then add those two
01:19differences together.
01:20In cell C2, I'll type "=(A2-average(", and then I'm going to be working with the
01:31data in the Temperature column,
01:33so I move my mouse pointer over the column header.
01:36When it turns into a downward- pointing black arrow, I can click, and Excel
01:41enters in the Temperature column.
01:43And the reason it says Temperature instead of the entire table reference is
01:47because I'm referencing this column of the table from another column or
01:51cell within the table.
01:53So it doesn't need to say the table name; it just says, look in this table in
01:56the Temperature column.
01:58I type 2, right parenthesis, so that I have A2, minus the average of that columns
02:04values, multiplied by, left parenthesis, B2, and we are creating the same formula
02:10for the Customers column.
02:12So it is B2-average, and we're selecting the Customers column.
02:18Now, I can press the right parenthesis and then another right parenthesis to
02:23close my function. Everything looks good.
02:25Press Return and you see the values.
02:28To give you an idea of what's going on within this formula, I'm going to leave
02:32cell C3 selected, and then I am going to walk through each individual step of
02:37this formula, so that you can see how it's getting its value.
02:40To do that, I will click on the Formulas tab on the Ribbon and click the
02:44Evaluate Formula button.
02:47So, Excel displays the formula, and it underlines the next element that is
02:52going to be calculated.
02:53So if I click the Evaluate button, Excel fills in the value of 53 from cell A3;
02:59that was the reference that was underlined before. Click Evaluate again, and
03:03now it's going to look at the average of A23 through A26. That average is 64.64.
03:09Now we're going to subtract this, too. Evaluate and I'll just step through.
03:12We have the result of that previous calculation, and now we're going to look at
03:17the value in B3, which is 20. Find the average of the customer's value.
03:21First, it identifies all the cells in that column.
03:24Then it finds the value 44.08. It subtracts 44.08 from 20, and then it
03:31multiplies -11.64 by -24.08.
03:36Click Evaluate and you get your difference of 280.2912.
03:44So that's how those calculations are done.
03:47So now, how do I use this list of differences?
03:49Well, what I do is I find the average of all those differences. And at the
03:53bottom of my table I have a total row.
03:56Right now, the total row is finding the sum of all of these differences.
04:00What I want it to do is find the average.
04:03So to do that, I click the total row cell that I am interested in, click the
04:06down arrow, and click Average.
04:09You see that I have a covariance of 115.0288.
04:13There is a quicker way to do this, and that is by creating a covariance formula.
04:18So in cell F1, I am going to click =COVAR.
04:23C-O-V-A-R, that's the name of the function you use, and then it's array1 and array2.
04:28Array1 is Temperature, and a comma, and then Customers ByTemp--
04:34that's the name of the table--and then the Customers column. Right parenthesis.
04:38Everything looks good. Press Return and I get my same value of 115.0288.
04:44A positive covariance means the variable seem to be positively related,
04:48meaning that they increase or decrease at the same time.
04:51If you multiply two negative values together, you get a positive result.
04:55So the more times both deviations are negative or both are positive, the larger
04:59the covariance will be.
Collapse this transcript
Calculating correlation
00:00In the previous movie, I showed you one way to measure how two variables tend to
00:04change together by using covariance.
00:06In this movie, I'll show you another very useful way to measure how two
00:10variables tend to change together; that is correlation.
00:13correlation is related to covariance, but it uses a different formula
00:17to generate its value.
00:18The idea is that a correlation coefficient can have a value from negative one
00:22to one, and this worksheet that I have here displays those three examples.
00:27And these are all perfect examples, meaning that they'd probably never occur in
00:32real life, but they give the example of what a correlation of 1, -1, and 0 would look like.
00:37Correlation of one is what's called positive correlation, or perfect
00:41positive correlation.
00:42That means that whenever one value increases, the other value increases at
00:46exactly the same rate.
00:48So the data here, we have 1, 2, 3, 4, 5, and that is the data I used to plot
00:53the horizontal axis. And then the vertical axis also contains the data 1, 2, 3, 4, 5,
01:00so it increases at the same rate.
01:03Correlation of negative one means that there's a negative correlation, where one
01:07value goes up, the other goes down, and again, in exactly the same proportion.
01:12So we have 1, 2, 3, 4, 5, and Column 2:
01:155, 4, 3, 2, 1. And you can see how that correlation is represented in the body of the chart.
01:21A correlation of zero means that there is absolutely no relationship between the
01:25two data columns whatsoever.
01:28So, for example, if you look at the first pair, you see that I have 1 and 1,
01:32and that's started here.
01:34The next row is 2 and 1, so that is here, 3 and 1, and so on. And I have every
01:42possible pair of values.
01:43I have a 1 and a 1, a 1 and a 2, and 1 and a 3, and I have similar pairs for 2 and 3
01:48as my starting values.
01:50So that means, as you can see in the plot, that you cannot use the value on the
01:54horizontal axis to predict anything about the data, or the value, that will occur
01:59on the vertical axis.
02:01It's completely spread out, so that means that this example has a correlation of zero.
02:05Now let's take a look at a business case to show you how you can use
02:08correlation in your own work.
02:10To do that, we'll switch to the Correlation worksheet.
02:14The data in this Excel table shows how the number of customers varies by
02:19day, and the days measure is the number of days after your company has sent out a mailer.
02:25So, in other words, you want to see how effective the mailer was, but you also are
02:29interested in how much that mailer's effect trailed off over time, and that is
02:33what this table measures.
02:35If you want to create a correlation formula for this data, you can do so
02:39by clicking equal--and I am in cell E1--so it's equal and then C-O-R-R-E-L,
02:44that's the correlation function. Left parenthesis. And then you just identify
02:48the data you want to use.
02:50So the first column is days, and I selected that column by moving my mouse
02:56pointer over the column header. And when the mouse pointer changed to a downward-
03:00pointing black arrow, I clicked.
03:02When I did, Excel selected that table's column.
03:06Type a comma, and I'll select the Customers column.
03:10So I have Days and Customers, those are my two arrays. Right parenthesis. Press
03:15Return and I get my correlation of -0.9, and a small decimal value after that.
03:21As a general rule, correlation coefficients with absolute values--
03:25that is, the distance from zero--a three or less are considered to be weak,
03:290.3 to 0.7 are considered moderate, and 0.7 or higher are considered strong.
03:35So in this case, a correlation coefficient of -0.9 means that there's a very
03:40strong negative correlation between the number of days after you mailed a
03:44brochure to your customers and the number of customers who come in because of that brochure.
03:49Determining whether two sets of values are correlated is in one sense of
03:52mathematical question,
03:54but you can't take a correlation coefficient of 0.95 to mean that some invisible
03:58factor makes two unrelated data sets change at the same rate.
04:02You must use your common sense and business sense to determine whether the
04:06correlation result is true, and--most importantly--discover why or why not.
Collapse this transcript
Calculating correlations for more than two data sets
00:00In the previous movie, I have showed you how to calculate covariance and
00:03correlation coefficients for two data sets.
00:06Because the functions limit you to comparing two data sets at a time, comparing
00:09every column of data in the collection to every other column can take a long
00:13time, particularly for tables with five or more columns.
00:16If you have multiple data sets, you can use the Analysis ToolPak's Correlation tool
00:21to calculate the correlations between every pair of columns much more quickly
00:25than you could do by hand.
00:26So, for example, let's say that your business captured information about your
00:30customers with their income level, their education level--and in this case the
00:35education level means the number of years of school that they attended--and also
00:39their satisfaction with your company, and their order total.
00:42So our goal in this movie is to look for correlations between each pair of columns,
00:47in other words, between Income and Education, Income and Satisfaction, Income
00:51and Order Total, and so on.
00:53To begin, we will go to the Data tab, and on the right end, we will click Data Analysis.
01:00Data Analysis is part of the Analysis ToolPak.
01:03So if you don't see Data Analysis on the Ribbon then you need to install
01:07the Analysis ToolPak.
01:09I showed you how to do that in a movie early on in this course.
01:12So we will click Data Analysis.
01:14Then in the Data Analysis dialog box, click Correlation, and click OK.
01:21Now we can select or identify our Input Range.
01:25The Input Range is cell A1 through D50.
01:30I'm typing in--rather than selecting--the table columns, because I want to
01:34include my labels in this case.
01:37So I will click Labels in First Row.
01:39That will make a difference in the output that you will see in just a few seconds.
01:43I will select my Output Range.
01:45I will put my output here on the same worksheet.
01:51So F1, identifying the top left corner of the Output Range, and click OK.
01:57When I do, Excel displays the results of the correlation analysis.
02:01I will just open up the columns, so that you can see the entire values.
02:09I do that by double-clicking the right edge of a column header.
02:12That expands the column to display all of the column's contents.
02:16So now I can look at the correlations between each set of columns.
02:20Income is correlated to Income perfectly because it's the same set of values.
02:24Double-click here so we can see the entire names:
02:27Education to Education, Satisfaction to Satisfaction, Order Total to Order Total.
02:32Now we can start looking to see if there are any strong correlations between any
02:35two pairs of columns.
02:37So, for example, Income and Education are correlated weakly at a very low
02:42value, .07--not very significant.
02:45Same thing for Income and Satisfaction.
02:47Income and Order Total are correlated at a .25 level.
02:52But again, anything below .3 is considered to be a weak correlation, between .3
02:57and .7 is moderate, and anything over .7 is seen to be a strong correlation.
03:04In this case, none of our values are particularly high or particularly low.
03:08That is far from 0. So, based on this data, we can determine that there is no strong correlation
03:15between income and education, income and customer satisfaction, income and
03:19order total, and so on.
03:21The Analysis ToolPak's Correlation tool lets you find the correlations between
03:25every pair of columns in an Excel table.
03:27Even if you just want to find the correlation of values in two or three pairs
03:30of columns, using the Correlation tool helps you do so more quickly than you could by hand.
03:35Be sure to check the coefficients for every pair of columns: you might find a
03:39correlation you didn't expect.
Collapse this transcript
Calculating and interpreting chi-square values
00:00One important question you can ask when you analyze your data is whether
00:03the observed frequency distribution of a data set differs from a theoretical distribution.
00:08In this example, we're going to be looking at whether the state a customer is
00:11from affects whether they come to your store or not.
00:14You can find these values by performing what's called a chi-squared test.
00:18The chi-squared test is a non- parametric test, which means that the formulas
00:22don't assume the data you're testing conforms to a known data distribution, such
00:26as the normal, or bell curve, distribution.
00:29Let's say we want to examine the number of customers that came in from three
00:32States--Connecticut, New Jersey, and New York--with a goal discovering whether
00:37the number of customers that come in depends upon the state that they're from.
00:42You can analyze these values using the chi-square distribution.
00:45To do that, you use the CHIDIST function in Excel.
00:50The data that you need to perform a chi- squared analysis is the number of actual
00:55customers that came in from the three States--Connecticut, New York, and New
00:58Jersey--and the expected number.
01:01The expected number is the average of the total.
01:04So in other words here, 143+124 is 267+93 is 360.
01:12If that were divided evenly, then each state would have provided 120 customers.
01:17So we are going to check, using the chi- squared test, whether these numbers--143,
01:22124, and 93--vary significantly from what would be expected.
01:26Now, we can create a formula in cell F1, and that is =CHITEST(.
01:36We have our OrderCount table's Actual column, comma, and our OrderCount
01:41Expected column. Type a right parenthesis.
01:47We see a chi-squared value of .00495.
01:51You interpret the result based on your desired confidence level.
01:55The most commonly used confidence level is .05, which means that you want to be
01:5995% sure that you can reject the statement that a customer's State does not affect
02:05the number of customers that arrive.
02:07Note that the result is much less than .05,
02:11so that means that we can reject the statement that State has no effect.
02:15Now let's say that we change the values a little bit and say we got 125 customers
02:22from Connecticut, 120 customers from New York, and 115 customers from New Jersey.
02:30In that case, the chi-squared value goes all the way up to .8.
02:34That means that we reach the opposite conclusion.
02:37Because the value is greater than .05, we can reject the statement that a
02:41customer's State doesn't affect the number of customers that come in.
02:45In other words, in this data set, it doesn't appear to matter which State a
02:48customer comes from;
02:50we get customers on an about equal level from all three States.
02:54Finally, if we were to change the actual values to the expected values of 120,
02:58120, and 120, we would see that the chi- squared value is one, which indicates that
03:05everything is occurring exactly as expected.
03:08The chi-squared distribution lets you determine whether the frequencies in your
03:11data set differ from the expected counts at a statistically significant level.
03:16This test is particularly versatile because, unlike many of the techniques I
03:20will show you later in this course, you don't need to make any assumptions about
03:23your data's distribution.
Collapse this transcript
5. Using Probability Distributions
Visualizing the normal data distribution
00:00Probably the most familiar graph in any statistics course is that of the normal,
00:05or bell curve, data distribution.
00:07In this movie, I will describe the components of the normal curve and then show
00:11you how to visualize it in Excel by creating an X/Y scatter chart.
00:15In this case, we are analyzing data from the Two Trees Olive Oil Company, and we
00:20are looking at sales for their retail shop.
00:22The average sales per day is 2,000, and the standard deviation is 500.
00:28Earlier on, I created this chart, which I've now rendered as a graphic,
00:33and you can see that most of the values, which are the daily sales will be around 2,000.
00:38As you get closer to 2,500, you will see that the area under the curve increases.
00:44So when you're looking within one standard deviation, which is 500, from the
00:48mean, you'll see that the entire area of the curve, from 2,500 to 1,500, is represented there.
00:55So that means at about 60% of your sales will fall within one standard deviation,
01:00or between 1,500 and 2,500 per day.
01:03Days with sales of $3,500 are rare, as are days with sales of $5,00.
01:09You see how far they are out in the tails of the curve.
01:12Now, I am going to delete this graphic and show you how to create the same thing
01:17using an Excel chart.
01:19So I will press the Delete key and away it goes.
01:22Two things you need to know are the mean and the standard deviation, which we have here.
01:27Now you use an Excel table with three columns to make the calculations that you need.
01:31The first thing you're going to do is calculate the value that is 4 standard
01:36deviations from the mean.
01:38In other words, it's going to be the mean plus four times the standard deviation.
01:44The answer will be 4,000, but let's see how we get there.
01:46First, we create the formula, =A2, which is the number standard deviations, and
01:53then we multiply that value by H1.
01:56Now we are going to be copying this formula down to the rest of the rows in the table,
02:02so we want to make H1 an absolute reference so it doesn't change.
02:05So I will press F4.
02:08And now we need to add the mean, which is in cell F1.
02:11So that's +F1, and I press F4 again to make an absolute reference so it doesn't change.
02:20When I press Enter, Excel copies the formula all the way down through the table.
02:25If I scroll down using my scroll wheel, you'll see that we get down to what is
02:30essentially 0, all the way down at -4 standard deviations.
02:34It's represented that way as -5E-12, which is a very small number.
02:40It's 12 zeros and then a 5.
02:42It's represented that way because of rounding errors in Excel.
02:45The answer is effectively 0.
02:46I am going to scroll back up to the top using my scroll wheel.
02:50Now we need to determine the probability that a specific value would occur.
02:55So in other words, what is the probability, given a mean of 2,000 and a standard
03:00deviation of 500, that we will have daily sales of 4,000?
03:02Well, to do that, we use this formula.
03:06It's =NORMDIST, which stands for normal distribution, a left parenthesis, and then B2.
03:16We are looking for the X, in other words, the value that we are going to compare
03:20to the distribution.
03:21That is value in B2, in this case 4,000.
03:25Then we have the mean, which is 2,000, and that is in cell F1.
03:30We don't want that reference changing,
03:32so I'll press F4, then a comma, and the standard deviation--that's in H1.
03:38Again, I don't want that changing, so I will press F4 to make it an absolute
03:42reference and type a comma.
03:44Now we need to decide whether we want to make the calculation cumulative or not.
03:49Cumulative, if we set this argument to true, would ask the question, what is
03:53the likelihood that the value is greater than the number in B2?
03:59So, any value to the right in the curve of cell B2.
04:03If we make the value false, then what is the probability that the exact value
04:09of 4,000 will occur?
04:11That's what we want.
04:12So I will type in "false".
04:13Type a right parenthesis to finish the formula, press Return, and there we have our values.
04:20So now we need to select the data that we want to put into the chart, and those
04:25are the Value and Probability columns.
04:28So I will select those two table columns, and then on the Insert tab, click
04:35Scatter. And we are creating a scatter chart, not a line chart,
04:39but we are selecting the subtype of scatter chart called Scatter with Smooth
04:43Lines. That compares pairs of values on an X/Y grid, and it draws lines between
04:47them to smooth them out.
04:49So I will click that button, and you see that we have our chart.
04:54The left axis is the individual probability.
04:57In other words, what is the probability of the exact value, 2,000, occurring in
05:02the stated distribution?
05:03The answer is pretty low, as you can see from the values in the vertical axis.
05:09But what you can see is that you have 2,000, 2,001, 2,002, and so on,
05:13so the probability that the daily sales are somewhere around 2,000 is quite high
05:19when you add all of the probabilities together.
05:21Now, let's see what happens when I change the mean or the standard deviation.
05:25I will start by changing the standard deviation.
05:27Let's say that we have a much narrower curve with a standard deviation of 250.
05:34When Excel updates the formula, you'll see that the values between 1,500 and
05:382,500 are much, much more likely, where values out toward the tails are increasingly unlikely.
05:44If I create a chart with a mean of 5,000 and the standard deviation of 2,000, you
05:51will see that the curve is broader.
05:56You might encounter the term "standard normal distribution" or "standard normal curve."
06:01The standard normal curve refers to a normal distribution with a mean of 0 and a
06:06standard deviation of 1.
06:08The normal curve has many useful properties you can use to summarize your data.
06:12Once you know your dataset's mean and standard deviation, you can graph the
06:16curve to get a better feel for how your data is distributed.
Collapse this transcript
Generating random values
00:00Most of the data you use in Excel comes from measurements you've taken from your
00:04company's operations:
00:05sales, salaries, delivery times, and so on.
00:08However, suppose you want to generate random values to test new assumptions
00:12about your business.
00:14For example, if you work for Two Trees, suppose that you have found that your
00:18daily sales averaged $2,000, and that you have a standard deviation of $500.
00:24You can use the techniques I'll show you in this movie to generate random values
00:28under that normal data distribution.
00:31I'm going to start up though, by showing you how to generate random numbers using
00:35the RAND and RANDBETWEEN functions.
00:38RAND generates a number somewhere between zero and one, and the formula is
00:44very straightforward.
00:45You type in =RAND and then a pair of empty parentheses.
00:51The only reason you have to have the parentheses there is because you need to
00:55have them so Excel knows it's a formula.
00:58When you press Enter, Excel generates a random decimal value between zero and one.
01:03Now RAND, and this is also a true of RANDBETWEEN, which we'll get to in a
01:07moment, is what's called a volatile function.
01:10Every time Excel recalculates your workbook, such as when you edit a value,
01:14enter a new value into a cell, and so on, it recalculates the value of the RAND formula.
01:20So if I click on the Formulas tab and go to the Calculation group and click
01:26Calculate Now, you'll see that Excel recalculates the worksheet and displays a new RAND value.
01:31If I do it again, we'll get a new value, and finally, a third time.
01:35The shortcut key for Calculate Now is F9.
01:38So if you'd rather, you can press that and force Excel to recalculate
01:42your entire worksheet.
01:43The next function is called RANDBETWEEN, and this is new in Excel 2007.
01:48The RANDBETWEEN function gives you a discrete data distribution--
01:52that is, whole numbers--between two values that you input yourself.
01:56So, for example, if you type =RANDBETWEEN(, then you
02:03can type the bottom and the top values.
02:05So let's say that we want a random value between one and 10.
02:09It's just 1, 10, right parenthesis, Return, and we get the value of 7.
02:16And you'll notice that Excel updated the RAND function to a new value.
02:21Now let's talk about generating random values under a normal data distribution.
02:26This data distribution, which I have displayed here, has an average or mean of
02:312,000 and a standard deviation of 500.
02:34So I can enter those values, the mean of 2,000 and the standard deviation of
02:41500 in cells B7 and B8.
02:46Now to generate random values using those parameters, I can type equal, and then
02:51we use the norm inverse, which is the normal cumulative distribution inverted--
02:56we use that to generate probabilities-- and then the probability, and this is
03:01where we use the RAND function again.
03:03Probability refers to a number between zero and one.
03:06So we use RAND, and the parentheses, left and right, with nothing in between them,
03:12to generate the random number. Type our mean, which is in cell B7, and the
03:19standard deviation, which is in cell B8. So it's B7, B8,
03:24and then a right parentheses. And when I press Return, Excel generates a value of
03:301,794.85, and that value occurs here in the thicker part of the curve.
03:38If I click Calculate Now again to recalculate the worksheet, Excel generates
03:421,542, which is about here, again under the thicker part of the curve.
03:48Calculate Now, 1,487. We're getting smaller, but we're still in the thick part of the curve.
03:53And I will just do one or two more.
03:541,100, that's getting much further down. And one last one, 2,200, which is again
04:01within one standard deviation, and about here on the curve.
04:04So you can see that values beyond a couple of standard deviations are extremely rare.
04:10It's possible but somewhat cumbersome to model business processes in Excel using
04:14just worksheet data.
04:16If you want to create more useful models, you should learn how to implement them
04:19using Visual Basic for Applications, or VBA.
04:22You can learn more about VBA by visiting in my course "Excel: VBA in Depth," which
04:27is available in the Online Training library.
Collapse this transcript
Calculating binomial probabilities
00:00When you calculate probabilities in Excel, you need to know exactly what type
00:03of results to expect.
00:05In the previous movie, I showed you how to calculate random values under the
00:09normal data distribution.
00:11In this movie, I'm going to show you how to analyze trials that have
00:15two possible outcomes: true or false.
00:18In other words, a customer either will or will not sign up for your mailing list.
00:22In this movie, I'll show you how to calculate the probability for those outcomes,
00:26which are described by the binomial distribution.
00:30To calculate the binomial probability, you need to know four things:
00:34the probability of success, which we have here in cell B1, the number of trials--
00:40the number of times you attempt the task--in cell B2, the number of successes,
00:45which I have in this table in the left column--so I have 0 through 20--and then
00:50you need to know whether you want the specific probability of an individual
00:55outcome happening, or the probability of an outcome or less occurring.
01:00So I'll show you how that works once we get into the formulas.
01:03So I'll click cell B5, and what I want to know is the percentage of a particular
01:09value or more occurring.
01:12To create the formula, I type equal in cell B5, and then type binomialdist or "binomdist";
01:21this is short for the Binomial Distribution function.
01:24Type a left parenthesis, and then we need to know the number of successes
01:27we're interested in.
01:29That is in cell A5, so I'll type the reference for A5.
01:33It's outlined in blue to indicate I got it right.
01:35Then the number of trials are in cell B2, and I don't want that cell reference
01:40changing when Excel copies this formula down to the remaining rows of the table,
01:45so I will press F4 to make it an absolute reference. Comma. And now we need to
01:50know the probability of an individual success.
01:53That is 35% in cell B1.
01:55So I'll type that reference and again make it an absolute reference. Type a comma.
02:01Now, I need to decide if I want it to be cumulative or what's called the
02:04Probability Density function or the point probability of an individual value occurring.
02:10In this case, I want the individual value,
02:12so I will type false. Type a right parenthesis to close the formula. Everything
02:17looks right. Press Return and there we have our percentages.
02:20So as you can see, the percentage of 0 successes is quite low, the probability
02:27of 1 is also below 1%, as is 2, and after that, the probability of a certain
02:32number of successes starts increasing all the way to 7, which is our most common
02:37or expected value. Because we have 20 trials and a 35% probability of success,
02:43when you multiply those two values together, you get 7,
02:46so it's no surprise that 7 is the most common value. And the percentages
02:50gradually decrease as you go past 7 to 8, through 10, and so on.
02:55Now, let's see what the Cumulative Probability function would look like.
02:59I will edit the formula in cell B5, and I will make two changes:
03:05first, I will change cumulative from false to true, and then I'm going to
03:10subtract this value from 1.
03:12In so doing, I will calculate the probability that a given number or more of
03:17successes will occur.
03:18So if I type 1 minus binomial distribution of the arguments, everything looks good.
03:24Press Enter and I get my results.
03:26Now, due to a rounding error, this should be 100%, but it's one of the vagaries
03:31of floating-point arithmetic inside of a computer.
03:34You can see that the probability of one or more is extremely high--well over
03:3999%--two or more is well over 98%, and so on.
03:44So how do you use this information once you've generated it?
03:47Well, if one of your employees consistently gets more than 7 of your customers
03:52to sign up per hour on average for your mailing list, then you can ask them what
03:55they're doing, and use their response as a best practice, and give that
03:59information to your other employees.
Collapse this transcript
Calculating probabilities using the exponential distribution
00:00In the previous movie, I showed you how to calculate probabilities in the
00:03binomial distribution, where processes have two potential outcomes:
00:07either true or false, failure/success, and so on.
00:11In this movie, I am going to talk about the exponential data distribution, which
00:15describes the amount of time that will elapse between two independent events
00:19that occur according to a known average interval.
00:22For example, your company might receive an order via your web site every
00:26five minutes on average.
00:27You can use that information to discover how likely it is that an order will
00:32occur after a given time.
00:34To do that, you use the Exponential Distribution function, and I will
00:38create that in cell A4.
00:41So to find the probability of a value occurring under the exponential
00:46distribution, we type equal and use the Exponential Distribution function, which
00:50is EXPONDIST, and then a left parenthesis.
00:55Now, we need to insert three values.
00:57The first one is X, and X is the number of minutes that we want to examine.
01:02In this case, that value is in cell B4.
01:06Type comma. And now we have lambda.
01:09Lambda in this formula is the inverse of the average, which we find in cell E3.
01:16So we type "1/E3", and we want the E3 cell reference to be an absolute reference
01:25that doesn't change,
01:26so we'll press F4, and that makes the reference absolute.
01:30Type a comma. And now we need to know whether we want a cumulative value or a point value--
01:35in other words, the probability of say six minutes or less occurring, or the
01:40probability of exactly six minutes occurring.
01:43In this case, we want six minutes or less, so we can type "true". Type a
01:49right parenthesis. Everything looks right. Press Return and Excel gives
01:54us our probabilities.
01:56So the probability of one minute or less is about 18%, the probability of two or less is 32%, and so on.
02:04So now how do we use this information, and why in the world did I structure my
02:08table this way, because normally you would expect to see minutes on the left, and
02:12the probabilities on the right.
02:13Well, I did that, because what I want to do is use the exponential distribution
02:18to create a series of random arrival times.
02:20In other words, I want to use random number functions to determine the arrival
02:25times for the next 10 orders.
02:28To do that, you use a formula that includes VLOOKUP, which is a lookup formula,
02:35and I'll walk you through it as I go along.
02:37We have an equal, and then we have to use IFERROR.
02:43The IFERROR function returns a value that you specify if a formula returns an error.
02:48The reason we have to use it in this case is that if the VLOOKUP formula tries
02:53to look up a value of less than 0.18, then it will generate an error. And we
02:58want data, not errors, in our worksheet,
03:00so I need to tell Excel what to do by using the IFERROR Function.
03:04So I have iferror to start. Type a left parenthesis, and now we can create
03:09the VLOOKUP formula.
03:10So, VLOOKUP, which is a vertical Lookup, and it will use a value in the first
03:16column of our data table to find a value in the second column of the data table.
03:21So we have (VLOOKUP(RAND and open and close parentheses, which is going to
03:27generate a random number, comma, TimeTable, which is the name of the table that
03:34contains the probability and the minutes, column 2, which is where we want the
03:41value to be returned from.
03:42In other words, Excel using the VLOOKUP formula always looks up the values in
03:46the left column, and it returns values in the column that you specify--
03:51in this case, we want it to be the second column. And then we need to tell Excel
03:55whether we want an exact match or approximate match.
03:58Because we are looking at values that are either less than or greater than other
04:01values, we need to use an approximate match, and that is true.
04:07Now, right parenthesis. But I'm not quite done because I need to tell Excel
04:10what to do if it encounters an error--
04:13in other words, if the RAND function within the VLOOKUP formula generates the
04:17value of less than 0.18.
04:18Well, if that's the case, we want the value to be 1 because that is the number of minutes.
04:23So type a comma, 1, a right parenthesis to close the formula. Press Enter and we get 1.
04:30Now, I'm going to copy this formula down, so that it covers those values, and
04:36you see that we have our arrival time: 7, 7, 1, 2, 4, and so on.
04:41If I recalculate the worksheet by going to Formulas and clicking Calculate Now,
04:46we get a different set of arrival times.
04:49Another way you can analyze your data using the exponential distribution is
04:53to see what is the probability of an order happening within the next say four to seven minutes?
04:59So we have our probabilities here, and we know the probabilities for four minutes
05:03and the probability for seven minutes or less.
05:06So that means we can subtract the probability of four minutes from the
05:09probability of seven minutes to find out the probability that an order will occur
05:13within four to seven minutes.
05:15So to do that, type equal. And then the first value we want is the probability of
05:20an order occurring in seven minutes or less.
05:23That's in A10. And we subtract the probability of an order in four minutes or
05:30less. That is in A7.
05:31You can see the two cells highlighted here in blue and green.
05:36When we press Return, Excel give us a probability of about 20%.
05:41The exponential distribution helps you model the times between events, such as
05:45customers placing orders on your web site.
05:48You should track these important statistics consistently.
05:51As your business grows, you'll probably find that the average time between
05:54orders decreases, and you'll need more help to process them.
Collapse this transcript
Calculating values using the Poisson distribution
00:00Most companies have web sites, and most of them allow their customers to place orders.
00:05You can use the exponential data distribution to describe the amount of
00:07time between two orders.
00:09In this movie, I'll show you how to use the Poisson distribution, which was
00:13discovered by a gentleman name Poisson, to describe the number of events that
00:17will occur within a given time, such as orders per hour.
00:20So in this example, I'm assuming that my web site receives an average of
00:25six orders per hour.
00:27So my goal is to create a table of the probability that I will receive 0 orders,
00:321 order, 2 orders, and so on every hour.
00:35I can then use those values to create a timeline showing me how many orders that
00:41I will receive on a random basis just over the course of 10 hours.
00:45To calculate the Poisson distribution of 0--in other words the probability of
00:49receiving 0 orders per hour--I create this formula, clicking in cell A2:
00:56=P-O-I-S-S-O-N, Poisson, and x.
00:59x is the number that I want to examine, and that is in B2.
01:05The mean is the average, and that is in cell E1. And I want to make this
01:11reference absolute, so it doesn't change when Excel copies the formula down to
01:14the rest of the rows in the table.
01:16So I will press F4 to make that an absolute reference. Type a comma. And then
01:21the question is whether I want this to be cumulative or not.
01:24In other words, do I want to know the probability of getting exactly 0, or do I
01:28want to know the probability of getting 0 or less?
01:32In this case, I want to know the probability of 0 or less, so cumulative is true.
01:39Close with a right parenthesis, press Return, and Excel fills in my table.
01:45So you see that the probability of getting 0 orders is quite low. Then I get up
01:49to 6% of the time I'll get 4 orders, 15% of the time I'll get 5 or less, 28% of
01:55the time 6 or less, and so on.
01:57So now the question is, how do I use these values meaningfully in a
02:01business situation?
02:02The answer is that you can model your business processes and get an idea of how
02:07many orders you could receive, so that you can analyze your data and your
02:11business operations.
02:12The first thing we need to do is create a series of random values between 0 and
02:171, and to do that we use the RAND function. Type =RAND() and press Return.
02:28When I do, Excel displays a series of random values in the second table that I have
02:33here in columns D and E.
02:35Now I don't want these value changing every time I recalculate the worksheet,
02:38and they would because rand is a volatile function that updates its value every
02:43time the worksheet changes,
02:45so I will select the entire column, and then click Copy.
02:50I could also press Ctrl+C. And then on the Clipboard group of the Home tab of the
02:56Ribbon click the Paste button's down arrow, and click Paste Values.
03:01When I do, Excel replaces the formula with the result of the formula.
03:05Now I can create a series of VLOOKUP formulas that find the corresponding
03:09probability--say, for example, 0.16-- over here in the Probability table.
03:16So the value 0.16 would return 4 because the value of 0.28 is larger than 0.16,
03:24and the value of 0.15 is smaller, so that means that we will have the
03:28probability of four events during that hour.
03:32So the formula that I create is =VLOOKUP(, and now the Lookup value. The value
03:41I'm going to be looking up is in cell D4.
03:45So type that in. Type a comma.
03:47Now we have the table_array, and that's the name of the table that contains the
03:51values. And that is Poissontable. Just type that in. And then we have the
03:59column index number.
04:00The column index number is the number of the column, counting from left to right,
04:04that will return the value for this formula.
04:07In this case, I want the values in the second column, column B:
04:11the number of events during a given time period, so I will type in 2. And then
04:16we need to put in whether we want an approximate match or an exact match.
04:20In this case, we are allowing an approximate match because it's extremely
04:24unlikely that any of these values will match up to any of these values, so we
04:28are looking for less than or equal to.
04:31That means we can use an approximate match, and I will press Tab to accept TRUE.
04:36Everything looks good. Type right parenthesis to close, and we receive a series
04:42of values that indicate the number of orders that will occur within the next
04:46time period--in this case an hour.
04:48Like the exponential distribution the Poisson distribution lets you model your
04:52company's operations by creating random numbers of orders per hour based on the
04:56known or assumed average.
04:57Also like the exponential distribution, you can use the Poisson
05:00distribution most effectively if you know how to build programs using
05:03Visual Basic for applications.
Collapse this transcript
Calculating dependent trials using the hypergeometric distribution
00:00If you've ever read anything about statistics, or math in general, you might have
00:04encountered problems that tell you you have a vase with 15 marbles, 10 of which
00:08are red and 5 of which are blue, and asks you to calculate the odds of selecting
00:124 marbles, and picking exactly 2 red and 2 blue.
00:16Although these situations are somewhat rare, business analysts do occasionally
00:20need to perform this type of calculation, which is described by the
00:24hypergeometric distribution.
00:26Now, let's suppose your company wants to give away cash prizes in a drawing,
00:30which is structured as follows.
00:32You have 100 envelopes in a barrel. 20 of the envelopes contain $500, and the
00:37other 80 contain $20.
00:39A lucky customer can draw 10 envelopes, and keep whatever money they find.
00:42What you want to know are the odds of a customer drawing no $500 envelopes, one
00:47$500 envelope, and so on.
00:50To calculate the probability of drawing a specific number of target items--in
00:54this case $500 envelopes--from a collection, you need to know four things:
00:58The specific number of items drawn-- that means the items drawn of 10.
01:04As I stated earlier, you're allowing the customer to draw 10 envelopes from the barrel.
01:08Second, you need to know the number of target items in the collection, and
01:12that is the number of $500 envelopes, and that is 20.
01:16Then you need to know the total number of envelopes that are available--that is 100.
01:20Finally, you need to know the number of $500 draws.
01:25In other words, you want to calculate the probability that the customer will
01:29draw zero $500 envelopes, one $500 envelope, two, and so on.
01:34I'll show you some more things that you can do with those numbers toward the end of this movie.
01:39But for now, let's just find the probability of your lucky customer drawing
01:43zero $500 envelopes.
01:45I'll click in cell E2, and type "=hyp".
01:51The bottom function after I type "hyp" is hypergeometric distribution.
01:56I won't try to pronounce the abbreviated form.
02:00But when that is highlighted, and I press the Down Arrow key to highlight it,
02:03I'll press Tab, and we have the function.
02:06Now, I can start adding in the arguments.
02:08So we have the number of successes.
02:09That's the first argument, and that value is in cell A2,
02:13in other words, the number of $500 envelops they draw. Then a comma.
02:17Next is the number of items that are drawn.
02:20That's in cell B2. The number of successes that are available in the population.
02:24In this case, that means the number of $500 envelopes.
02:27So, we can put that in C2.
02:30Then the number of items in the entire population, including both successes
02:33and what we're determining failures or $20 envelopes, and that is 100, and
02:37that is in cell D2.
02:39Press the right parenthesis, press Return, and Excel fills the formula down to
02:44the rest of the cells in this table column.
02:47So we can see that the probability of drawing zero envelopes with $500 in them
02:52is about 9.5%, the probability of drawing one is 26, almost 27%, the probability of
02:58drawing two is about 32%, and so on.
03:02Now to do further calculation, we need to determine what each prize
03:05drawing would be worth.
03:06So, for example, if a customer draws zero $500 envelopes, how much money would they get?
03:12The way that we calculate that is we type equal, and then the number of
03:16$500 draws times 500.
03:19So that would be (A2*500)+(10-a2)*20.
03:36So in other words, we are taking the number of $500 draws from cell A2, and
03:40multiplying it by 500, and we're adding 10 minus that number, and multiplying it by 20.
03:46Press Return, and we see that the value for each of the prizes goes up by $480.
03:51When we replace a $20 envelope with a $500 envelope, the difference between
03:56those two values is 480.
03:57So these values make sense.
03:59We can see that if you draw one $500 envelope, you get 680, 1160 if you draw
04:05two, and so on, all the way up to--if you get incredibly lucky, run between the
04:09raindrops and draw ten $500 envelopes, you would win $5,000, which is the
04:14maximum price available.
04:16Now, we can calculate what's called the expected value.
04:19In other words, you multiply the probability of a particular outcome by the
04:23value or prize for that outcome.
04:26After I calculate the expected value for each one of the possibilities, then
04:30I can add all those expected values together to get the total expected value for the drawing.
04:35So to do that, I will just multiply the probability by the prize.
04:39So that's =E2*F2. Press Return and we get our expected value for each of
04:49these possibilities.
04:50Now, note that the values are small enough for drawing nine and drawing ten that
04:55the expected value is displayed as 0.
04:57But if I increase the number of decimal points for the entire column, you see
05:03that if you look far enough that there is an expected value, but it's about one
05:08ten-thousandth of a cent.
05:10Now what I can do is add a total row to the table.
05:13To do that, click any cell in the table.
05:15Then on the Design contextual tab, in the Table Style Options group, click the
05:20Total Row check box. We get the sum.
05:24We see that the expected value is $1,160.
05:28The hypergeometric distribution doesn't come up very often in business, but if
05:32you want to structure a prize drawing of the type I described, you'll know how
05:35to figure out the odds and the expected value.
05:38Check your local laws, make sure this sort of thing is legal, and be sure to set
05:41aside enough money to pay the maximum prize in case someone gets really lucky.
Collapse this transcript
Calculating probabilities, permutations, and combinations
00:00If you studied probability in school, you've most likely calculated odds, and
00:04found the number of ways you can combine various items into groups.
00:07In this movie, I'll show you how to calculate those probabilities, and when to
00:11use which technique.
00:12To calculate the probability that something will happen, you need to divide the
00:16number of desired outcomes by the total number of possible outcomes.
00:20For the purposes of this exercise, we are assuming that all the
00:25probabilities are equally likely.
00:27For example, if you flip a coin--at least in theory--it is equally likely to
00:31come up heads or tails, 50% either way.
00:34But let's see how we calculate that value.
00:35Well first, we need to know the number of possible outcomes.
00:39So in cell C4, I will type 2, because we have heads and tails as possible outcomes.
00:45Then the number of desired outcomes, which is the probability that a coin
00:49will land heads up,
00:50there is only one way we can succeed, so type 1.
00:54Then we can divide the number of desired outcomes, which is in cell C5, by the
01:00total number of possible outcomes, which is in C4, press Return, and we get a
01:06probability of 0.5, which is correct.
01:08When you flip a coin, it can land either heads or tails and the probability
01:12of either one is 0.5.
01:13Now, let's say that you want to calculate the probability that a coin will land
01:17heads up twice in a row.
01:19To do that, you need to multiply the probability of it happening once by the
01:23probability of it happening once again.
01:25So, for example, what is the probability of a coin landing heads up? Well, it's 0.5.
01:30We knew that from above.
01:31What's the probability that it happens again?
01:34These are independent trials.
01:36The coin has absolutely no memory that it landed heads up, so it is 0.5.
01:42To calculate the total probability, you multiply those two values together.
01:46So it's =C9*C10. Return. And you get a total probability of a coin landing heads
01:54up twice in a row of 0.25, or 1 out of 4.
01:58If you visualize the situation, it makes sense.
02:01If you flip two coins, there are four ways that they can come up:
02:04either two heads, which is the success that we're looking for, or they can come
02:09up heads and tails, tails and heads, or two tails.
02:13So you have one success, three failures, and one out of four is 25%.
02:19Now, let's say that you want something or something else to happen.
02:24In other words, if you roll a standard 6-sided die, you want to know the
02:27probability of a 1 or a 2.
02:30To calculate that, you need to know the probability of a 1 coming up, and
02:33that is 1 out of 6.
02:35So, in cell F4, I'll just type =1/6.
02:40Then there is the probability of a 2.
02:42That's exactly the same.
02:43You only have one 2 on a 6-sided die, so the probability is =1/6. The total
02:51probability of getting one thing or another thing where you have two possible
02:54successes is to add the probabilities together.
02:58So you have =F4+F5. Return. And you see you get a total probability of 0.33 or
03:07one-third, which makes sense, because 2 out of the 6 sides of the die will be
03:12considered successes,
03:14so that means you have 2/6, which is one-third, which is what we see here.
03:18The final example I use when it comes to analyzing this type of probability is
03:22the probability that something won't happen.
03:25To do that, you subtract the probabilities of something happening from one.
03:30So, for example, let's say that I don't want the die to come up 1 or 4.
03:33Well, what's the probability of a 1?
03:36Well, that's 1/6th, because it's 1 of 6 possible values. So it's =1/6.
03:42The probability of a 4 coming up is exactly the same.
03:44There is only one 4 available on a 6-sided die. So that's =1/6.
03:49Then the total probability is =1-(F9+F10).
03:59So remember, these are the probabilities that I don't want,
04:03so I subtract them from 1.
04:05When I press Return, we get 0.666667, which is equivalent to two-thirds, which make sense.
04:12I don't want a 1 or a 4 to come up.
04:14So that means that there are four possible ways to succeed rolling any number
04:18other than a 1 or a 4.
04:20That means you divide 4 by 6, which is two-thirds, which is the value that
04:25you see in cell F11.
04:26Now, let's look at a different type of probability calculation, and those are on
04:30the second worksheet.
04:33These are combinations and permutations.
04:36In business, suppose you create random grab bags of items that customers can purchase.
04:40If there are 15 different items divided into 4 item bags, how many different
04:44combinations are there?
04:46In this case, we're not worried about order.
04:48In other words, if you reach into a bag, and you pull out DVDs, you don't care
04:52of the order that they come out.
04:53You just want to know what you got.
04:55So in this case, we're going to try to figure out how many different
04:57combinations there are of 15 DVDs when there are four items per bag.
05:02So the information we need is 15, which I will type in cell C5.
05:08That's the total number of different DVDs that are available.
05:11Then we have the number of items per bag, which is 4, which I type into cell C6.
05:16Then we have the total combinations.
05:18To calculate the number of combinations, we type Equal, and then we use the
05:22combinations or COMBIN function, left parenthesis, and then we have the number
05:28which is the number of different DVDs that are available.
05:31So type C5, which is that cell reference, a comma, and then we have the number of
05:36items per bag, which is 4, and that is in cell C6.
05:41Type a right parenthesis. Everything looks good. Return.
05:44We see that we get total combinations of 1,365.
05:48Permutations are like combinations, but in this case order matters.
05:53So, for example, suppose you have eight people working in a tradeshow, and you want
05:57to assign one person to each of four shifts in your booth.
06:00How many different booth shift orders are possible?
06:03Well, to do that, you need to use a permutations formula.
06:07So first, I need to put in my total number of individuals, and that is 8, and
06:13then the total number of shifts, which is 4. Press Enter.
06:17Now, I can create my permutations formula.
06:19That is =PERMUT(, and then we use the same arguments as we did for the
06:27combinations formula.
06:29So we have the number, which is the number of individuals, and that is in cell
06:32C11. Type a comma, and then we have the number chosen.
06:36In this case, that is the number of shifts, and that is in C12. Type a right
06:41parenthesis. Everything looks good.
06:44We see that we have a total number of different orders of 1,680.
06:48You can notice that even though we had 15 DVDs to choose from and 4 items per
06:53bag, the total number of combinations is less than the number of permutations
06:58when you select 4, but only from a group of 8 individuals.
07:03When order matters, the number of possibilities expands rapidly.
07:07Now finally, a word of warning about probabilities, and I'll switch back to the
07:11Probabilities worksheet, so you can see what I'm referring to.
07:14You must avoid what's called the gambler's fallacy, which is thinking that an
07:18outcome is due because it hasn't happened recently.
07:21If you flip a fair coin, and it comes up heads three times in a row, the
07:24probability that it will come up heads the next time you flip it is still 50%, not 6.67%.
07:32The probability that a coin will come up heads four times in a row before you
07:35make the first flip is 6.67%, or 1 out of 16.
07:40But after it's come up heads three times, you're still even money to get a fourth.
07:45The coin is not due to come up heads.
07:47It's an easy mistake to make, but remember that independent trials are
07:50just that, independent.
Collapse this transcript
6. Sampling Values from a Population
Selecting effective data samples
00:00Whenever you want to select members of a data set as samples for your analysis,
00:04you need a selection strategy that produces usable, unbiased data.
00:08In this movie, I'll tell you how to select values you can use.
00:11The short answer, which I'll elaborate on a bit, is to select your data randomly.
00:16The more random you can make your sample, the more valid your results will be.
00:20So, for example, surveying customers who enter your store during a particular
00:23hour could significant bias your data by focusing on a demographic group that
00:27frequents the stored during those times.
00:29Another way that you could work with your customers and select them randomly
00:33is by selecting, say, every fifth customer, which I will do with my customer list here.
00:38Now I have a series of 10 rows with 10 customer IDs, and what I want to do is to
00:46select every fifth customer. And that means that I want to find any row with a
00:50number that is divisible by five.
00:53That's the easiest way to select every fifth value.
00:56So what I'm going to do is create an IF formula, =IF, and then a left
01:01parenthesis and then within the If formula, I want to find a way to test if the
01:06row number that I'm in is evenly divisible by 5.
01:10If that's the case, then I will treat it as every fifth value.
01:14And the way that you do that is you use MOD.
01:17MOD returns the remainder after a numbers divided by another number.
01:22And what we're going to be testing for is whether that remainder is zero.
01:27So we have MOD, and then I'm looking for row number.
01:29So I'll type a left parenthesis and type in ROW.
01:33That returns the row number of a reference.
01:35In this case, it will be within another set of parentheses, and that will be cell A2.
01:42And I will leave the cell reference relative, so that it will change when I
01:46copy the formula down.
01:47I close the TOW function with a right parenthesis. Type a comma, and now I
01:52can put in my divisor.
01:54And again, I'm working in the MOD function which needs a number, which is the row number.
01:59And then I need the divisor, and that is the number 5, because I want every fifth row.
02:04Type a right parenthesis. And I'm checking to see if the row is evenly divisible by 5--
02:11in other words, that the modulo or modulus is equal to zero.
02:17So I type "=0,", and now I'm back in my IF function so I need
02:23to tell Excel what to do if the value is true, and I'll just have it display the value true.
02:28So, I type "True". Type a comma, and
02:33now what to do if the value is false, and I'll just a display false.
02:37So I put False in double quotes. Everything looks good. Type a right parenthesis
02:42to close the IF function, press Return, and Excel correctly returns the result
02:47that this row number of row 2 is not evenly divisible by 5.
02:51Now, that doesn't mean the formula is working correctly, but if I copy it down
02:55by dragging the fill handle until it covers cell B11, we'll see that row 5 is
03:02True and row 10 is True,
03:03so my formula is working correctly.
03:06If you want to select every fifth customer, that's one way you can do it.
03:09Now technically, my first row in this table is a heading.
03:14So that means that the first customer I select will actually be the fourth customer,
03:18so you have customer 4, customer 9, and so on. But the important point is that
03:23we are selecting every row number that is divisible by five.
03:27So after the first one, which is the forth, every fifth customer after that will be selected.
03:32Now let's say that you wanted a probability that an individual customer would be selected.
03:37So instead of every fifth customer, let's say that you want a 25% probability
03:41that any individual customer would be selected.
03:44You can do that, and I will replace the formulas here.
03:48So I'll type "=IF(", and I'm going to use a RAND value, and
03:54that is a random value between 0 and 1, and it's a decimal value.
03:58So Rand(). And I'm going to check to see if the value is <0.25, because we
04:08want a 25% probability.
04:10Type a comma. And if the value is true, I'll have it display True, so I am going to put
04:15True in double quotes.
04:18And then if is value is false, say False so we have a double quote, False, close
04:23the double quotes. Right parenthesis. Everything looks good. Press Return and
04:29this value happened to come up False this time.
04:31If I drag the fill handle from cell B2 down, you'll see that I had three values
04:38out of 10 that are true.
04:40If I recalculate the worksheet by going to the Formulas tab and clicking
04:44Calculate Now, then you'll see that the random value has changed.
04:50If I want to keep the values that I have right now, I can select them, and then
04:56on the Home tab of the Ribbon, copy--I can also press Ctrl+C to copy--click the
05:01Paste button's down arrow and click Paste Values.
05:04Now, if I were to recalculate the worksheet--and I'll press Escape to undo the
05:09selection here--click Formulas, Calculate Now nothing changes, because these are
05:14no longer formulas they are actually the values.
05:17That's a useful technique you can use to select random values from a relatively
05:21short list, and by short in this case, I mean maybe several hundred values.
05:25When you get lists beyond that, which are very common in business, then it can get
05:30a bit unwieldy to use this technique.
05:32So instead what you can do is have what's called a trigger value, which looks for
05:37a particular instances, such as a customer who orders at least six bottles of
05:42jalapeno-infused olive oil. And then you select the next order. Perhaps you would
05:47give that customer survey,
05:48you would offer that customer a coupon, something like that, but the idea that
05:52I want to get across is that you can use a trigger event to select the next order that occurs.
05:58You never know. There might be a specific group of customers that orders six bottles of jalapeno
06:02olive oil at the same time, and you don't want to focus on customers who have
06:06something in common.
06:07You want to select randomly.
06:09So the formula we're going to create is a little more complicated.
06:12It is a nested IF function.
06:14That means we're going to check for a condition in the IF function, and then if
06:18it isn't true, we're going to create another IF function within it.
06:22So I'll type "=IF(", and now I need to set up my logical test. The cell that I'm
06:29going to check is cell E1, and I will explain why it's E1 and not E2 in a second.
06:35And because I'm creating a nested IF function, I need to test for the opposite
06:38condition of what I actually want.
06:40So in other words, I want to find rows that contain jalapeno, but because the
06:45second IF function needs to go in the value_if_false argument area, I need to
06:50say that E1 is not equal to jalapeno.
06:54So to type not equal, you hold down the Shift key and type a less than sign and
06:59then a greater than sign.
07:01Now, I can indicate the value jalapeno by typing a double quote, because it's a
07:06text string J-A-L-A-P-E-N-O and close the double quote. And now, I can type a
07:13comma. And the reason that I am using cell E1 instead of cell E2 is because I
07:19want my formula to display true for rows where the order previous to it was for
07:25at least six bottles of jalapeno-infused olive oil.
07:28So in other words, I want this row with the garlic 4 values, because that is the
07:35row directly below jalapeno with 6.
07:38So to do that, if I want this cell to be true, I need to check cells E8 and F8.
07:43So I just start doing that here at the top.
07:45So I have jalapeno, and now what to do if the value is true.
07:50In other words, if it's not equal to jalapeno, well, that's not what I want, so I
07:54will have it display False.
07:55To do that, I type a double quote the word False and then close the double
07:59quotes, now a comma. Now I'm in the value with false argument of the IF function.
08:04So I can type another IF, left parenthesis, and I'm checking if the value in cell
08:10F1 is greater than, or equal to, 6. Type a comma.
08:15What do I do if the value is true?
08:17Well, have it display true.
08:18So I type a double quote the word "True" double quote again. Type a comma.
08:23And if the value is false, in other words if it is jalapeno, but it's not
08:28greater than or equal to six bottles, then I have it display False.
08:31So I type a double quote False double quote.
08:35So I have my second IF function with my right parenthesis.
08:38I need another right parenthesis to close the first one. Press Return and I
08:43get my result of False.
08:44Now if I copy the formula down to cover cell G11,
08:48so I'll grab the fill handle at the bottom-right of cell G2, copy the formula
08:53down, and you'll see that row 9-- this order here garlic 4--is correctly
09:01identified because it is the order immediately following an order for jalapeno
09:06with at least 6 bottles.
09:08Selecting sample values at random is harder than it sounds, but be sure
09:12that your selection method doesn't introduce a hidden bias that can skew your results.
09:16Remember, random is best.
Collapse this transcript
Inferring a population mean from a sample mean
00:00Companies often interact with data sets that are difficult to
00:02summarize effectively.
00:04For example, the Two Trees Olive Oil Company might have shipped several hundred
00:07wholesale orders over the past few months,
00:09but you want to analyze the sample of the values, and try to infer information
00:14about the entire order population.
00:16In this movie, I'll show you how to analyze the average value of a sample, and
00:21use that estimate to make judgments about the average of the entire population.
00:24As a note, this technique only works for data sets with 30 or more values,
00:28so you need to have a large sample size.
00:31The first step is to find the mean, or the average, of the selected sample values.
00:36To do that, in cell D3, type the formula: =average--
00:42and average and mean are the same thing. Average and then the name of the
00:46table. On the left side of the screen is Orders, so I type in Ord and then Tab
00:52to accept the formula AutoComplete reference, then a left square bracket, and
00:57the table column we want is Total,
00:59so I just type a T and then press Tab to accept the AutoComplete.
01:03Right square bracket and then a right parenthesis to close the formula, and press
01:08Enter, and we see that the average of the sample values is 4,899.
01:13Next, we need to calculate the standard deviation of the sample.
01:17We sampled 50 values from a list of several thousand orders,
01:20so we will use this formula. It's =stdev,
01:25which is standard deviation for a sample, left parenthesis, and then we want the
01:30standard deviation of the values in this table.
01:34So I'll click the Totals header when my mouse pointer is a downward-
01:37pointing black arrow.
01:38That will select the entire column. And right parenthesis, that ends that
01:43portion the equation.
01:44Now divide it by the square root, and that is sqrt.
01:48That's the square root function in Excel. Left parenthesis. And then what we
01:53want is the count of the orders in this column.
01:57So square root, then we use the COUNT function. Another left parenthesis. And we
02:02are counting the number of entries in the Orders table's Totals column, so
02:07select the column again, and now a right parenthesis to close that portion of
02:14the equation that was the COUNT function.
02:17Now another right parenthesis, and that closes the square root function.
02:21So we have the standard deviation divided by the square root of the count of
02:24the number of orders in that table column, press Enter, and we get the value of 254.896.
02:30Now we can apply the central limit theorem to our data.
02:33The central limit theorem states that when you select simple random samples of
02:37size n from a population, the sampling distribution of the sample mean can be
02:43approximated by a normal probability distribution as the sample size becomes
02:47large--and by large, we mean 30 or more entries.
02:51Stated more simply, what that means is that when you have a sample size of 30
02:55or more values, you can assume that the entire data distribution is
02:59distributed normally.
03:01That is what makes it possible to use these techniques.
03:04So now we know the mean and the standard deviation.
03:07So we could ask questions about the data.
03:09For example, how likely is it that we got a sample of size n--in this case
03:13size 50--with a sample mean between 4,700 and 5,200, if in fact the population mean
03:20is actually some other value.
03:23To find that out, you can calculate the probability that a value is less than
03:27the upper bound, in this case 5,200, by creating this formula in cell D6.
03:33So you type "=normdist", and this is the Normal Distribution function, so that's
03:40why we need to be able to assume that this data is distributed normally, value x,
03:45which is 5200, the mean, which is in cell D3, comma, then the standard deviation,
03:54which is in cell D4, comma, and then finally, we need to indicate whether we want
03:59it to be cumulative or true.
04:02Cumulative would find the probability that the population mean is any value less than 1,500.
04:08If we were to say false, in other words that we wanted the probability density
04:12function or the individual value function, then that means we would be
04:16calculating the probability that the population mean is exactly 5,200.
04:20That's extremely unlikely.
04:22It would be a very low probability.
04:23So instead, what we want to do is discover if the population mean is less than
04:285,200, or the probability that it is.
04:31So we can type "True", right parenthesis to close, press Return, and we get a
04:38probability of about .88,
04:41so about an 88% probability that the population mean is below 5,200.
04:45Now let's do the same thing to discover the probability that the population
04:49mean is less than 4700.
04:52To do that, type a similar formula. It's =normdist(4700.
05:00The mean is in cell D3, comma, standard deviation, D4, comma and again True. So we
05:06get the cumulative function or probability, in other words, the probability
05:11that the mean is less than 4,700. Right parenthesis. Return. And that
05:15probability is about 211/2 %.
05:19If we want to find the probability that the population mean is between 4,700
05:23and 5,200, we can subtract the lower bound, in this case 4,700, from the upper
05:28bound, which is 5,200.
05:30To do that, we create the formula =D6-D7. Press Return and we get a
05:38probability of about two-thirds, about 66.4% that the population mean is
05:43between 4,700 and 5,200.
05:46In this case, it is about two- thirds likely that the true mean value of
05:50wholesale orders is between $4,750 and $5,200.
05:54If you were to change the values in the formulas, you could determine the
05:57probability that the mean is between 4,500 and 5,400, and so on.
06:02Once you know those probabilities, you can use your domain knowledge to assess
06:06your information and make decisions about your business.
Collapse this transcript
Inferring population distributions based on a proportion of the data set
00:00In the previous movie, I showed you how to estimate the mean of population data
00:04based on the mean of sample values taken from that population.
00:07In this movie, I'll show you how to infer some other characteristics of
00:10the population's data distribution, such as the population proportion,
00:15given a similar sample.
00:17Our goal is to find the expected value and standard deviation of all possible
00:21samples of a given size from the population.
00:24In this case, I'll use a sample size of 50, which is taken from a population of 3,000 values.
00:30The business case, or business scenario, is that we know there was 25% chance that
00:36any individual visitor to your web site would be offered a coupon.
00:40So the expected value is 0.25.
00:44So in other words, if I were to visit the Two Trees Olive Oil Company's web
00:46site, there is 25% chance that I would be offered that coupon on this visit.
00:53I might see it again, but that's the probability per visit.
00:57First thing we need to do is calculate the standard deviation.
01:02To do that, we can use the number of sample values and the expected value.
01:06And we can use the expected value instead of taking the mean from another data
01:11set, because we know that the average should be 25%, so we can just use that.
01:16So the formula for the standard deviation is
01:18=sqrt(, the value in B4, which is the expected value, multiplied by,
01:30the left parenthesis, 1-B4, right parenthesis, divided by the value in B3,
01:37which is the number of sample values.
01:39Type a right parenthesis to close the formula, press Enter, and we get the
01:45standard deviation of 0.06. And that make sense;
01:49our expected value is .25, so it makes sense that our standard deviation would
01:54be a lower value of 0.06.
01:57Now we can calculate the probability that the number of customers who were
02:00actually offered the coupon would be 28% or less.
02:04To do that, we click in cell B8. Equal sign. And because our data, we're assuming is
02:11normally distributed, we can use the NORMDIST function. Left parenthesis.
02:18And then the value we are testing is x, and that is 0.28.
02:23The mean is the expected value. That's in cell B4.
02:27The standard deviation is in B6.
02:30We just calculated that. And then we need to decide whether we want the
02:34cumulative or the point estimate for 28%. In this case, we want 28% or less, so
02:40we are interested in the cumulative probability.
02:43So I'll type "True". Right parenthesis to close the formula, Return, and we see
02:49that the probability is 0.68.
02:51So in other words, it's about 70% likely that 28% or fewer customers would have
02:57been offered the coupon.
02:58Now we can calculate the same probability for 22% or less.
03:03To do that, we create a similar formula. It's =NORMDIST, and the value we want to test is .22.
03:12The mean is the expected value in cell B4, standard deviation is in cell B6, and
03:19we are interested in the cumulative probability, so it's "True". Right parenthesis,
03:24press Return, and we get the probability of 0.31.
03:28So now we can create a formula in cell B11, which calculates the probability that
03:34between 22% and 28% of our customers would've been offered the coupon.
03:39So to do that, we just subtract the 22% or less area from the 28% or less area.
03:47So its =B8-B9. Press Return and we get the value of 37%.
03:56Based on this analysis, we can see that it's about 37% likely that between 22%
04:01and 28% of our customers will have been offered the coupon.
04:05If we want to take that range higher or lower, we can change the expected value.
Collapse this transcript
Estimating the size of a population given numbered samples
00:00Businesses are always trying to get information on their competitors, whether by
00:04shopping their stores or using public information to make guesses about their
00:07performance and intentions.
00:09In this movie, I'll show you how to estimate the size of a population if you
00:12have samples of a sequentially numbered item, such as coupons distributed to
00:17a store's customers.
00:18So when might you want to estimate the size of a population?
00:21Well, in a competitor intelligence situation, when you discover that
00:25your competitor is giving out coupons-- and you can determine which one in
00:29the sequence you get--
00:30you can use that information to estimate the total number of coupons handed out.
00:34You could do the same thing for contests, entries, raffle tickets, and so on.
00:37So now let's say that in this situation your competitor is giving out coupons--
00:42and just for the sake of this demonstration, I will let you know the actual size is 1500--
00:48let's say that you have five coupons that you found, and you've identified a
00:53number that you know represents each coupon within the sequence.
00:57In other words, they have a number from 1 to 1,500.
01:01So what you want to do is find a formula that will somehow estimate the number
01:05of coupons in the group.
01:06So let's say that you have one row of data--coupon 1 through coupon 5--and the
01:13estimator is the formula that I'm going to create in a moment.
01:16So let's say that each coupon will be numbered between 1 and 1,500, so I'll type
01:21"=RANDBETWEEN" and press Tab to accept the formal AutoComplete, and then in the
01:28formula I will have the lowest number of 1 and the highest number 1,500, and
01:34type a right parenthesis to close the formula.
01:37Now when I press Enter, Excel is going to fill this formula down, throughout the
01:42rest of the column, and I do that just so I can show you the average of all of
01:46the estimates at the end of the exercise.
01:49We're actually only going to be concerned, as I'm talking, with the first row.
01:54So we have RANDBETWEEN 1 and 1,500. Press Enter and there we have it, and I
01:58will create the same formula =RANDBETWEEN(1, 1500), and the same. And I'm not copying it.
02:10In other words, I am not dragging the fill handle from cell A6 to B6 to C6,
02:15because if I did that, then Excel would not populate the rest of the table
02:19column with the formula.
02:21You have to type it in to get it to do it properly.
02:23So I'll type "=RANDBETWEEN(1,1500)" and the final one "=RANDBETWEEN(1,1500)",
02:37and there we have it.
02:38So these are the values that I will be working with. And I'm going to select
02:43these table columns, and I did that by moving my mouse pointer over the Coupon1--
02:49that's the leftmost table column header-- and then when the mouse pointer changes
02:53to a downward-pointing black arrow, clicking, and then I held down and dragged
02:57to select columns 1 through 5. So we are there.
03:01Now, on the Home tab, I'll copy the data.
03:03You could also press Ctrl+C to copy. And then on the Paste button list, I'll
03:09click Paste Values, and now instead of the formulas, I have just the individual values.
03:14So I'll press Escape to release the selection, and now I can start thinking
03:19about creating the formula.
03:21There is no built-in formula in Excel that you can use to estimate a population
03:26given the series of numbered samples.
03:27However, there is a formula that you can use that was developed during World War
03:30II that allows you to do it.
03:33The first thing you need to know is the number of coupons. So for that we have 5
03:38coupons, 1 through 5, so I will just type in 5.
03:41If you didn't know how many coupons were available to you or how many coupons
03:46that you had--at least not in advance-- then you could do a COUNT function or
03:50something, but in this case it's straightforward, so I'll just put in the value.
03:54Now that you have that value, you can create your formula.
03:57So I'll go to cell F6 and type "=((" and then "H2"--
04:05that's the number of coupons--and when Excel copies the formula down, I don't
04:09want that cell reference changing, so I'll press F4 to make it an absolute
04:14reference, and then "+1".
04:17So in other words, I am taking the number of coupons+1.
04:19Then we have the right parenthesis, and I will divide it by the number of coupons.
04:25Again, I don't want that cell reference changing,
04:27so I will press F4 to make it an absolute reference.
04:31So what I've done on this side of the equation--type in a right parenthesis--is
04:36to divide the number of coupons +1 by the number of coupons.
04:41So in this case, it would be 6/5 or 1.2.
04:44Now the second thing you do is you multiply-- typing an asterisk and then a left parenthesis--
04:51you find the maximum value in your range, so we have max( and then the range of
04:58A6:E6), so that's the maximum value. And then you subtract one. And over time
05:07statisticians have just discovered that this is the most accurate estimator
05:12that they can create.
05:13All I need to do now is type a right parenthesis, and that will allow me to
05:18complete the formula. Pressing Enter. And Excel displays all of the estimations,
05:24given these five coupons.
05:26Now the first row has a number of extremely small numbers.
05:30So you have got 295, 230, and so on.
05:32The highest number, in fact, is 729.
05:36So what that means is that your estimate is going to be low, much lower than
05:391,500. And in fact, it's 873.
05:42If you look at the other rows--just the ones visible on this sheet--right now,
05:46you'll see that many of the estimates are pretty close: 1,536, 1,545, 1,166--not
05:52that close--1,400, 1,744 and so on. But now let's see what happens over time;
05:57in other words, for every value in this table, what is the average of the estimate?
06:02To do that, type "="--and we are in cell H3--"average(", and then we select the
06:10Estimator column of the table, move the mouse pointer until it's a downward-
06:14pointing arrow, over Estimator column header, click it, and Excel places the
06:19reference in the formula. Type a right parenthesis and hit Return, and you'll
06:25see that the average estimate over time is 1,496 and about three quarters, and
06:31that is extremely close to the actual population size of 1,500.
06:36So it's possible that you get a series of bad samples.
06:39For example, if you were working with the data in this first row, you might
06:42think maybe there are only a thousand coupons and I just found a bunch of the
06:46lower numbered ones.
06:47However, if you went to the second row and had that data, then you would have a
06:51fairly accurate estimate.
06:52You would figure that 1,500 to 1,550 was about right.
06:56But again, over time this method of estimation, as you can see from the average
07:01of all the values in the table, is extremely accurate.
07:04Estimating a competitor's customer base is just one of the ways that you can
07:07analyze their performance using statistics.
07:10If you want to avoid having this sort of analysis performed at your expense,
07:13don't issue sequentially numbered coupons.
Collapse this transcript
Calculating a confidence interval for a large sample
00:00If you've collected sample values from a population, you can make inferences
00:03about the population's mean, or arithmetic average, based on those values.
00:08In this movie, I'll show you how to calculate a confidence interval when your
00:12sample set contains 30 or more values.
00:14In the next movie, I'll show you how to calculate a confidence interval when you
00:17have a small sample with less than 30 values.
00:20A confidence interval is what you get when you want to say, "I'm 95% sure that--."
00:27So in other words, you want to have a level of certainty based on your
00:31statistics that allows you to say I am 95% certain that I'm going to see between
00:3645 and 75 customers in my store tomorrow.
00:39So I am going to show you how to calculate the confidence interval using this
00:44Order data table, which has the total of orders. And my goal is to find the range
00:50in which I'm 95% confident that a particular day's orders will fall.
00:55First thing we need to do is find the number of measurements and for that, I'll
01:00just type "=COUNT(", and now I can select this table column.
01:07I move my mouse pointer over the column header, and when it's a downward-pointing
01:12black arrow, I click, and that selects the entire column.
01:15So it's State(New York. Press a right parenthesis to close the formula, Return,
01:22and I see that have 44 measurements.
01:24Now to find the average, or the mean, I can type "=average"--and this formula is
01:29going in cell D4--left parenthesis. And my mouse pointer is over here already,
01:34so I will just select the same table column, press a right parenthesis, press
01:40Return, and I see that the average is 135.14.
01:44Now to calculate the standard deviation--in this case I'm just using the
01:48standard deviation of the sample--
01:49so I'll type "=STDEV"--
01:52that's a standard deviation. For a sample, standard deviation key would be the
01:56formula that I'd use if I had an entire data population, but I don't;
02:00I just have the sample. So, left parenthesis, and select the table column again,
02:05right parenthesis to close, and I have a standard deviation of 65.55.
02:12Next I need to specify my confidence level. And as I stated earlier, I want to
02:17be 95% confident in my calculations,
02:20so I will type in 0.95;
02:23that's equivalent to 95%. Press Return.
02:26The significance level is the confidence level subtracted from 1.
02:31So the formula to calculate that is =1-D7.
02:38Press Return and we get 0.05.
02:41Next, we need to calculate the z-score for our particular confidence Level.
02:45The z-score is the number of standard deviations from the mean that a value occurs.
02:51So in this case I am looking for the 95% confidence Level, so I want to find the
02:56z-score within which 95% of the values will occur, plus or minus the mean.
03:01To calculate that, I type the absolute value of normsinv,
03:09which is the inverse normal distribution probability function, and the
03:14significance level is D8, and I divide that by 2--
03:19again, because the values that I want can be either plus or minus as compared to the mean.
03:24Right parenthesis to close the internal normsinv function, right parenthesis to
03:30close the absolute value function, press Return and I get a value of 1.96.
03:35And as I noted in the previous movie, 1.96 is the number of standard deviations from
03:40the mean, plus or minus that 95% of your values will occur.
03:44Now we can calculate the standard error.
03:46Standard error is calculated using this formula.
03:50It's equal the standard deviation, which is in cell D5, divided by the square root of the
04:00number of measurements, and that is in cell D3. Right parenthesis to close,
04:05press Return, and we get a standard error of 9.88.
04:08Now we can calculate the margin of error.
04:12The margin of error is the z-score multiplied by the standard error.
04:16In other words the standard error is the amount of error expected per
04:20standard deviation.
04:21Our z-score is 1.96, so we don't have one standard deviation.
04:26We need to multiply those two values together.
04:29So we have =D9*D11, press Return, and we get our margin of error of 19.37.
04:40Now we can add or subtract the margin of error from the point estimate, which is
04:44the sample mean, and we find the mean in cell D4,
04:48so it's =D4. Press Return, 135.14, and now we can calculate the lower limit,
04:55which is =D14 minus the margin of error, which is in D12--press Return--and the
05:03upper limit, which is =D14+D12, Return, and we see that the range of our values,
05:12with 95% confidence, will fall within the range of 115.77 and 154.5. Within the
05:20business case, that means with 95% confidence that in order from the customer
05:25who lives in New York will be between $115.77 and $154.50.
05:33You can use this information to make judgments about your customers, perhaps
05:36offering the special deals, or if this range is higher than the average amount of
05:40orders from customers and other states, you can pay more attention to them.
Collapse this transcript
Calculating a confidence interval for a small sample using t-tests
00:00In the previous movie, I showed you how to calculate a confidence interval
00:03for a population when you collected more than 30 sample values.
00:06In this movie, I'll show you how to perform the same type of calculation for a
00:10small sample, which has, by definition, less than 30 values.
00:14Because you're working with a small sample size, you will use the t distribution
00:17to analyze the values.
00:19One of the simplifying assumptions that you make when you work with a small
00:22sample is that your data has a normal distribution.
00:26So the first thing we need to do is calculate the number of measurements, and
00:30that's here in this table.
00:31So in cell D3, I will type "=COUNT(", and then I'll click the
00:39column header for the table column, move the mouse pointer over the column
00:43header, click it, Orders Totals, looks good, right parenthesis to close the
00:48formula, and Return, and we see that we have 12 values.
00:51Now we can calculate the mean.
00:53For that its =AVERAGE, and then select the same table column, Orders Totals,
01:02right parenthesis, Return, and there is the average.
01:07Finally, we can calculate the standard deviation, and we create exactly the
01:10same type of formula, except we use the STDEV function, standard deviation for a sample.
01:16Left parenthesis and once again select the table column, right parenthesis and
01:22Return, and we see our standard deviation of 65.98.
01:27Next, we can type in our confidence Level, and we want to be 95% certain, so I'll type .95.
01:35Our significance level is one minus confidence level,
01:39so we have =1-D7, and Return.
01:44And next--and this is the measure that we didn't have for the large sample,
01:47but when we work with t-scores, we have to work with degrees of freedom.
01:52And degrees of freedom is the number of measurements minus 1. So we have =D3-1.
02:01Finally, we can calculate our target t-score for 95% confidence level with 11 degrees of freedom.
02:08To do that, type = and use the t inverse function, which returns the inverse of
02:14the t distribution, left parenthesis, D8, which is our significance level,
02:18divided by 2, and we are dividing by 2 because we can have a value that's either
02:24higher or lower than the mean.
02:26If it were only higher or only lower, then we would not divide by 2.
02:29But we do, so we need to divide the significance by 2, comma, and degrees of
02:35freedom, which are found in cell D9. Right parenthesis to close the formula,
02:40Return, and we see that our target t-score is 2.59.
02:45Now we can calculate our standard error.
02:47And standard error is calculated using this formula: = standard deviation, which
02:52is D5, divided by the square root of the number of measurements, and that is
03:00found in cell D3. Right parenthesis and press Return to accept the formula.
03:08Now that we know these values, we can calculate our margin of error, which is the
03:11t-score, or the number of standard deviations from the mean that a particular
03:15confidence level is, multiplied by the standard error.
03:19So that is =D10*D12. Press Return and we get our margin of error of 49.39.
03:29Now we can use our point estimate, which is the mean--and that is =D4--to
03:36calculate our upper and lower limits.
03:39To do that, the lower limit is the point estimate, which is in D15, minus the
03:44margin of error, which is in D13, and the upper limit, which is the point estimate
03:51in D15, plus the margin of error, which is in D13.
03:58Press Tab and you see that we have a lower limit of 105.78 and 204.56.
04:04Because we are working with fewer than 30 sample values, the estimate of
04:08the range we're 95% certain the population mean falls within must be more conservative.
04:13In the large sample case, the spread between the upper and lower limits was about $39.
04:18In the small sample case, which is based on the same overall data set, the
04:22spread is almost $100.
Collapse this transcript
Calculating a confidence interval given a population proportion
00:00Unless you're a large, well-funded organization that can afford a staff to
00:03conduct surveys for you, you'll often have to work with the data you and your
00:07existing staff can pull together.
00:09One way you can analyze your data is to calculate a data range, or a confidence
00:13interval, that a given value falls within at a given level of certainty.
00:17In this movie, I'll show you how to calculate a confidence interval at the
00:2095% confidence level.
00:22The data we are working with is a series of orders by state.
00:26And instead of working with the numerical values, I've listed the states from
00:29which the orders come.
00:30We are interested in analyzing the proportion of results of orders that
00:34come from the State of New York, so that's why we are working with a
00:38population proportion.
00:40First, we need to find the number of measurements in our table.
00:44To do that, we create a COUNTA formula. So it's =COUNTA.
00:50COUNTA counts the number of cells in a range that are not empty.
00:52If we used COUNT, it would look for numbers only, and it would return 0.
00:57Type a left parenthesis, and then I will select the column in the table, move
01:02the mouse pointer over the column header, and when it turns to a downward-
01:06pointing black arrow, click, and we get the table reference.
01:11I will type a right parenthesis to accept it, Return, and we see that we have
01:14372 total measurements.
01:16Our target state is New York, but what we want is the number of occurrences of
01:22New York within that table.
01:24To do that, we use a COUNTIF formula.
01:26So it's =COUNTIF( and then the range we're examining, and that is our table
01:34column again, so I will select it. And next, we need to know the state that we
01:39are looking for, and that is in cell D4. It's New York.
01:44Type in a right parenthesis and press Return, and we see that we have an
01:48occurrence of New York of 259, so there are 259 New Yorks within that table column.
01:54Now we can calculate the proportion, and that is simply the number of
01:59occurrences of New York, which are in D5, divided by the total number
02:04of measurements in D3.7. I had already filled in our confidence level
02:09and significance level.
02:10The confidence level is 0.95. That means we want to be 95% confident in our results.
02:16The significance level is 0.05, and that is simply one minus the confidence level.
02:22Now we can calculate our z-score, and the z-score is the number of standard
02:27deviations from the mean, plus or minus that a certain number of values will occur.
02:31So in other words, we need to find the z-score within which 95% of our values will occur.
02:37To do that, type equal, absolute value value, because
02:41I don't want to work with negatives. left parenthesis, n-o-r-m-s-i-n-v. This is the normal inverse
02:49distribution, (1-D9/2. And we divide it by 2 because we can be plus or minus the mean.
03:01So significance can be greater than or less than.
03:03If we were only interested in less than, we wouldn't divide it by 2.
03:07Type a right parenthesis to close the internal formula--
03:10that's the normsinv function--
03:14another right parenthesis to close the absolute value, Return, and we get a z-
03:18score of 1.96, which we know from past experience is the number of standard
03:23deviations for the 95% confidence level.
03:26Now we can calculate our standard error, which is the amount of error we would
03:30expect from one standard deviation.
03:32To do that, we type Equal, and there we are taking a square root of the
03:36number of measurements, which is in cell D3, divided by the D6--the sample proportion--times (1-D6.
03:43So in other words, we are multiplying the sample proportion by one minus
03:52the sample proportion.
03:54So 0.7 by 0.3. Now we can add a right parenthesis to close the formula.
03:59Everything looks right.
04:00And we have our standard error of 12.74.
04:04Now we can calculate our margin of error, which is the z-score, the number of
04:08standard deviations away from the mean that our data will be if it's within the
04:1195% confidence level.
04:13So we multiply =D10*D12, Return, and we get our margin of error, 24.97.
04:25Now we can have our upper and lower limit based on the number of measurements,
04:28in other words the number of orders from New York, that we would expect from the
04:32entire population based on our sample.
04:34So first we have our point estimate, which is the occurrences of New York, and
04:38that is in cell D5, and our lower limit we'll be entering into cell D16, and that
04:46is the point estimate minus the margin of error.
04:49So it's =D15-D13, Return, and then our upper limit is the point estimate plus
04:59the margin of error.
05:00So it's =D15, the point estimate + D13, the margin of error. Press Tab. And we
05:08see that based on our proportion, we can be 95% certain that we will have between
05:13234 and about 284 orders from customers in New York.
05:18You can use this type of analysis to make judgments about your customer base.
05:22Because about 70% of your orders come from New York, you can decide to offer
05:26more incentives to customers in other states, or you can try to develop your
05:30existing customers by offering deals to get them to purchase more products.
Collapse this transcript
Calculating sample size required for a given confidence level
00:00Political polling often seems like a dark art.
00:03When a news organization reports that a poll shows candidate A leads candidate
00:06B by 8% with a 3% margin of error-- at the usually unstated 95% confidence
00:13level--you might wonder how many potential voters they need to survey to get
00:16that level of accuracy.
00:18In this movie, I'll use a business example to show you how to calculate the sample
00:22size required to reach a given confidence level.
00:25To calculate the sample size, you need the following values:
00:28the confidence level--and in this case we want to be 95% confident--
00:32the significance level, which is one minus the confidence level or in this case
00:365%, your z-score, margin of error, sample proportion and sample size.
00:43To calculate the z-score, which is the number of standard deviations from the
00:46mean, we use the following formula.
00:48It's =normsinv, which is the standard normal curve inverted, so we are
00:54calculating a z-score. Instead of using a z-score to calculate a value, left parenthesis, and it's one
01:01minus the significance level, which is in C4, divided 2.
01:06And we divide by 2 because we want to take the value plus or minus.
01:11In other words, if it's from the mean, it can be higher or it can be lower.
01:15We don't care which direction.
01:17If we only cared about higher or lower, then we would not divide the
01:21significance level by 2.
01:22But this is a two-tailed test, so we are dividing by 2. Type a right parenthesis
01:27and press Enter, and we see the z-score of 1.96, which from previous experience
01:32we know is the proper z-score for a confidence level of 95%.
01:37Now we can enter our margin of error.
01:39In other movies, we've calculated the margin of error, but in this case we
01:42want to specify it, because we're going to use that when we calculate the
01:46required sample size.
01:48So our margin of error will be 0.03, which is equivalent to 3%.
01:54Next we are going to calculate our sample proportion.
01:56And this term isn't exactly obvious, so I'm going to give it a little bit of explanation.
02:02In the next formula, for sample size, we're going to multiply the sample
02:07proportion by one minus the sample proportion.
02:10So I will just work here in a separate cell.
02:12So let's say that we have the value of 0.7 for a sample proportion.
02:18The sample size formula, which I will do separately, would multiply, so I'll do =E8*1-E8.
02:30So that would be 0.21 because you are multiplying 0.7 by 0.3, which is 1-E8.
02:37So the question is how does that sample size calculation formula use the
02:43sample proportion value?
02:44Well, first off, it multiplies it by one minus itself. And if you want to make
02:50the most conservative estimate possible--
02:52so, for example, in the case of political candidates, if you believed that
02:5670% of voters would vote for candidate A, then you would choose a sample proportion of 0.7--
03:04if you want to make the most conservative estimate possible, you should assume
03:08that your sample proportion is 0.5,
03:10in other words, that your population is distributed evenly between the
03:15two possible options.
03:16And I'll show you why that's the case.
03:18If we change the sample proportion to 0.5 and press Tab, you'll see that
03:24the value in cell F8 is 0.25, and that is the largest possible value that you can have.
03:31So, for example, if I change the population proportion to 0.52--still very
03:36close to half--but when I press Tab to accept the value, we see that the sample
03:40proportion multiplied by one minus the sample proportion is less than it would
03:45be if the sample proportion were still one half.
03:47So to make the most conservative estimate we can, we will make the population
03:51proportion one half. So type in .5.
03:56Now we can enter our sample size formula.
03:59And this formula is =C5, which is the z-score, squared, so caret or capital6
04:082*C8, which is the sample proportion, times, left parenthesis, 1-C8.
04:18Again, we are multiplying the sample proportion by one minus the sample proportion.
04:23And to be as conservative as possible, we need to make that value as large as possible,
04:27so we select 0.5. Then we divide this entire term by the margin of error, which
04:33is in cell C7, squared.
04:36So caret, or capital 6, 2.
04:39With that formula in place, you can press Return,
04:42and we see the sample size required of 1,067 persons.
04:46Now let's see what happens if we change our confidence levels and significance
04:50levels and also the required margin of error.
04:52So for the confidence level of 99%, if I'll just change this value in C3 to 0.99, we
05:00see that we need a sample size of 1,843 people.
05:04If we want to change our margin of error to 2%, which is 0.02, then in cell C7 we
05:10would type 0.02 and press Return.
05:13And you see that the sample size goes up quite significantly.
05:16Now you might be wondering, why haven't I talked about populations?
05:19In other words, you know, maybe 100 million or 150 million voters in the United States,
05:24why isn't that figure being used in any of these equations?
05:28And that's because it doesn't matter.
05:30Once your sample size gets large enough, such as in this case where you have
05:33over 4,000 individuals, then with the certain level of confidence--either 95% or
05:3799%--you will know that your survey will reflect your population accurately.
Collapse this transcript
7. Testing Hypotheses
Defining hypotheses and identifying errors
00:00When you analyze your data using statistics, you should know what aspect of your
00:04data you want to test.
00:05The statement that embodies your test is called your hypothesis.
00:09A hypothesis is an educated guess about the characteristics of a data set and
00:13the circumstance it describes.
00:15For example, you could create a hypothesis that says the state a customer lives
00:20is related to the amount of olive oil they order from your company.
00:23When you design a statistical analysis, the first thing you should do is state
00:27your null hypothesis, which says that the fact you're analyzing has no effect on the data.
00:33For example, you could say there will be no difference in the average order
00:36amount based on the state the customer lives in.
00:40So, if a customer lived in New York, then they would not on average order more
00:45than customers that live in Connecticut.
00:47If you're reading about statistics, you might see the null
00:50hypothesis represented as H subzero.
00:53In this case, the zero means that the variable upon which you have formed your
00:57hypothesis has zero effect.
01:01Next, you create an alternative hypothesis, which states that there is a
01:05relationship between two variables.
01:07For example, you could state that the average order of customers from New York
01:11cost more than the average order of customers from Connecticut.
01:14The alternative hypothesis is represented as H sub A, and in this case, the A
01:20stands for alternative.
01:23This alternative hypothesis is what's called a directional alternative
01:27hypothesis in that it specifies that the average values of order from New York
01:31customers is greater than the average value of orders from Connecticut.
01:34A non-directional alternative hypothesis would state that the average order
01:39of customers from New York is different from the average order of customers from Connecticut.
01:44That means under that alternative hypothesis, the New York orders could be
01:48greater than or less than orders from Connecticut, and the non-directional
01:52alternative hypothesis would still be true.
01:54So how do you create a good alternative hypothesis?
01:57Well, there are number of steps that you can go through, and if your alternative
02:01hypothesis meets all four of these tests, then you've probably created one that
02:06can be tested accurately.
02:08First, you need to state explicitly that there is an expected relationship
02:13between two variables.
02:15Second, you need to base your alternative hypothesis on your knowledge of
02:19the world, and this is from looking at the statistics that you have in your possession.
02:24So, for example, if you know that the average order from New York customers is
02:28150 and the average order for customers from New Jersey is 125, there is the
02:32possibility of a difference that you can investigate a little bit more.
02:37Next, you want to express the hypothesis simply and briefly so it's easily
02:42understood, and also make sure that you are only testing one thing;
02:46you don't want to have "and" or "or" anything like that in your
02:52alternative hypothesis.
02:53It should be stated very simply one fact that you are testing for.
02:58And finally, make your alternative hypothesis testable.
03:01If you don't have any data that indicates the amount that New York customer
03:05spend or the amount that New Jersey customer spend, then you can't really
03:09test that hypothesis.
03:10So look at your data, and from that and your knowledge of your business and its
03:14operations, decide what it is that you want to test for.
03:18There are a couple of errors that you can have in hypothesis testing, and those
03:22two errors are called TYPE1 errors and TYPE2 errors.
03:27A TYPE1 error is a false positive, and that is when you reject a null
03:32hypothesis that is true.
03:34So let's assume, for example, the New York customers actually do spend more
03:37than customers from New Jersey when they purchase items from your store.
03:41It's possible that the data that you collect causes you to reject the null
03:45hypothesis that New York customers do not spend more money per order when they actually do.
03:51So that's called a false positive.
03:53And remember, anytime you collect data and perform a statistical analysis using
03:57inferential techniques, you have a confidence level such as 95%.
04:02That means even if your experiment is designed properly, that about 5% of the
04:07time, you will generate an error.
04:10If the error is a false positive, you reject a null hypothesis that is true.
04:14The second type of error, which is a false negative, says that you fail to reject
04:19a null hypothesis that is false.
04:21So in other words, if your null hypothesis is that New York customers don't
04:25spend more per order, and they actually do, but your examination of your data
04:29does not lead you to that conclusion, then you have failed to reject the null
04:33hypothesis, and you have a TYPE2 error.
04:37For a more common example and one that even if you don't work with statistics you
04:40might have encountered, think about a jury trial in the United States.
04:44In the U.S., the null hypothesis is that a defendant is innocent.
04:49So a TYPE1 error is that the innocent person is convicted.
04:53A TYPE2 error is that the guilty person is set free.
04:57Following the four rules of creating a good alternative hypothesis will help cut
05:01down on errors, but always remember that there is a possibility of error when
05:05you deal with inferential statistics.
05:07Anytime you work with data from less than an entire population, there is some
05:11danger that you will reach an incorrect conclusion.
Collapse this transcript
Performing one-tailed tests on a large sample's mean
00:00In this movie, we will use a one-tailed test to examine a sample's average value.
00:06A one-tailed test analyzes data to see if it deviates significantly from a
00:09target value in one direction: either less than or greater than, but not both.
00:15The techniques I am going to show you in this movie assume that we are working with
00:18a large sample, so the data set must contain at least 30 values.
00:22Now the goal is to test the hypothesis that a discount coupon mailed to 30 sets
00:27of 200 randomly selected customers within a 5 mile radius of the Two Trees store
00:32will bring in 18 new customers on average per mailing group.
00:37So my target value, which I have here in cell D6, is 18.
00:42So that's my target value, and we will use that in a moment.
00:45But first, I need to fill in the rest of the values in the worksheet, so that we
00:49can create our formulas.
00:51So first we need to calculate our sample size. That's in cell D2.
00:57So I'll type equal, and we are going to count the number of measures that
01:02we have in our table.
01:04So I will select the table column by moving the mouse pointer over the table
01:08column header, and when the mouse pointer is a downward-pointing black
01:11arrow, click the left mouse button to select it, and you see that it's added to the formula.
01:18When I type a right parenthesis and Return, Excel counts my sample size, which is 30.
01:23Now I can calculate the mean, or the average.
01:26To do that, it's =AVERAGE(, and I select the table column again, and then a right
01:33parenthesis and Return, and that's the mean.
01:37Now I can calculate the standard deviation, and I create exactly the same type of
01:40formula: stdev, because I am working with the sample instead of the entire
01:45population. Left parenthesis and select the table column again.
01:50There it is. Right parenthesis, Return, and there is my standard deviation.
01:55I already have my target value of 18,
01:58so I can move on to computing the standard error.
02:02In this case, the standard error is the standard deviation divided by the square
02:07root of the sample size.
02:09So I have those values in my worksheet already,
02:11so it's =D4 divided by the square root--that's sqrt--left parenthesis, of
02:20the sample size, which is in cell D2. Right parenthesis. Everything looks good,
02:25so my standard error is 1.9885.
02:29Now, I can calculate my z-score.
02:31And when I have this particular set of statistics, I calculate my z-score by
02:37typing =(, and then you have the mean.
02:42You subtract the target value, which is D6, right parenthesis, and you divide that by your
02:50standard error, which is in cell D8. Press Return and we get a z-score of -1.9277.
03:00Now we can calculate the probability using that z-score.
03:03Remember, z-score is the number of standard deviations we are from a mean.
03:07So we have the probability is =1-normsdist--
03:15that's the standard normal distribution-- left parenthesis, and then the z-score,
03:20which is in cell D9. Right parenthesis, everything looks good, Return, and we
03:26have our probability of 0.9731.
03:30Now note that that probability is extremely high, which indicates that we should
03:34not reject the null hypothesis that the coupon mailing would bring in 18 new
03:39customers for mailing group.
03:40And you can see, based on the mean of the number of new customers, that a mean
03:45of 14.167 isn't anywhere close to 18,
03:49so the probability that we have here makes sense when it comes to failing to
03:54reject the null hypothesis.
Collapse this transcript
Performing two-tailed tests on a large sample's mean
00:00In the previous movie, I showed you how to perform a one-tailed, or directional,
00:04test on a large sample data set.
00:07In this movie, I'll show you the same technique, using a two-tailed test.
00:12You use the two-tailed test to check if the average of a sample is within a
00:16given number of standard deviations of the target mean in either direction.
00:20For example, if you sell 12-ounce bottles of olive oil, you want the average
00:24amount of olive oil to be very close to the advertised volume.
00:28As with one-tailed tests, you use the sample size, mean, and standard deviation,
00:33and target value to compute your other statistics.
00:36So let's get those into the worksheet now.
00:39For the sample size, we need to count the number of measures in our table.
00:43So I will type "=COUNT", and this is in cell D4, left parenthesis, and then the
00:51values. And I'm going to select the table column by hovering the mouse pointer
00:56over the column header.
00:58When the mouse pointer changes to a downward-pointing black arrow, click
01:01for left mouse button, and I get the table name of Fill, and the column name of Ounces.
01:07Right parenthesis to close the formula, Return, and we got a sample size of 30.
01:13Now we calculate the mean, or the average.
01:15So for that it's =average( and select the table column again, Fill, Ounces, right
01:23parenthesis, Return, and we see an average of 12.1367.
01:29Now the standard deviation, same type of formula stdev, and we're working with the sample,
01:35so we'll use this one as opposed to the population version.
01:39Left parenthesis, and select the table column again, right parenthesis, Return,
01:44and we have a standard deviation of 0.3.
01:48I filled in our target value of 12 ounces earlier, so now I can go on to
01:52computing the standard error.
01:56To calculate my standard error, I divide the standard deviation,
02:00so =D6, and divide that by the square root of the sample size, which is in cell D4.
02:08So it's SQRT, the square root function, left parenthesis, D4, right parenthesis
02:15to finish the formula, Return, and there is my standard error.
02:20Now to calculate the z-score, I use this formula, and that is equal =(D5, which
02:26is the mean, minus D8, which is the target value, right parenthesis. And we
02:34divide that result by the standard error, which is D10. Press Return and we
02:40get a z-score of 2.462.
02:41Now that's a fairly large z-score,
02:44so that tells us that it's likely that this mean is significantly higher than
02:49it should be if our target value is 12 ounces.
02:53To continue our analysis, let's take a look at our lower tail.
02:57The lower tail will give us the percentage of values under the normal curve that
03:01will be less than this z-score,
03:03the percentage of the curve that is to the left of the score.
03:07So we type "=normsdist", which returns a standard normal distribution, and that
03:14takes one argument, our z-score, which is in D11. Right parenthesis and Return.
03:20We see that 99% of the values under the standard normal curve are to the left of our z-score.
03:27Now to calculate the upper tail, we subtract the lower tail value from 1.
03:32So that is =1-D13. And that value corresponds;
03:40it's a less than 1% of the values above that z-score under the normal curve.
03:45Now we multiply the smaller of those two values by 2 to find the probability
03:49that the value we calculated is in the tails of the curve,
03:53in other words the rejection area outside of where we want the mean to be.
03:58So, to find the lower of the two values, we type "=MIN(D13:D14)", and then we
04:10multiply that value by 2.
04:12Press Return, and we get our answer, so about 1.3%.
04:18So, how do we interpret these results?
04:19Well, the Two-Tail value is less than .05, so we can't,
04:23with 95% confidence, reject the null hypothesis that the mean fill level for the
04:28bottles average is 12 ounces.
04:31What we can do is use the values in the worksheet to create a confidence
04:34interval, which will verify that the target mean of 12 ounces is outside the
04:38range we calculated.
04:40The formula we use multiplies the z-score by the standard error, and then adds
04:45and subtracts those values from the mean.
04:47We're going to use the z-score that corresponds to the 95% confidence level, and
04:52that number is 1.96.
04:53So, to create the upper bound, we add the mean, which is in cell D5, +1.96 times
05:05the value in cell D10, which is the standard error, and press Return.
05:09So our upper bound is 12.2456.
05:12Now we can do the same thing to calculate the lower bound, except that we've
05:16subtract the z-score times the standard error from the mean.
05:21So for this, we type "=D5-1.96", which is the z-score for 95% confidence
05:30multiplied by the standard error, which is cell D10.
05:34Press Tab and we get a lower bound of 12.0279.
05:39What this means, with 95% confidence, is that we are filling our bottles too full,
05:43and we need to adjust their processes.
Collapse this transcript
Testing a small sample's mean
00:00When you need to analyze data based on a small data sample--which means one with
00:04less than 30 data points--you need to use the t distribution to test whether the
00:08sample data provides convincing evidence to support your alternative hypothesis.
00:13When you test the small sample's mean for significance, you're comparing the
00:17measured value against a target value.
00:19In this example, 14 customers filled out satisfaction surveys with 10
00:23questions, and assigned values to 10 aspects of the two trees store, and those
00:28values were from one to 10.
00:30The answers are translated to a score from 0% to 100%, and they are
00:34recorded here in this table.
00:36The first thing we need to do is to count the number of measurements, find their
00:40average, and the standard deviation.
00:42For that, we use these formulas.
00:44For the measurements, it's =COUNT(, and then I'm going to select this table column.
00:53To do that, I move the mouse pointer over the table column header, and when the
00:56mouse pointer is a downward-pointing black arrow, I click the left mouse button,
01:02and that fills in the reference, which is Ratings table, Satisfaction column.
01:07Right parenthesis to close, Return, and I see that I have 14 measurements.
01:12I'll do the same thing for the average;
01:14=average, select the table column again, right parenthesis to close, and the
01:20mean is 0.85, or about 85%.
01:23Then the standard deviation, =stdev, and again, we're working with the sample,
01:29not an entire population.
01:31Left parenthesis and select the table column again. Ratings. Satisfaction. Looks
01:36good. Right parenthesis to close, and we have a standard deviation of 0.08.
01:42I've already entered the target value,
01:44so now I can calculate the standard error, the t-score and the degrees of freedom.
01:50To calculate standard error, we divide the mean by the square root of the
01:54number of measurements.
01:56So it's =D5, which is the mean, divided by the square root--
02:01sqrt is that function--left parenthesis, of the count of the number of
02:06measurements, and that is in cell D4. Right parenthesis to close, and that looks
02:11good. Return and we have a standard error of 0.228.
02:15So now with those values, we can calculate our t-score, which is the number of
02:20standard deviations from the mean that our value occurs.
02:22So I'll type "=(", type in the mean which is in cell D, subtract the target value,
02:31which is in D8, right parenthesis, and then divide those values by the standard
02:37error, which is in cell D10.
02:40Press Return and we get a t-score of 0.23776, a very low t-score.
02:46Now we need to add in our degrees of freedom.
02:49The degrees of freedom are the number of measurements minus one.
02:53So that is =D4-1, and we have 13 degrees of freedom.
03:00You can use the t-score along with the degrees of freedom to calculate the
03:04probability that the mean is equal to the target value of 80.
03:08To do that, I'll type the formula =tdist(, T11, which is the t-score, the
03:16degrees of freedom, which are in cell D12, comma, and then the number of tails.
03:22I want to do a one-tail test,
03:24so I'll type 1, Return, and we have a lower tail of 0.40789.
03:32What that means is about 40.7% of the values under the standard normal curve
03:37will be less than, or to the left of, the value that we computed.
03:42The probability of the upper tail, in other words the probability that the
03:46mean is greater than what we observed, we calculate by subtracting the value
03:51we just calculated from 1.
03:52So =1-D14. The value is 0.59.
04:00In this case, customer satisfaction is quite high.
04:04The two-tail probability is large enough so that we can reject the null
04:07hypothesis that a customer's satisfaction averages the value of 80.
04:12Fortunately, that's because the value is higher than the target.
Collapse this transcript
Testing the mean of a population proportion
00:00When you analyze your business's operations, you'll often concentrate on one
00:03value, such as the state your customers come from.
00:07Now let's say that you want 64% of your orders to come from customers who live in New York.
00:12You can use population proportion analysis techniques to determine if you've met your goals.
00:18I've set up a sample worksheet with the data that you need to use to find this
00:21information, and we can start filling it in with formulas.
00:25First, we need to know the total number of measurements, in other words the
00:29number of orders by state, here in the State column.
00:33To do that, we use =COUNTA, and we use COUNTA to see if a cell has any value in it.
00:42If we were to use COUNT, Excel would only look for numbers, and you wouldn't get
00:45the answer you're looking for. It would display 0.
00:47Left parenthesis. And then we need to identify this table column, so I will move
00:53the mouse pointer over the table column header, and when it is a downward-
00:57pointing black arrow, click, and Excel fills in that table column reference.
01:03Type a right parenthesis. Everything looks good, and we get a total number of measurements of 372.
01:08I already filled in our target state, which is New York.
01:13Now we need to find the number of times that New York occurs in that column.
01:17To do that, we use a COUNTIF formula.
01:19So its =COUNTIF(, and then we are looking at the range which is this
01:26table column here, so I'll just click it again using the same technique I used earlier.
01:32States, State, and then we need to type a comma and indicate what we we're
01:36looking for, and that is the value in D4, which is the State of New York.
01:40Type a right parenthesis, press Return, and we see that we have 259 occurrences.
01:46The sample proportion is the percentage of sample values that match the target.
01:51To find that percentage, we divide the occurrences of New York by the total
01:54number of measurements.
01:55So that's =D5/D3. Press Return and you get a proportion of 0.7.
02:03Now as I stated earlier, our target value is .64.
02:07In other words, our goal was to see whether we can get at least 64% of our
02:12customers come from New York State.
02:14Now we need to check to see whether our sample proportion of .7 is statistically
02:19significant and greater than .64.
02:22So we need to make a few more calculations.
02:24The first is we need to calculate the standard error.
02:28We calculate the standard error for this type of data by using this formula.
02:32It's equal, square root, left parenthesis, of D6, which is the sample proportion,
02:40multiplied by (1-D6. So we are multiplying the sample proportion by one minus
02:49the sample proportion, so .7 times .3, and then we are dividing that value by
02:55the total number of measurements, which is found in cell D3.
02:59Type a right parenthesis, so everything looks right.
03:02Press Return and we get our standard error of .02.
03:06Now we can calculate our z-score, and to do that for this type of data, we use
03:10this formula. It's =(D6,
03:14which is the sample proportion, minus D8, which is the target value. Right
03:19parenthesis. Divide that by the value in D10, which is the standard error, and
03:25that will tell us our z-score. Press Return and we get a value of 2.36.
03:31Now we can use the normal distribution functions to see where that z-score falls
03:35within the normal curve.
03:37We are going to use three separate measurements, so depending upon the type of
03:41analysis you want to perform, you will see those values here.
03:44So for looking for the percentage of the curve that is below our
03:48particular z-score, which is 2.36 standard deviations above the mean, we use this formula.
03:54It's =normasdist(D11). Press Return and we get a probability of .99.
04:07So in other words, that tells us that this value is quite high.
04:11We can do the same thing to calculate our upper tail percentage, or probability,
04:15and that is =1-normsdist(, and our z-score is in D11. Right parenthesis to close.
04:26We get a probability of .01.
04:29And finally, we can calculate the two-tailed probability, which tells us whether
04:32the value is higher or lower than the mean.
04:35So the probability of that is =min(D13:D14). So in other words, we are finding the
04:49smaller of these two values, and multiplying that value by 2.
04:53Press Tab and we get a probability of .02.
04:57These three probabilities let you apply the test you are interested in,
05:00whether the sample proportion is significantly smaller than the target
05:03proportion, larger than the target proportion or if it is greater than or less
05:08than the target proportion.
05:10In this case, the sample proportion of .7 appears to be significantly larger
05:14than the target value of .64, so your efforts to get more orders from New York
05:18customers appear to be paying off.
Collapse this transcript
Estimating the difference between two populations' means
00:00Many businesses have two or more locations,
00:03so you can gain a lot of information about your organization by comparing the stores' results.
00:07In this example, we are assuming that the Two Trees Olive Oil Company has one
00:11location in New York and another New Jersey.
00:14We have more than 30 sample sales values for each store,
00:17so we can use large sample techniques to analyze our data.
00:20We're going to examine the difference between the values of the store's
00:23average order and to determine the interval the actual average will fall into
00:28with 95% confidence.
00:30The first thing we need to do is discover the number of measurements that we
00:34have for New York and New Jersey,
00:36in other words the number of sales that are recorded in this table here on the left side.
00:41To do that, we use a COUNT formula.
00:43So it's =COUNT(, and we are looking for New York in this cell.
00:50So I will go over to the table and hover the mouse pointer over the New York
00:53column header, and when the mouse pointer changes to downward-pointing black
00:57arrow, left-click, and Excel adds that table reference to my formula. Type a
01:03right parenthesis to close it, press Return, and we get 44.
01:07Now we do the same thing for New Jersey.
01:09So it's equal =COUNT( and then select the New Jersey table column using the same
01:16technique I used for New York. The reference looks correct.
01:19Type a right parenthesis, press Return, and we get the same number of measurements.
01:25Now the numbers don't have to be the same;
01:27they just both have to be larger than 30.
01:29Now we need to calculate the mean and to do that we use the AVERAGE formula.
01:34So it's =AVERAGE(, and again we are looking for New York, so select the table
01:40column, right parenthesis, and press Tab.
01:44We get that average.
01:45Now we will do the same thing for New Jersey.
01:46So it's =AVERAGE(, select the New Jersey column, right parenthesis, and
01:53Return, and now we do the standard deviation, and that is the STDEV function.
01:59And we are working with a sample instead of population,
02:02so we will use STDEV instead of STDEVP. Left parenthesis, select the New York
02:07column, right parenthesis, and Tab, and we do the same thing for New Jersey:
02:12stdev, left parenthesis, and select the New Jersey column, right parenthesis, and
02:19Return, and there we have the values.
02:22I mentioned earlier that we are going to be using a 95% confidence level, so I
02:27entered that in here in cell E7, and that means we have a significance level of .05, or 5%.
02:35Your significance level is 1 minus your confidence level.
02:38Now we can calculate our target z-score for the 95% confidence level. And to
02:44do that, we use the formula =normsinv and our probability, which is 1-E8, divided by 2.
02:57We divide the significance level by 2 because we are interested in whether our
03:00measurements are off plus or minus;
03:03in other words, it's a two-tailed test.
03:05If we were only interested in whether the value for one were above the other value
03:10or below the other value, but we didn't want to look in both directions, then we
03:13would not divide by 2.
03:15But in this case, we are looking at both above and below, or greater than or less
03:18than, so we divide by 2.
03:19Type a right parenthesis, Return, and we get the z-score of 1.95.
03:25It's 1.96, which we know from previous experience to be the 95% confidence level z-score.
03:32Now we can calculate our standard error, and this is the amount of expected
03:37error that we would see for each standard deviation. And the formula we use to
03:41calculate that when we are comparing the differences of two means is equal,
03:48square root, left parenthesis, E5, squared--and to do that we type a caret or
03:55capital 6 and a 2--divided by E3, plus F5-- which is the standard deviation for the New
04:06Jersey measurements--squared, so that's ^2, divided by F3 which is the number of measurements.
04:14Type a right parenthesis. Everything looks good.
04:17Press Return and we get our standard error of 12.79.
04:22We get the margin of error by multiplying the standard error by the z-score, and
04:26that is =E11*E9. Return.
04:32So our margin of error is 25.08.
04:36Now we can calculate our point estimate, which is the difference between the two
04:39means here of New York and New Jersey, which are in cells E4 and F4.
04:45So our point estimate is =E4-F4, Return, and we get that value of 38.045.
04:55Finally, we can calculate the lower limit and the upper limit for the difference
04:59between those two averages.
05:01The lower limit is the point estimate minus the margin of error.
05:04So that is =E14-E12, Return, and the upper limit is the point estimate plus the
05:15margin of error, so that is =E14+E12. Press Tab. So we see, with 95% confidence, that
05:26the actual difference between the two means of the average order between New
05:30York and New Jersey is somewhere between $12.96 and $63.12.
Collapse this transcript
Testing hypotheses related to population means
00:00As the owner of a business, you're always looking for ways to improve your processes.
00:04When you try something new, you should analyze your results to determine whether
00:07the new process is better, the same, or worse than your current plan.
00:11In this movie, I'll show you how to compare the results of two olive oil
00:14purification processes used by Two Trees.
00:18Our data is in a table in the first two columns of the worksheet.
00:21These are the measures of purity for the two processes.
00:24Process1 is the existing process, and Process2 is the new process that the
00:29company is testing.
00:31Because we're testing a hypothesis, we need to state a null hypothesis.
00:35The null hypothesis says that the average purity levels for Process1
00:39and Process2 are equal.
00:40In other words, there is no difference between the average purity as a result of either process.
00:47The data analysis tool we'll use requires us to calculate the variances of the
00:51two data sets, and input the values manually.
00:54So we'll create VAR formulas to calculate those values.
00:58So in cell E3, I'll type "=VAR"--
01:03and this is variance for a sample, not an entire population, so we use
01:07VAR instead of VARP.
01:08Left parenthesis and then I will select the Process1 column.
01:13To do that, I move the mouse pointer over the column header, and when the mouse
01:17pointer changes to a downward-pointing black arrow, click, and the reference
01:22appears within the formula.
01:24Type a right parenthesis, press Return, and we see the variance.
01:27Now, I'll do the same thing for Process2:
01:29=VAR, left parenthesis, Process2, table reference is correct, right parenthesis,
01:38Return, and we have our values.
01:41Now we can use the data analysis tool that is called z-Test:
01:45Two Sample for means.
01:47To run that, we go to the Data tab of the Ribbon and then in the Analysis group
01:52click the Data Analysis button.
01:54If you don't see the Data Analysis button or the Analysis group on your Ribbon,
01:58then you need to install the Analysis ToolPak, and I show you how to do that
02:02in the movie near the beginning of the course.
02:04But now, let's assume it's installed. Click Data Analysis, and then, at the
02:10bottom of the list, click z-Test:
02:13Two Sample for means, and click OK to run it.
02:17In the dialog box, I can select our variables. And the variables for Range 1 are
02:23in the table column for Process1.
02:25So I click the Collapse dialog button, and select the column, same way I did
02:30when I was creating the variance formulas.
02:32Click. And cell range appears there, so I can click the Expand Dialog button.
02:37Do the same thing for the variable 2 Range. Select that, the references look good,
02:44and expand the dialog.
02:46Now I need to enter my hypothesized mean difference, and if I move the dialog
02:50box out of the way of the null hypothesis, you'll see that I'm hypothesizing
02:55that the average purity levels for Process1 and Process2 are equal.
02:58That means the mean difference that I'm hypothesizing is 0. So I'll type 0.
03:05Now we need to enter in the variance for Variable 1 and Variable 2.
03:10You can't use a cell reference for this.
03:12I'm not sure why, but the tool just doesn't allow it.
03:14But that's why we created the variance formulas inside of our worksheet, and we
03:18can type in the values.
03:20So the Variable 1 variance is 37.4989247.
03:27Then press Tab to move to the Variable 2 box, and that is 35.4494624.
03:35We don't have any labels in our selections.
03:37We selected A4 through A34, and A3 is where the labels are, so we won't
03:42have them this time.
03:44Our alpha value is the significance level that we're looking for, and it is
03:48set by default to .05.
03:50We can change it, but an alpha, or significance level, of .05 is equivalent to a
03:55confidence level of .95 or 95%.
03:59That's the standard measure used in most statistical analyses.
04:03Finally, we can select our output range.
04:06I want to put the data on the same worksheet,
04:08so I will leave the Output Range option button selected, click the Collapse
04:13Dialog button, and click cell D6.
04:19Click the Expand Dialog button. All my values look correct,
04:22so I can click OK, and Excel adds the data into the worksheet.
04:26I will expand column D, so that you can see all of the values.
04:32So as part of this tool, Excel calculates a number of values.
04:35We have the mean for Variable1 and Variable2, which are Process1 and Process2.
04:40We also have the number of observations, and then down at the bottom we
04:44have our statistics.
04:46The first statistic is the z-score, which indicates how far the difference in
04:51these two variances are from the mean.
04:54In other words, it's asking how probable is it that the difference between the
04:58means is due to chance?
05:00We have the z-score of 2.712, which means that the probability of the difference
05:06of the means being completely due to chance is less than 1%.
05:11It's 0.003, which is equal to 0.3 of a percent.
05:16By comparison, the critical value for the z-score is 1.644, and because the z-
05:21score that we got, 2.7, is significantly higher,
05:25then we have a good idea that Process 1 and Process 2 do have different
05:30means. And as we can see, the mean value for Process 1 is significantly
05:34higher than Process 2,
05:35so we should stay with Process 1, and either work on refining Process 2 or find
05:40another process entirely.
Collapse this transcript
Inferring results from matched samples selected from two populations
00:00If you've ever been to an eye doctor, you're probably familiar with the
00:03comparison sequence, "which is better, A or B?"
00:07It is much the same for businesses. In many cases,
00:09you'll need to determine which of two competing processes produces
00:13more desirable results.
00:15In this movie, I'll show you how to analyze two data sets that
00:18use matched samples.
00:20A matched sample means that for every sample in list one
00:23there is a matching B sample in list two.
00:26In this scenario, Two Trees Olive Oil Company is testing two different processes
00:31to purify olive oil.
00:33The values represent the purity level after Process 1 and Process 2
00:37were used on oil from the same batch.
00:39You can perform this type of analysis, which is a paired t-test analysis,
00:44using the Paired t-Test tool that's part of the Analysis ToolPak, to
00:48determine whether the differences between two sets of matched pair
00:50measurements are significant.
00:53To use the Analysis ToolPak, you click on the Data tab of the Ribbon, and then
00:58at the far right edge of the Ribbon, you click Data Analysis.
01:02If you don't see the Data Analysis item on the Ribbon on the Data tab, then you
01:07need to install the Analysis ToolPak, and I showed you how to do that in a
01:11movie early on in the course.
01:12But for now, we'll assume it's installed.
01:14So we can click Data Analysis, and then in the Data Analysis dialog box we can
01:21scroll down and click t-Test:
01:24Paired Two Sample for Means. Click OK and the dialog box of the same name appears.
01:31So the first thing we need to do is select the ranges that contain our
01:35variables. So I'll click the collapse dialog button. And then rather than
01:40selecting this entire table column by moving the mouse pointer until it changes
01:44into a downward-pointing black arrow, I'm going to select the cells
01:47individually, so the mouse pointer should be what's called a Greek cross, or a
01:52white four-way cross, and I will drag down to select cells B1 through B15.
02:00And the reason I'm doing that is because I want to select the column label, so
02:04that it will appear in our results and make them easier to understand.
02:07So I'll click Expand Dialog button, and we see that Variable 1 Range is correct.
02:11Now I will click in Variable 2 Range, click the Collapse Dialog button next to
02:16that field, and select again.
02:18The mouse pointer should be the white four-way cross. Drag down to cell C15,
02:26click the expand dialog button, and we see that we have our correct cell ranges selected.
02:33Now the dialog box asks us to enter a hypothesized mean difference.
02:38It is an optional argument, but it helps with the calculations if you can put one in.
02:43And in this case, I am going to hypothesize that the difference between the two
02:46values is .01. And if you look at the values in the table, and you look at the
02:51pairs, you'll see that that's not an unreasonable assumption.
02:54I am going to select the Labels check box, because the labels are in the
03:00cells B1 and C1. And the Alpha, which is the significance level, I am going to leave at .05.
03:07If you subtract your significance level from 1, then you find your confidence level.
03:11So 1-.05 is 95%, which is the usual confidence level that's used in statistics.
03:18I do want to put my results on a new worksheet, so I will leave that option
03:22button selected. Click OK and I get my results.
03:27I'm going to reconfigure the worksheet so that we can read everything.
03:30I'll double-click the column margin at the right of the A column, so that we see
03:37every value there and then Method 1 and Method 2. I'll double-click for column
03:42edges to expand those columns as well. And I'll zoom in a little bit, so that
03:48the values are easier to read.
03:52Now we can examine our statistics.
03:54So, we find the t statistic is 1.821, and we can compare that to our one-tailed
04:01t statistic of 1.77, and the corresponding probability of .045 or 4.5%.
04:11The t statistic is larger than the critical one-tail statistic, so that means
04:15that the value is significant.
04:17However, it is very close to the .05 value, or 5%, so you will probably want to
04:23take a larger sample to analyze this data more thoroughly.
04:27Because this example used only 14 observations, you should probably run
04:30another batch of tests, preferably with larger sample sizes, to see if there is a
04:34significant difference between the two processes.
Collapse this transcript
Analyzing variance using an F-test
00:00In this course, I've shown you several tests to analyze the characteristics of
00:04two data sets to see if the differences are statistically significant.
00:08In this movie, I'll show you how to use the F statistic, which helps you
00:12evaluate a hypothesis that there is a difference between two groups based on the sample data.
00:17Say you have a data set with customer satisfaction percentages from customers that
00:22are from Connecticut and from New York.
00:24You can use an F-test to see if the variances of these two data sets are significant.
00:29To do that, you click the Data tab on the Ribbon, and then in the Analysis group
00:34at the far right edge, click Data Analysis.
00:38If you don't see the Analysis group or the Data Analysis button on the Data tab,
00:42then you need to install the Analysis ToolPak add-in.
00:45I showed you how to do that early on in the course.
00:48So now we click Data Analysis, and in the list of Analysis tools, click F-Test
00:55Two-Sample for Variances, and click OK.
00:59When you do, the dialog box of the same name appears.
01:02Now we need to let Excel know which variable ranges we're going to use,
01:05in other words which worksheet cells contain which values.
01:08So I'll click the Collapse Dialog button next to the Variable 1 Range box and
01:14select A3 through A9.
01:17I am selecting the label that will make our output easier to understand.
01:21And click the Expand Dialog button.
01:25Now I'll do the same thing for the Variable 2 Range.
01:28So, I'll click there, click the Collapse Dialog button, select B3 through
01:36B9 again, including the header, click the Expand dialog button, and the
01:41variable range appears.
01:43I will select the Labels check box, because I did include the labels in my
01:47selection, and I'm going to leave the Alpha level at 0.05.
01:51Alpha is the same as the significance level, and it is one minus the confidence level.
01:57So, for example, an alpha level of 0.05 corresponds to a confidence level of
02:030.95, or 95%, which is the standard confidence level used throughout most
02:08statistical analyses.
02:09I am going to put my outputs on the same worksheet.
02:13So I will click Output Range and then click the Collapse Dialog button
02:17and select cell D3.
02:20So that will be the top-left corner of the table that Excel creates.
02:24Click the Expand Dialog button. Everything looks good, so I'll click OK.
02:29When I do, my results appear.
02:31Now, I'm going to change the size of the first column, and click here, so that
02:38you can read the values more easily.
02:40So you can see the tool calculated the mean, the variance, and the number of
02:45observations, and degrees of freedom for each of these two columns.
02:49At the bottom, we have the F statistic.
02:51The F statistic is 1.295.
02:55The critical one-tail F statistic-- in other words the value that the F
03:00statistic would need to exceed to be statistically significant for 5 degrees
03:05of freedom--is 5.05.
03:08So that means that the probability that the variance or the difference in
03:12variance between the two samples is significant is only 39%.
03:18That's well below the threshold of what would be considered significant,
03:21so we can say that based on this small sample, the difference of the variances
03:26between customer satisfaction for customers from Connecticut and New York is not
03:30statistically significant.
03:32Calculating an F statistic helps evaluate whether there is a
03:35significant difference between two groups, but the F-test is very
03:39sensitive to non-normality.
03:41In other words, if you suspect your data isn't described by the normal
03:44distribution, you should not use the F-test.
Collapse this transcript
Analyzing variance using ANOVA
00:00ANOVA, or Analysis of Variance, uses the F statistic I demonstrated in the last movie.
00:06Unlike the F statistic by itself, you can use ANOVA to compare the averages of
00:11more than two groups.
00:12In this movie, I'll show you how to use Excel's ANOVA tool to analyze your data.
00:16ANOVA analysis checks for the significance of differences in values based on
00:21factors such as age, state of residence, or education level.
00:25As an example, as you can see in this worksheet, let's suppose that you perform a
00:29series of customer satisfaction surveys and group the results by customer type:
00:34Restaurant, Reseller, or a Private individual.
00:37You can perform an ANOVA test to determine whether customer satisfaction levels
00:41vary significantly among those categories.
00:44When you analyze this data, you set a null hypothesis, which states that the type
00:49of customer does not significantly influence satisfaction.
00:52To perform the analysis itself, you go to the Data tab of the Ribbon and then in
00:58the Analysis Group, click Data Analysis.
01:01If you don't see the Analysis Group or the Data Analysis button in the Analysis
01:06Group on the Data tab, then you need to install the Analysis ToolPak add-in.
01:10I showed you how to do that in a movie early on in this course.
01:13But now let's assume it's installed, and click Data Analysis.
01:18Then in the dialog box that appears, click Anova:
01:22Single Factor, and click OK.
01:25In the dialog box that appears we can select our Input Range.
01:29So I'll click the Collapse dialog button, and then select cells A3 through C9,
01:36and click the Expand dialog button, and the range we selected appears in the
01:41Input Range. And our labels are in the first row,
01:44so I will select that check box. And I selected the labels so that our output
01:49will be easier to understand.
01:51Now, the Alpha level is 0.05, and alpha is the same as the significance level,
01:57which is the confidence level subtracted from 1.
02:00So in other words, an alpha of 0.05 is the significance level of 0.05, and it's
02:05equivalent to a confidence level of 0.95 or 95%.
02:10That's the confidence level used in most statistical analysis.
02:14Now, I can select my output options, and I will use an output range.
02:19I'm going to make the leftmost cell of my output E3.
02:24So, I'll select there, click the Expand dialog box, everything looks good, and I'll click OK.
02:33I'll zoom out just a touch, so that you can see the data, and I'll expand column
02:39E, so that you can read the labels.
02:41So we have our Anova:
02:42Single Factor analysis, and the factor we're using is customer type.
02:46So we have Restaurant, Reseller, and Private, and you see the different summary
02:49statistics for each of those.
02:51So, for example, you'll notice that the reseller's average satisfaction rating
02:56is larger than that for restaurant, or private customers.
02:59The only question is is this significant?
03:02For that, we can look down at the ANOVA table and analyze it this way.
03:06The bottom section of the results contains the P-value, which is 0.059.
03:12That value is greater than 0.05, so that means that we can't quite reject
03:17the null hypothesis that satisfaction differs significantly according to customer type.
03:22You can also compare the F statistic to the F critical statistic.
03:27So the F statistic is 3.43, and the critical value for F is 3.68.
03:34So the probability of customer type being significant is very, very close to being true.
03:40Based on this analysis and the fact that you only took six surveys for each
03:45customer type, you should probably do another survey, probably with a larger
03:49sample size, in the near future.
Collapse this transcript
8. Using Linear and Multiple Regression
Creating a linear regression line for an XY (scatter) chart
00:00Anytime you have two data sets that you think might be related, you can
00:03use Excel's Regression tools to analyze the relationship, and see if it's significant.
00:08In this case, for example, let's say the Two Trees Olive Oil Company wanted
00:12to compare the number of articles written about it in the popular press versus its yearly sales.
00:17To do that, you can create an XY scatter chart and then add a trendline to see
00:22if there's a significant relationship between those two sets of values.
00:26To create the chart, you select any cell in your Excel table, and then on the
00:31Insert tab, click Scatter, and then click the first subtype, which is Scatter with only Markers.
00:42When you do, the Excel creates the XY scatter chart.
00:44We have the number of articles along the bottom and sales volume along the top.
00:51Now we can add a trendline to see if there is a significant relationship between
00:55these two sets of values.
00:57To do that, you go onto the Layout contextual tab, click Trendline, and
01:03click Linear Trendline.
01:06Excel displays the trendline, and you see that it has a fairly steep, upward slope,
01:10and that is a good indicator that there is a relationship between the two variables.
01:14We can go step further and add the correlation coefficient, which I talked about
01:19earlier in this course, for this trendline.
01:21So click Trendline, click More Trendline Options, and then click Display Equation
01:28on chart and Display R-squared value on chart.
01:33When I click close, you can see that the values have appeared.
01:36You use the equation to draw the trendline by taking the value of x--and x is
01:42the number of articles--multiplying it by 6227.3, and then adding a base value of 55,413.
01:52The R-squared value is the correlation coefficient squared, and that is a measure of
01:57the line's goodness of fit.
01:59Excel creates the trendline based on a series of computations that find the line
02:03that best fits its data.
02:05It's a lengthy process and one that's ideally suited to computers.
02:08In this case, the R-squared coefficient of 0.9497 is extremely high.
02:15Any value over 0.8 is considered to be strong. Any value over 0.9 is considered
02:20to be extremely strong.
02:21So it appears that there is a very high correlation between the number of
02:25articles written about Two Trees Olive Oil Company and their yearly sales.
Collapse this transcript
Analyzing and interpreting data from the Regression tool
00:00When you add a trendline to a chart, you have the option to display the line's
00:04equation and R-squared value.
00:06If you'd like more information, including the data set's variance and standard
00:10deviation, you can display those values in your workbook by using the Regression
00:14tool in the Analysis ToolPak.
00:16I've already created a chart that has the equation and the R-squared value for this dataset.
00:22So in this case, a y value is equal to 1.7394 times X, which are the values along here,
00:30or the values in the left-hand Impressions column, and you add a value of .7333
00:36to generate the y value.
00:38If you want more information, you can use Excel's Regression tool.
00:42To do that, you go on to the Data tab of the Ribbon and then in the Analysis
00:47group, click Data Analysis.
00:50If you don't see the Data Analysis button or the Analysis group on the Data
00:53tab of the Ribbon, that means that you might not have installed the Analysis ToolPak add-in.
00:59I show you how to do that in a movie earlier on in this course.
01:02But now let's assume that it has been installed and click Data Analysis, and then
01:07in the dialog box that appears, click Regression.
01:10So I have selected Regression, and click OK.
01:14Now in the Regression dialog box, I can select my values.
01:17The Y range, which will represent the vertical axis in the chart, are the Sales.
01:23So I'll click the Collapse Dialog button and then select cells B1 through B11.
01:29Now notice that I am not selecting the entire table column, because if I do that,
01:34such as by clicking the column header when my mouse pointer is a downward-
01:38pointing black arrow, I would only select the data cells;
01:41I wouldn't select the label.
01:43I do want the label,
01:43so I am going to leave the mouse pointer as a four-way white cross and drag
01:49from B1 through B11.
01:50And it appears in the dialog box, so I'll click the Expand Dialog button.
01:56Now I can select the values for the X range by clicking in the X Range box,
02:01clicking the Collapse Dialog button, and selecting A1 through A11. Click the
02:09Expand dialog button, and now I can set my other options.
02:13I did select Labels that will make our data easier to understand.
02:16So I will select the Labels check box.
02:18I am going to put this data on a new worksheet ply or a new worksheet,
02:24so I will select that option button.
02:26I don't need to select anything else,
02:28so I will click OK and have Excel generate the results.
02:32Now the results are here on the new worksheet. But rather than work with them
02:36here, I am going to go to a sheet that I prepared earlier that has the original
02:40data, the chart, and the summary output all on the same worksheet, and that is on Sheet3.
02:44So now we can look at all the data at the same time.
02:49The chart displays the R-squared value as well as the equation used to generate
02:55this line. And you can see those same values over here in the bottom part of the Output.
03:00So, for example, the Impressions coefficient of 1.739, 39 repeating, is
03:07equivalent to the first term in the equation, which is 1.7394 times X.
03:10So in other words, we are multiplying the impressions by this number here and
03:17then adding the coefficient of the intercept, which is .7333, repeating.
03:24The intercept is the value that this line crosses the Y axis.
03:29So in other words, we start at .7333, repeating, and for every value in X, we
03:35multiply X by 1.7394 and go up that much.
03:40You might recognize this formula as the slope intercept formula from Algebra 1.
03:44Now the R value, which is the correlation coefficient, appears here.
03:49It's called Multiple R. And again, the correlation coefficient of .99 is
03:54extremely high. The maximum value is 1, so any value of .99 is extremely high.
04:00That means that there is a very strong correlation between the number of add
04:04impressions that an individual has seen and the number of sales that you make to them.
04:08The R-squared value is a more conservative measure.
04:11It is this value squared.
04:13So, for example, if you square the value .5, you get .25, which is extremely low.
04:19So an R-squared of .98 is extremely large.
04:23By the same token, the standard error of .79 is extremely low in relation to the
04:29values in the Sales column.
04:32So that means that we have a very good sense that the number of impressions does affect sales.
04:39I cover calculating and interpreting an F score, which you can find here, using
04:43ANOVA techniques in another movie, but note that the associated probability is
04:47also extremely low--that is the significance. That indicates that there is very
04:52little likelihood that the values' variances are due solely to chance.
Collapse this transcript
Estimating and predicting data using the regression formula
00:00When you perform linear regression analysis in Excel, the program generates an
00:03equation that best describes the data.
00:06Excel uses that equation in numerous ways.
00:09In this movie, I'll show you how to forecast future values without creating a
00:13trendline or a chart.
00:15If you want to forecast future values based on existing values, you can use
00:19the FORECAST function.
00:20So in cell E2, I am going to type =FORECAST(, and now I need to input my X, and
00:28that would be the year.
00:29I am working with yearly sales data, so I want to know, based on current trends,
00:33what would a future year's sales be?
00:35So that will be in cell E1.
00:37That cell is currently blank, but I'll fill in a value in a moment.
00:40Now we have the known Ys.
00:44The Ys are the dependent variable, and in this case those are the Sales.
00:49My Year is my independent variable. Those just happen.
00:52There is no variation.
00:53It's always going to be one year later when it's one year later. But Sales can
00:57vary, and they vary by year.
00:59So we have E1 and our known Ys are in the second column, the Sales column of the table.
01:06I can select that column by moving the mouse pointer over the table column
01:10header, and when it's a downward- pointing black arrow, clicking, and you can see
01:14I get my table reference.
01:17And now we have the known Xs, which are in the Year column, so I can select that.
01:21Type a right parenthesis, press Return, and we get a very small value, and
01:27that's because there is no year, so it's using a zero.
01:29Now, let's say that I want to discover the potential sales for the year 2011.
01:33If I type in 2011 and press Return, then we get a value of 480,285.7, which just
01:41on visual inspection is consistent with the values here in the worksheet.
01:47If I change the year to 2012, then we get a value of 525,000, which again is
01:54consistent with our experience and our existing data.
01:56Now there is one other way to use linear regression in Excel, and that is by
02:02extending a data series.
02:03You might remember from your other experiences with Excel that if you type in
02:08two values in a worksheet cells, select them and then grab the fill handle at
02:14the bottom and drag down, Excel extends the series, so 1, 2, 3, 4, 5, 6, 7,
02:188, 9, 10, and so on.
02:21You can do the same thing with data that doesn't follow a regular
02:24progression and Excel will calculate the Linear Regression formula and
02:29extend the data by those amounts.
02:31So let me show you how to do that in the table.
02:34Let's say that you select cells A2 through B8. And this time I'm not selecting
02:39the labels because they're not numerical values.
02:42If I select all of these cells--and it is important that you select the
02:45entire data set or your results might not be correct--and then you grab the fill
02:49handle and drag down, you can see the tooltip indicating the year that you're going to display.
02:55So we'll go all the way up to 2015, release the mouse button, and Excel extends
03:00the series using the Linear Regression formula.
03:04Each of these techniques lets you extend the data series using the linear
03:07regression shown earlier in the course.
03:09I personally use the fill handles to project future values.
03:12Just be sure that you select the entire data set, so Excel takes every value
03:16into account when it derives its regression equation.
Collapse this transcript
Developing a multiple regression model
00:00In the previous movies in this chapter, I showed you how to have Excel use
00:03linear regression to derive an equation used to project future values, and to
00:08estimate how much of the value's variance was due to the data in your samples.
00:12In this movie, I'll show you how to perform a similar analysis for multiple factors.
00:16This technique is called multiple regression.
00:19When you create your source data table, you need to have two regressors, and
00:23those are the independent variables you want to use in the Regression formula to
00:27derive the dependent variable.
00:28Your two regressors--in this case Impressions, which is the number of times an
00:33individual sees one of your company's ads, and Proximity which is the distance
00:37in miles that a person lives from your store--
00:40we want to see how these two factors when working together relate to the revenue
00:46that you generate, or in other words the amount that you sold them.
00:49To perform multiple regression analysis, you use the Data Analysis tool, and
00:55that is on the Data tab in the Analysis Group.
00:58If you don't see the Analysis Group or the Data Analysis button on the Data tab
01:02of the Ribbon, then you need to install the Analysis ToolPak add-in.
01:06I showed you how to do that in a movie early on in the course.
01:10With the Analysis ToolPak installed, we can click Data Analysis, and then in the
01:15Data Analysis dialog box, click Regression, and click OK.
01:21When we do, the Regression dialog box appears, and we can select our input ranges.
01:27The first input range is the Y range.
01:30In regression analysis, the Y values are the dependent values, and those are the revenue.
01:36So if I click the Collapse Dialog button and click the Revenue column header,
01:42when my mouse pointer is a downward- pointing black arrow, Excel will select all
01:47of the data in that column, and I can click the Expand Dialog button.
01:52Now we need to select my input ranges for the X values, and those are the regressors.
01:56So if I click the Collapse Dialog button and then drag to select columns B
02:05and C, and click Expand Dialog button, B2 through C11 comes in as the input
02:10range for the X values.
02:12Now, I can set my output range.
02:14This is a fairly large table, so I will put it on a new worksheet.
02:17That is selected. Everything looks good,
02:19so I can click OK, and Excel performs the analysis.
02:23I'm going to expand the columns, so it displays the entire value.
02:27I'll do that for A, B, C, and F, G, H, and I. And I'll click so that it's no
02:37longer selected, and we can see all of the values.
02:40I will also increase the zoom because we're going to focus on the data here at
02:45the top, as opposed to the data on the bottom, because you'll be familiar with
02:48this data from previous discussions of regression.
02:50So I'll just zoom to about 130%, and now we can look at the
02:57regression statistics.
02:58Specifically, I want to focus on the R Square value.
03:03The R Square value is also called the goodness-of-fit measure and what it
03:07measures is how closely the equation that Excel derives explains the data.
03:11In other words, how predictive is it.
03:13Given the two regressors, which were ad impressions and proximity to the store,
03:17how well do those two factors in combination describe your data?
03:21You can interpret the R-squared value as a percentage of the final value that is
03:25predicted by the two regressors, and in this case, that percentage is 18.6%.
03:31That's an extremely low value, which means that the two variables together do not
03:35do a good job of explaining the variability in revenue.
03:38When you see a multiple regression result like this, which returns an extremely
03:41low R-squared value, you should perform a simple linear regression analysis on
03:45the independent variable separately to see if one independent variable has a
03:49strong relationship to the dependent variable.
03:52In this case, you would check Ad Impressions to see if related, and Proximity
03:57to see if it related.
Collapse this transcript
Conclusion
Additional Resources
00:00Thanks for taking the time to work through "Excel 2007 Business Statistics".
00:04I hope you've learned a lot, and that you can apply the techniques you have
00:07learned to your business.
00:09If you want to learn more about statistics, there are several resources that you can turn to.
00:13The first is a book by Professor Neil J. Salkind from the University of Kansas,
00:18and his book is called "Statistics for People Who (Think They) Hate Statistics".
00:22It's a great entry-level introduction, and in my opinion, it is the best
00:26statistics book under $100.
00:29The best business statistics book that you can find over $100, in my opinion, is
00:34"Business Statistics: A First Course (5th Edition)" by Levine, Krehbiel, and Berenson.
00:40They go into a lot of detail, many different techniques,
00:43so if you want to add some very advanced techniques to your toolkit, that's the book to get.
00:49And finally, on a related note, we have "Excel 2007 Financial Analysis", one of my
00:54own courses that's available on the lynda.com Online Training Library.
00:58Thanks again for your time, and best of luck with "Excel 2007 Business Statistics".
Collapse this transcript


Suggested courses to watch next:

Excel 2007: Financial Analysis (2h 18m)
Curt Frye


Excel 2007: Data Validation in Depth (58m 36s)
Dennis Taylor


Are you sure you want to delete this bookmark?

cancel

Bookmark this Tutorial

Name

Description

{0} characters left

Tags

Separate tags with a space. Use quotes around multi-word tags. Suggested Tags:
loading
cancel

bookmark this course

{0} characters left Separate tags with a space. Use quotes around multi-word tags. Suggested Tags:
loading

Error:

go to playlists »

Create new playlist

name:
description:
save cancel

You must be a lynda.com member to watch this video.

Every course in the lynda.com library contains free videos that let you assess the quality of our tutorials before you subscribe—just click on the blue links to watch them. Become a member to access all 104,141 instructional videos.

get started learn more

If you are already an active lynda.com member, please log in to access the lynda.com library.

Get access to all lynda.com videos

You are currently signed into your admin account, which doesn't let you view lynda.com videos. For full access to the lynda.com library, log in through iplogin.lynda.com, or sign in through your organization's portal. You may also request a user account by calling 1 1 (888) 335-9632 or emailing us at cs@lynda.com.

Get access to all lynda.com videos

You are currently signed into your admin account, which doesn't let you view lynda.com videos. For full access to the lynda.com library, log in through iplogin.lynda.com, or sign in through your organization's portal. You may also request a user account by calling 1 1 (888) 335-9632 or emailing us at cs@lynda.com.

Access to lynda.com videos

Your organization has a limited access membership to the lynda.com library that allows access to only a specific, limited selection of courses.

You don't have access to this video.

You're logged in as an account administrator, but your membership is not active.

Contact a Training Solutions Advisor at 1 (888) 335-9632.

How to access this video.

If this course is one of your five classes, then your class currently isn't in session.

If you want to watch this video and it is not part of your class, upgrade your membership for unlimited access to the full library of 2,025 courses anytime, anywhere.

learn more upgrade

You can always watch the free content included in every course.

Questions? Call Customer Service at 1 1 (888) 335-9632 or email cs@lynda.com.

You don't have access to this video.

You're logged in as an account administrator, but your membership is no longer active. You can still access reports and account information.

To reactivate your account, contact a Training Solutions Advisor at 1 1 (888) 335-9632.

Need help accessing this video?

You can't access this video from your master administrator account.

Call Customer Service at 1 1 (888) 335-9632 or email cs@lynda.com for help accessing this video.

preview image of new course page

Try our new course pages

Explore our redesigned course pages, and tell us about your experience.

If you want to switch back to the old view, change your site preferences from the my account menu.

Try the new pages No, thanks

site feedback

Thanks for signing up.

We’ll send you a confirmation email shortly.


By signing up, you’ll receive about four emails per month, including

We’ll only use your email address to send you these mailings.

Here’s our privacy policy with more details about how we handle your information.

Keep up with news, tips, and latest courses with emails from lynda.com.

By signing up, you’ll receive about four emails per month, including

We’ll only use your email address to send you these mailings.

Here’s our privacy policy with more details about how we handle your information.

   
submit Lightbox submit clicked