You have a collection of data. You would like to quickly identify the mean and standard deviation of the data.
- [Instructor] The next descriptive statistics we will cover are the mean, also called the average, and the standard deviation. In this video, we will explore the sum and length functions, use them to compose our mean function, and then use that mean function to compose a standard deviation function. Finally, we're going to compute the mean and standard deviation of the 2015 away team runs using our functions. The mean is a summary statistic which gives you a quick idea of the middle of a data set, while not truly being the middle.
The mean is trivial to calculate and thus it is frequently used. It is simply the sum of the data set divided by the number of values in that data set. We will also discuss the standard deviation, which is roughly the typical distance of the values from the mean and a measure of a data set's spread. The approach that we will be using is known as the sample standard deviation. I have the function presented here for your reference. Let's go over to our Linux environment, and here's where we left off discussing the range of a data set. I would like to scroll up to the very top of our notebook and add a new import.
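For reference, the two definitions described above can be written out as follows. This is the standard textbook form of the mean and the sample standard deviation, not a copy of the on-screen slide; the symbols n (number of values), x_i (the i-th value), x-bar (the mean) and s (the sample standard deviation) are notation assumed here.

```latex
\[
  \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
  \qquad\qquad
  s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}
\]
```

The n - 1 in the denominator is why the sample standard deviation needs at least two values.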
And the import which I would like to add is import Data.Maybe. Now because I have added a library, what I like to do each time I add libraries is to do a restart and rerun all. It's okay to do this. It will take a moment and it will reload all of our variables. Close that down. In order to compute the mean of a data set, we add up all the values and divide by the number of values.
So in order to find the sum of all the values in that list, we use sum awayRuns. There were 10,091 runs scored in the 2015 season by away teams. We also need to know the length of awayRuns. There were 2,429 games. We divide the first number by the second and we get our average. But we need to explore the types of the sum and the length functions. We can see that sum takes a list of values and returns a value, and both the input and the output are constrained by the Num type class.
Whereas the input to length isn't constrained by anything, and it always returns an Int. Now the division operator in Haskell doesn't work with Int, so what we need to do is convert the values returned by sum and length to something that we can work with. And so the expression that we will be using is realToFrac (sum awayRuns) divided by fromIntegral (length awayRuns).
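The conversion step described above can be sketched as follows. The helper name averageOf is hypothetical and is not the function built in the video; the season totals are the two figures quoted in the transcript. The types in the comment are the list-specialised forms, and recent versions of GHC report a more general Foldable version.

```haskell
-- Specialised to lists, the two Prelude functions have these types:
--   sum    :: Num a => [a] -> a
--   length :: [a] -> Int

-- Hypothetical helper: divide a list's total by its element count.
-- (/) needs Fractional operands, so both sides are converted first.
-- Note: an empty list gives 0/0, which is NaN for Double.
averageOf :: [Int] -> Double
averageOf xs = realToFrac (sum xs) / fromIntegral (length xs)

-- The season totals quoted in the transcript, divided the same way.
seasonAverage :: Double
seasonAverage = realToFrac (10091 :: Int) / fromIntegral (2429 :: Int)
```

Evaluating seasonAverage gives a value that rounds to the 4.15 runs per game quoted in the video.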
So our average is 4.15 runs per game scored by away teams in the 2015 season. We use this information in order to compose our mean function. And I already have it copied, so let's paste that. Much like our range function, we have a return type of a Double that's been packaged into a Maybe. And we have a list of values that are constrained by the Real type class. Now our function uses pattern matching in order to handle the variety of inputs that we will likely receive, much like we did with the range function in the last video.
So if we have a list of no values, we return Nothing. Now it's best that we return Nothing and not zero, because zero could be a legitimate mean of a data set. If we have a single value, well then we're just going to return that value bundled in a Just. And if we have a longer list, well then we're actually going to use the sum and length functions that we described earlier. So let's test this out. If we get the mean of an empty list, we should get Nothing.
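The three cases described above can be sketched as follows. This is a reconstruction from the spoken description, under the assumption that the signature is Real a => [a] -> Maybe Double as stated; the code pasted in the video may differ in detail.

```haskell
-- Mean of a list, wrapped in Maybe so an empty list has no answer.
mean :: Real a => [a] -> Maybe Double
mean []  = Nothing                 -- no values: no mean to report
mean [x] = Just (realToFrac x)     -- one value: it is its own mean
mean xs  = Just (realToFrac (sum xs) / fromIntegral (length xs))
```

In a notebook cell, mean ([] :: [Int]) should evaluate to Nothing, and mean [1, 2, 3 :: Int] should evaluate to Just 2.0. The type annotation on the empty list is only there so the compiler knows which Real instance to use.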
If we get the mean of a single value, we should get that value converted to a Double. And if we take the mean of a list of three values, we should get their average. Now any function which uses our mean function is going to have to interpret the value inside of a Maybe. So in order to do that we use a function called fromJust, which comes from the Data.Maybe import we added earlier. I am now going to find my pre-written code for the standard deviation. I do this because it's easier just to copy and paste this code than to type it up.
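Unwrapping a Maybe with fromJust, as mentioned above, might look like the sketch below. The name someMean is a hypothetical stand-in for a result returned by the mean function, and fromMaybe is shown only as an alternative that the video does not use.

```haskell
import Data.Maybe (fromJust, fromMaybe)

-- Hypothetical stand-in for a value returned by the mean function.
someMean :: Maybe Double
someMean = Just 4.15

-- fromJust extracts the value from a Just.
-- Applied to Nothing it raises a runtime error, so it should only be
-- used where Nothing has already been ruled out.
unwrapped :: Double
unwrapped = fromJust someMean

-- Alternative not used in the video: fromMaybe takes a fallback
-- value that is returned when the Maybe is Nothing.
withFallback :: Double
withFallback = fromMaybe 0 Nothing
```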
I'm going to paste this here and we're going to talk about the standard deviation code. Much like the mean function which we wrote earlier, we have our inputs constrained by the Real type class and we will be returning a Double packaged in a Maybe. And for historic reasons I call this function stdev, because spreadsheet software and statistical packages call this particular function stdev. And this is a recreation of the formula which I showed you in the beginning of this video, which produces the sample standard deviation.
It's important to note that the sample standard deviation requires at least two values in order to compute a spread. You can't very well compute a spread with one value, and so we need to use pattern matching in order to detect that. So if we have an empty list, we return Nothing. If we have a list of just one item, we still return Nothing. And from there we actually implement the formula necessary for the sample standard deviation. Let's do a few tests.
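A minimal sketch of a stdev function matching this description is given below. It is a reconstruction, not the code pasted in the video: the local names n, avg and sumSquares are assumptions, and the mean function is repeated so the block stands alone.

```haskell
import Data.Maybe (fromJust)

-- Mean of a list, as described earlier in the video.
mean :: Real a => [a] -> Maybe Double
mean []  = Nothing
mean [x] = Just (realToFrac x)
mean xs  = Just (realToFrac (sum xs) / fromIntegral (length xs))

-- Sample standard deviation: square root of the summed squared
-- deviations from the mean, divided by (n - 1).
stdev :: Real a => [a] -> Maybe Double
stdev []  = Nothing                -- no values: no spread
stdev [_] = Nothing                -- one value: still no spread
stdev xs  = Just (sqrt (sumSquares / (n - 1)))
  where
    n          = fromIntegral (length xs)
    avg        = fromJust (mean xs)  -- safe here: xs has two or more values
    sumSquares = sum [ (realToFrac x - avg) ^ 2 | x <- xs ]
```

As a small check, stdev [1, 2, 3 :: Int] should evaluate to Just 1.0: the mean is 2, the squared deviations sum to 2, and 2 divided by (3 - 1) is 1.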
So the standard deviation of an empty list is Nothing. The standard deviation of a single item is still Nothing. And the standard deviation of our awayRuns is 3.12. What we can do with that information is take our average of 4.15 and subtract 3.12, and then take our average of 4.15 again and add 3.12. And I can say that the one standard deviation range of our away team runs for the 2015 season is 1.03 runs to 7.27 runs.
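The arithmetic for the one standard deviation range can be sketched as a small helper. The names oneSigmaRange and awayRange are hypothetical; the inputs 4.15 and 3.12 are the rounded figures quoted in the transcript.

```haskell
-- Hypothetical helper: the interval one standard deviation either
-- side of the mean, returned as (lower bound, upper bound).
oneSigmaRange :: Double -> Double -> (Double, Double)
oneSigmaRange avg sd = (avg - sd, avg + sd)

-- The rounded mean and standard deviation quoted in the video.
awayRange :: (Double, Double)
awayRange = oneSigmaRange 4.15 3.12
```

Up to floating-point rounding, awayRange is the 1.03 to 7.27 range stated above.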
And that gives us a good idea of where the majority of the scores were for away teams in the 2015 season. So in this video we looked at the mean and the standard deviation of a data set. We discussed the sum and the length functions, which were necessary for those calculations, and we implemented the mean and stdev functions. And then we did a few examples of how we could find the mean and standard deviation with the functions that we prototyped. In the next video we will be discussing the median of a data set.
Note: This course was created by Packt Publishing. We are pleased to host this training in our library.
- Data ranges, means, and medians
- Standard deviation
- SQLite3 command line
- Slices of data
- Regular expressions
- Atoms and modifiers
- Character classes
- Line plots of a single variable
- Plotting a moving average
- Feature scaling
- Scatter plots
- Normal distribution
- Kernel density estimation (KDE)