You have a collection of data. You would like to quickly identify the mode of the data.
- [Instructor] The mode is the value in a list that appears most frequently. In this video, we are going to discuss an algorithm for finding the mode. We're going to look at how the mode of a list can be found using run length encoding. We're going to break the problem of run length encoding into parts and then write the code for our function. Then we're going to use run length encoding to find the mode of a dataset and compute the mode of our 2015 away runs dataset.
I am in my virtual machine, and I need to go back up to the very top of my Baseball dataset and I need to do yet another import. This import is going to be import Data.Ord. We need this for a function that we'll use later on in this video. Now, let's restart and rerun all. It'll take a moment. There we go. Great. Now, let's create a list called myList that we will use in order to demonstrate the mode.
So, imagine we have a list consisting of the values four, four, five, five, four. The value that appears the most frequently in this list, of course, is four. Now, I would like to introduce an algorithm known as run length encoding. Run length encoding is an algorithm for lossless compression, and it has a few interesting applications. We can find the mode of a list by first running run length encoding. And in order to compute a run length encoding, we need to understand how elements group together.
So, there is a function in Data.List called group. With group, we can create a list of lists, where each sublist is a run of adjacent equal values. So, here we have myList, and then we have group myList, which gives us four, four, then five, five, then four. Now, what we can easily do is take this grouping and count the elements in each sublist, thus creating a run length encoding.
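As a minimal sketch of what group does with the example list, assuming myList is defined as a list of Integer values (the exact definition is not shown in the transcript, and the name grouped is only for illustration):

```haskell
import Data.List (group)

-- The example list from the video; the Integer element type is an assumption.
myList :: [Integer]
myList = [4, 4, 5, 5, 4]

-- group collects runs of adjacent equal elements into sublists.
-- It does not sort, so the trailing 4 ends up in its own sublist.
grouped :: [[Integer]]
grouped = group myList
-- grouped == [[4,4],[5,5],[4]]
```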
So, let's create a function to represent run length encoding. We need this to be of type for our values. We're going to accept a list of any element as input, and then we are going to return a list of tuples, each holding one of those elements followed by an integer, where the integer represents the number of elements in that sublist. So runLengthEncoding is going to take whatever list we get in, group it, and map over the resulting sublists.
With each sublist, we will first get the head of the list, and second we will get the genericLength of xs. The list that we map over is the result of group. So, there we go. If I pass myList to runLengthEncoding, I get the run length encoding of our original list, where each pair, in order, gives the element that is seen and how many times in a row that element is seen.
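The function described above might look like the following sketch. The transcript does not show the exact type signature, so the Eq constraint here is an assumption (it is the minimum that group requires), and the lambda variable name run is only for illustration:

```haskell
import Data.List (genericLength, group)

-- Pair each run's value with the length of that run.
-- Assumption: Eq is the constraint used here; the video's signature may differ.
runLengthEncoding :: Eq a => [a] -> [(a, Integer)]
runLengthEncoding xs = map (\run -> (head run, genericLength run)) (group xs)
-- head is safe here because group never produces an empty sublist.
```

With this definition, runLengthEncoding [4, 4, 5, 5, 4] evaluates to [(4,2),(5,2),(4,1)].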
So, four, two, five, two, four, one. There'll be an even number of elements, and for convenience's sake we put them in tuples. If I do runLengthEncoding with an empty list, I get back an empty list, but here's where it gets interesting. If I do runLengthEncoding and I first sort my list of values, I now have a list of tuples where all of the fours are grouped together and all of the fives are grouped together.
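To see the effect of sorting first, here is a small sketch. It repeats an assumed definition of runLengthEncoding so that it stands alone, and the name sortedEncoding is hypothetical:

```haskell
import Data.List (genericLength, group, sort)

-- Assumed definition of the run length encoding function (Eq constraint is a guess).
runLengthEncoding :: Eq a => [a] -> [(a, Integer)]
runLengthEncoding xs = map (\run -> (head run, genericLength run)) (group xs)

-- Sorting places equal values next to each other, so each distinct
-- value forms exactly one run and its count is its total frequency.
sortedEncoding :: [(Integer, Integer)]
sortedEncoding = runLengthEncoding (sort [4, 4, 5, 5, 4])
-- sortedEncoding == [(4,3),(5,2)]
```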
I have three fours and I have two fives. Now what I can do is perform run length encoding on the sorted version of my dataset and then look for whichever tuple has the highest second value. So, this next algorithm computes the mode of a list using the run length encoding function. Here we are using a function called maximumBy. maximumBy is found in the Data.List library, and it takes a comparison, which we build with comparing from Data.Ord. We are comparing based on the second value of each tuple, that is, snd, and that integer, as we identified earlier, is the length of a sublist.
All our mode function does is sort the values, pass that data to run length encoding, and then find which element in the list has the highest second value, thus representing the mode. So, if I pass an empty list to my mode function, I get back Nothing, and if I pass my example list from earlier in this video to mode, I get back Just (4, 3). The first element in the tuple is the most frequently seen element, and the second is how many times that element is seen. Four is seen three times. We've been working with a baseball dataset, and we have our away team runs, so now we can find which away team run total appears most frequently in the 2015 baseball season.
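Putting the pieces together, a mode function along these lines could be sketched as follows. The Maybe return type is inferred from the Nothing and Just results described above; the type constraints and the exact layout are assumptions rather than the video's code:

```haskell
import Data.List (genericLength, group, maximumBy, sort)
import Data.Ord (comparing)

-- Assumed definition of the run length encoding function.
runLengthEncoding :: Eq a => [a] -> [(a, Integer)]
runLengthEncoding xs = map (\run -> (head run, genericLength run)) (group xs)

-- Sort, run length encode, then pick the pair with the largest count.
-- An empty list has no mode, so it is handled separately with Nothing.
mode :: Ord a => [a] -> Maybe (a, Integer)
mode [] = Nothing
mode xs = Just (maximumBy (comparing snd) (runLengthEncoding (sort xs)))
```

With these definitions, mode [4, 4, 5, 5, 4] evaluates to Just (4,3). If several values share the highest count, this sketch reports only one of them, whichever maximumBy selects.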
So, mode awayRuns, and we see that the answer is two. There were 379 games in the season in which the away team scored two runs, and that is the most frequently seen value. So, in this section, we read data stored in a CSV file using the Text.CSV library, and we implemented the descriptive statistics functions for the range, mean, median, mode, and standard deviation. These functions will become our DescriptiveStats module for future sections.
In our next section, we will begin using SQLite3.
Note: This course was created by Packt Publishing. We are pleased to host this training in our library.
- Data ranges, means, and medians
- Standard deviation
- SQLite3 command line
- Slices of data
- Regular expressions
- Atoms and modifiers
- Character classes
- Line plots of a single variable
- Plotting a moving average
- Feature scaling
- Scatter plots
- Normal distribution
- Kernel density estimation (KDE)