Join Barton Poulson for an in-depth discussion in this video Big O, part of Data Science Foundations: Fundamentals.
- [Voiceover] An important topic in data science, in order to avoid getting yourself in trouble, is big O and function performance. This is important when time is of the essence. The deal is this: functions that you can perform on your data vary in speed. The growth rate of a function, that is, how much more time it takes as the data grows, is called its order. That's why it's called big O; the O is for order. Big O describes these growth rates, and it turns out that differences in growth rates can be dramatic and sometimes counter-intuitive.
So, for instance, some functions are very speedy. You have what's called O(1), that's constant time, like determining whether a binary number is even or odd. O(log(n)), that's a logarithmic function. Or O(n), a linear function. Again, those are all pretty quick. Moderate includes O(n log(n)). That's something that you would use for a fast Fourier transform, and it's actually a very fast shortcut for what it would be otherwise. And then in terms of slow, and when I say slow, I mean extraordinarily slow, there's O(n²), that's quadratic.
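That even-or-odd check is a nice concrete O(1) example: you only have to look at the last bit of the binary number, no matter how large it is. Here's a minimal sketch in Python (the course demo itself is in R; the function name `is_even` is just illustrative):

```python
def is_even(n: int) -> bool:
    """Constant-time O(1) check: a binary number is even or odd
    depending only on its lowest bit, regardless of its size."""
    return (n & 1) == 0  # inspect the last bit only

print(is_even(10))      # an even number
print(is_even(3**100))  # takes the same single step even for a huge number
```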
And then there's O(2^n), that's exponential. And then O(n!), n factorial, that's about the slowest thing you could ever do. I'll give you an example of how this works. Some functions are good. Insertion sort, that's where you're sorting items in the list in place. Now, in the best circumstances, an insertion sort is very fast; it's of order n, and that's good. On the other hand, insertion sort on average is really slow, so it can vary from linear all the way to quadratic.
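To make that variability concrete, here's a rough Python sketch of insertion sort (not the demo's R code) that counts how many elements it has to shift. On already-sorted input it does no shifting at all, which is the linear best case; on reversed input it shifts n(n−1)/2 times, the quadratic worst case:

```python
def insertion_sort(items):
    """Sort a copy of the list; best case O(n) on sorted input,
    average and worst case O(n^2)."""
    a = list(items)
    shifts = 0  # count element moves to illustrate the work done
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]  # shift larger elements right
            j -= 1
            shifts += 1
        a[j + 1] = key
    return a, shifts

print(insertion_sort([1, 2, 3, 4, 5]))  # best case: 0 shifts
print(insertion_sort([5, 4, 3, 2, 1]))  # worst case: 10 shifts for n = 5
```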
On the other hand, a very similar-sounding process, the selection sort, is at best slow and on average slow. And so you get this big variability in insertion sorts, but very little variability in selection sorts. It doesn't feel like they should be different, and that's why it's important to know what you're getting yourself into when you decide which functions you're going to use. I'll give you a short example of how this works in R. What we're going to do here is actually graph the speed of these various functions.
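The reason selection sort shows so little variability is that it always scans the entire rest of the list for every position, whatever order the data is in. A hypothetical Python sketch (again, not the R demo code) that counts comparisons makes this visible; sorted and reversed inputs cost exactly the same:

```python
def selection_sort(items):
    """Sort a copy of the list; always makes n(n-1)/2 comparisons,
    so it is O(n^2) in the best, average, and worst case alike."""
    a = list(items)
    comparisons = 0
    for i in range(len(a)):
        min_idx = i
        for j in range(i + 1, len(a)):  # scan the whole remainder every time
            comparisons += 1
            if a[j] < a[min_idx]:
                min_idx = j
        a[i], a[min_idx] = a[min_idx], a[i]
    return a, comparisons

print(selection_sort([1, 2, 3, 4, 5]))  # 10 comparisons even when sorted
print(selection_sort([5, 4, 3, 2, 1]))  # 10 comparisons when reversed, too
```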
And I'm going to be using a package called RColorBrewer to color things. It's a wonderful thing if you're making graphs. I'm going to load that up. And then I'm going to set up a graph here. I'm going to put in the limits for the plot, for X and for Y. Then I'm going to choose a particular palette from RColorBrewer. Then I'm going to graph my functions. First, here's the fast one, the constant. And what I'm going to do is plot that one using these various parameters here.
And now you can see, I'll zoom in on that, it's constant. It doesn't matter how many elements you have, it always takes exactly the same amount of time. A logarithmic function takes a little longer. We'll plot that. And you can see, it goes up, but it goes up very slightly. A linear function takes a little bit longer. Now, I know that looks like a big increase, but really it's nothing compared to what we're going to see. The moderate ones, the log-linear, again, it feels a lot slower, but for what it's doing with a fast Fourier transform, it is amazing.
So that's a lot bigger. For instance, for this one, if you're doing your operations and you have just 10 elements, it's going to take about 23 times as long as a constant operation. Then we get to the slow ones, the quadratic. You can see how that one just goes straight up through the roof. Then the exponential. If you're familiar with the traveling salesman problem, this is the faster way of solving it, using what's called dynamic programming. That's very high.
And then the worst one of all is factorial. This is what you get when you're trying to solve the traveling salesman problem using just a brute-force technique, where you're searching every single possible combination. The amount of time that's going to take, you can see right here, is practically straight up. In fact, if we come back here to the green line (let me put a legend on this, by the way), I'm going to put a little vertical reference line at 10 elements. So I've got my dotted line here. Now at 10 elements, the constant takes just, you know, one unit of time.
The logarithmic takes a little bit longer, but if you go to the factorial, it's going to take over three and a half million times as long. So, unless you're very, very patient, you want to know whether the algorithm you're using, the function you're going to use, is going to be anywhere close to efficient in terms of time and computational resources. So, our conclusions from this brief demonstration are these: functions vary dramatically in speed. They can also vary unexpectedly by task.
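The numbers at that 10-element reference line can be recomputed in a few lines. Here's a Python sketch (a rough equivalent of the R demo's curves; natural logarithms are assumed, which matches the "about 23 times" figure for n log n):

```python
import math

n = 10  # the vertical reference line in the demo
growth = {
    "constant O(1)":         1,
    "logarithmic O(log n)":  math.log(n),       # natural log, about 2.3
    "linear O(n)":           n,
    "log-linear O(n log n)": n * math.log(n),   # about 23, as in the demo
    "quadratic O(n^2)":      n ** 2,
    "exponential O(2^n)":    2 ** n,
    "factorial O(n!)":       math.factorial(n), # 3,628,800
}
for name, value in growth.items():
    print(f"{name:24s} {value:>12,.1f}")
```

The factorial row is where the "over three and a half million times as long" figure comes from: 10! is 3,628,800 units of time against the constant function's one.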
And you always need to be aware of the computing demands, which are going to be a function both of the size of the data set and of the particular tasks that you're trying to execute.