From the course: The Data Science of Sports Management, with Barton Poulson

Moneyball

- [Instructor] Any discussion of sports and data science has to begin with Moneyball. That curious word comes from a 2003 book by Michael Lewis, which was then turned into a feature film in 2011. It tells the story of the Oakland A's, a Major League Baseball team that, like all teams, wanted to do better, but more than other teams had limited resources to hire the best players. And so the general manager of the A's, Billy Beane, made an unusual decision. Instead of relying on the conventional wisdom that a player's ability was directly correlated with their salary, he would try to find players that were undervalued and that performed better than their current salaries would suggest. It also meant offloading overpriced athletes. But to do that, to find hidden value, he also needed to evaluate performance in new ways, which led him to adopt new quantitative measures of performance as opposed to the standard statistics and scouts' intuitive evaluations.
This approach paid off. In 2002, despite the A's having a payroll that was about one third that of the New York Yankees, 44 million to 125 million dollars, both teams finished their regular season with an extraordinary 103 wins each. And while the Los Angeles Angels of Anaheim ultimately won the World Series that year, the shocking success of the A's new approach led to its quick adoption throughout Major League Baseball and then eventually to other sports as well.
So, what exactly is Moneyball? Well, the short version is that it's rigorous quantitative analysis in sports. So, data science in sports. Specifically, it involves novel data sources and statistical summaries to evaluate player performance and team strategies. The general idea is to search for undervalued performance. Try to find a bargain or try to find a competitive edge.
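To make the idea of a bargain concrete, here is a minimal sketch in Python that works through the 2002 payroll and win figures quoted above. The "cost per win" measure and all of the names in the code are assumptions chosen for illustration; they are not a metric the course or the A's are said to have used.

```python
# Figures quoted in the transcript for the 2002 regular season.
# Payrolls are in millions of US dollars; both teams won 103 games.
TEAMS = {
    "Oakland A's": {"payroll_musd": 44, "wins": 103},
    "New York Yankees": {"payroll_musd": 125, "wins": 103},
}


def cost_per_win(payroll_musd, wins):
    """Millions of dollars of payroll per regular-season win (illustrative measure)."""
    return payroll_musd / wins


# Cost per win for each team, plus the A's payroll as a fraction of the Yankees'.
COSTS = {name: cost_per_win(t["payroll_musd"], t["wins"]) for name, t in TEAMS.items()}
PAYROLL_RATIO = (
    TEAMS["Oakland A's"]["payroll_musd"] / TEAMS["New York Yankees"]["payroll_musd"]
)

for name, cost in COSTS.items():
    print(f"{name}: ${cost:.2f}M of payroll per win")
print(f"A's payroll as a share of the Yankees': {PAYROLL_RATIO:.0%}")
```

With these numbers the A's come out at roughly 0.43 million dollars of payroll per win against roughly 1.21 million for the Yankees, which is one simple way to express the same wins at about a third of the cost.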
Now, it's not something that originated with Billy Beane and the A's. It's based on work by Bill James, a statistician as well as a baseball writer and historian, and in many circles it's known by its name, Sabermetrics, which derives from SABR, the acronym for the Society for American Baseball Research.
Now I can also tell you what Moneyball is not. It's not the box score like this one. This is from the Brooklyn Dodgers and the New York Giants in 1951, that's 66 years ago. And it's not this one. This is essentially the same report. This one's from an 1876 game, that's over 140 years ago, between the Boston Red Caps and the Philadelphia Athletics. And then finally, it's not the standard baseball stats. It's not batting averages. It's not the earned run average for pitchers. Those also date from the 19th century, developed by the statistician Henry Chadwick. Rather, it is new work, it's new data and new methods.
Now, here's the motivation. Even though baseball is a very tradition-oriented sport and these statistics have a very long history, there are a couple of important motivations for going to something new. Number one is to do better with less money, as the Oakland A's tried to do. And it worked for them. They've finished first in their division five times since 2002, when they went all in with the Moneyball approach. And then, as you might guess, if other teams do it, you can do even better with more money, which is the case of the Boston Red Sox. They unsuccessfully tried to hire Billy Beane, but they did hire Bill James, the father of Sabermetrics, and then after the 86-year Curse of the Bambino, in which they never won the World Series, they won it three times, in 2004, 2007, and 2013. And now, not surprisingly, every single team in Major League Baseball uses this approach to try to get a competitive edge. So, if you want to do this, you're first going to need some data.
And what kind of data do you use for Moneyball or, properly, Sabermetrics? Well, you're going to use traditional box scores and baseball records; you do use those. They work well because baseball has only a certain number of discrete game configurations. People are either on first base, or second, or third, as opposed to basketball or football, where they can be anywhere. And so it's easy to count things. There are also massive digitized records. Retrosheet is a project in which many volunteer contributors have researched historical sources, both quantitative records like the box scores and play-by-play narratives, like the ones given by Harry Caray, of every game in baseball going back to 1871, and put them in digital format for download or analysis by people who are interested in Sabermetric projects. And then there's new data from systems like PITCHf/x and, more recently, Major League Baseball's Statcast system. Those measure an extraordinary range of actions, like the spin rate of the pitched ball, the exit velocity of a hit ball, or the exchange time of a fielder catching and then throwing the ball.
In addition to these new sources, there are new measures of performance. And here's just a sampling of some of the measures. So, for instance, wRC+ stands for weighted runs created, compared to the league average; it's a good overall measure of a player's hitting. Or ISO, that's isolated power; it's a measure of a hitter's extra-base power, getting past first base. DIPS is defense-independent pitching statistics; there are several of them. Those measure a pitcher's effectiveness based only on statistics that don't involve fielders, except the catcher. There's WAR, that's wins above replacement player. For hitters this encompasses defense, hitting, and base running; for pitchers it encompasses defense-independent pitching and leverage in the bullpen. There's also DRS, for defensive runs saved.
DRS is the number of runs a player saved or cost his team on defense relative to an average player. And so you can see that all of these are more complicated than simply the batting average or number of home runs, but what they do is make this new analysis possible.
And once you get this data, there are some things you're going to do with it. First off, you're going to do just basic descriptive models. Those simply ask, how many runs did a person get? What was their WAR score? And this is usually done with a lot of SQL, Structured Query Language, against databases, where you're simply pulling out values, adding them up, dividing, and getting these sums and these ratios. This is where most of the value has traditionally been found in Sabermetrics over the last decade or so, but as more and more teams adopt it, the competitive edge disappears a little bit, and so you have to get more sophisticated. That means you can start taking data, especially this very fine-grained data from the video capture systems, and start doing predictive modeling. The easy progress has been made, and so to find value you have to use these more diverse datasets and machine learning algorithms to generate these predictive models, both for players and for plays on the field.
And so now Moneyball is everywhere. The general approaches of rigorous quantitative analysis in sports have been used successfully in basketball, in football, in soccer, and in nearly every other sport, showing that there is a lot to be gained, both in terms of performance and in terms of profitability, by using data carefully and rigorously in a sports setting.
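The descriptive step mentioned above, pulling out values, adding them up, and dividing, can be sketched with SQL run from Python's built-in sqlite3 module. The table layout, column names, and the two stat lines are hypothetical and invented purely for illustration. The sketch computes batting average (hits divided by at-bats) and ISO, using the standard definition of isolated power as extra bases per at-bat.

```python
import sqlite3

# Hypothetical season totals, invented for illustration (not real players).
# Columns: name, at_bats, hits, doubles, triples, home_runs
ROWS = [
    ("Player A", 500, 150, 30, 5, 20),
    ("Player B", 400, 100, 10, 0, 5),
]


def summarize(rows):
    """Return (name, AVG, ISO) per player using plain SQL sums and ratios."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE batting (name TEXT, ab INTEGER, h INTEGER, "
        "doubles INTEGER, triples INTEGER, hr INTEGER)"
    )
    conn.executemany("INSERT INTO batting VALUES (?, ?, ?, ?, ?, ?)", rows)
    query = """
        SELECT name,
               1.0 * SUM(h) / SUM(ab) AS avg,
               -- ISO counts only the extra bases beyond first base
               1.0 * SUM(doubles + 2 * triples + 3 * hr) / SUM(ab) AS iso
        FROM batting
        GROUP BY name
        ORDER BY name
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


SUMMARY = summarize(ROWS)
for name, avg, iso in SUMMARY:
    print(f"{name}: AVG {avg:.3f}, ISO {iso:.3f}")
```

In this made-up data, Player A hits .300 with an ISO of .200, while Player B hits .250 with an ISO of about .063, so the two measures separate hitters who look closer together on batting average alone. A real workflow would presumably query play-by-play or season tables from a source such as Retrosheet rather than a hand-typed list.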
