- [Instructor] Baseball was the first major sport to go through a data revolution, as was so vividly described in Michael Lewis' book Moneyball and the movie of the same name. The reason baseball was first is because it has a long tradition, over 100 years, of counting everything. So, they had the data all wrapped up in nice, complete databases that allowed analysts to look for hidden value and hopefully get a competitive edge, and it's been enormously effective. Just ask the Boston Red Sox who, after an 86-year drought, have won the World Series three times since they adopted this data-intensive approach in late 2002.
But what about education? How does this apply to teaching, learning, and educational management? Well, data in education is kind of a different thing. There is some passive data from attendance and graduation rates. By passive, I mean you don't have to do anything extra to get the data; it's just there in the records, and that's the way baseball data is. They play the game, and it simply shows up in the records. On the other hand, in education, most data comes from standardized testing, and the problem is that test preparation and test administration are time-consuming, to put it mildly, and expensive.
And so, the average American student, for example, from pre-K to 12th grade will take about 112 mandatory standardized tests, and they're not evenly distributed across grades. When a student is in a testing year, they can spend up to 100 hours of class time preparing for the tests and another 50 taking them, which is about 15% of the roughly 1,000 hours in the classroom each year. In addition, the cost associated with test administration can be as high as $800 per student per year, not counting the cost of lost instruction time.
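The 15% figure follows directly from the hours just quoted. As a quick check, here is a minimal sketch of that arithmetic; the variable names are only for illustration, and the numbers are the upper-bound figures mentioned above.

```python
# Upper-bound figures quoted in the transcript for a testing year.
prep_hours = 100                  # class time spent preparing for tests
testing_hours = 50                # class time spent taking tests
classroom_hours_per_year = 1000   # rough total classroom hours per year

# Share of classroom time that goes to testing-related activity.
testing_share = (prep_hours + testing_hours) / classroom_hours_per_year

print(f"Testing-related share of class time: {testing_share:.0%}")
```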
And so, yeah, this is a very active process to gather the data in education, and really, it's kind of intrusive. Now, it wouldn't be such a problem if it made a really big difference, like the data did for the Red Sox. But more testing doesn't generally seem to be associated with more learning, so there's a really big mismatch there. It lets you know that data in education is an area that really is ripe for some significant innovation, and data science can be one method that can bring some new energy and utility to analysis in education.
There are a few reasons for this. Number one, data science allows researchers to use more diverse data. Now, big data and data science are different things, but you may know that big data has the three Vs: volume, velocity, and variety, where variety refers to the different kinds of data. Not just a nice structured database, but bringing in free text, bringing in images and audio and video. Data science methods allow educational researchers to bring in an enormous quantity of diverse datasets.
In addition to that, it allows researchers to use a lot more passive data. Again, that means data that already exists and doesn't take any extra effort. So, not just attendance records, but, for instance, a teacher's open-ended remarks on graded exams, or things that people post online, or information about how much time students spend on the computer in a particular class, or how many times they have to repeat a quiz question. That's all data that can be incorporated if you have the right methods for using it in the analysis, and that's what data science makes possible.
Third, because data science has a very strong association with the business world in terms of e-commerce and commercial social media marketing, it has a tradition of a very strong focus on prediction. It's trying not simply to describe what's happening or to know why it's happening, but really to ask, what's going to happen next and what do we need to do? There's also a strong focus on ROI, or return on investment. Now, I'm not saying that we're talking about financial investments in education, though there is an element of that, that's part of the accountability equation of education. Mostly I'm asking, are you getting the most return for the time and the energy that you put into teaching or planning or managing? Data science has this focus on prediction and high-impact activities, and that could be very helpful in education.
Also, because data science is focused on prediction and has traditionally focused on individual consumers and trying to get them to take individual actions, it tends to take a very nuanced and individualized approach. You can bring in a lot of context information, you can bring in a lot of historical information about a single individual, and you can make recommendations and predictions that are specific to that one individual or to a micro segment, and that gives a lot of added utility in educational research.
Finally, data science methods are designed to be updated rapidly and to do so at scale, at large volumes. This is very different from a standard research project that can take a year or two to conduct. The idea here is that you can update it possibly even every day. And as things change for a student or for a classroom or for a school district, then the evaluations, the programs, and the recommendations can be updated in near real-time. Now, I wanna mention a little bit about the actual practice of data science.
Please remember, the focus of this course is not technical. I'm not here to show you how to actually conduct these things; we have other courses available for that. Instead, this is conceptual. It's an overview of what is possible in data science. I will mention that there are some common methods used in data science, in education and anywhere else. They start with standard, familiar regression models that are used in a lot of fields. They're very powerful, they're very flexible, and I would always recommend that a person try using them.
But there's a lot more that you can do than that. So, for instance, some of the best predictive research comes out of the use of Bayesian models, or models that use information from previous sources to get what's called a prior probability, which is then explicitly integrated into the new model to get updated probabilities. Also in data science, there are techniques like decision trees, and an ensemble or collection of decision trees that's called a random forest. These are nonparametric methods that work very, very differently from, say, regression models, but are also easy to interpret and can give really precise segments for decision-making and prediction.
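To make the idea of a prior probability being updated a little more concrete, here is a minimal sketch of one simple Bayesian model, a Beta prior updated with pass/fail counts. This is just one illustrative choice, not a method this course prescribes, and the function name and all of the numbers are hypothetical.

```python
def update_beta(prior_alpha, prior_beta, passes, fails):
    """Combine a Beta(prior_alpha, prior_beta) prior with new pass/fail counts.

    For a Beta prior on a pass rate, the updated (posterior) distribution is
    again a Beta distribution whose parameters are the prior parameters plus
    the newly observed counts.
    """
    return prior_alpha + passes, prior_beta + fails


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior: earlier records suggest roughly an 80% pass rate,
# encoded as if we had already seen 8 passes and 2 fails.
prior_alpha, prior_beta = 8, 2

# Hypothetical new data: on a new quiz, 3 students pass and 7 fail.
post_alpha, post_beta = update_beta(prior_alpha, prior_beta, passes=3, fails=7)

print("Prior mean pass rate:  ", beta_mean(prior_alpha, prior_beta))
print("Updated mean pass rate:", beta_mean(post_alpha, post_beta))
```

The updated estimate lands between what the earlier records suggested and what the new quiz alone would suggest, which is the sense in which the prior is explicitly integrated into the new model.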
If you wanna get more sophisticated, it's possible to use neural networks and their variation, deep learning models, which have been enormously influential in data science recently. All of these fall into the general rubric of machine learning and artificial intelligence, but it doesn't mean that you always have to fire up some whole server farm and do something huge. Again, a regression model, which can be done on a single computer, is an excellent first start. These are some of the methods that I will be referring to in this course, but again, I'm speaking conceptually here. I want you to be aware that these things exist, and when I talk about data science methods, this is often what I'm referring to.
But taken all together, the idea here is that, as we apply data science to education, maybe we can bring the success and the creativity of the Moneyball approach to education. We can use data science to help plan curriculum, to allocate classrooms, to create schedules, to track the progress and engagement of students, to predict problems and intervene before they become serious issues, and to conduct more context-sensitive evaluations of educational programs. And hopefully, all of this will allow students to spend more time learning, less time testing, allow teachers to spend more class time on activities that will have the highest learning impact, and allow schools to have more flexibility in working proactively to meet their community's needs.
And in that way, we can bring the extraordinary success of the Moneyball approach from baseball to education.