From the course: Executive Guide to Predictive Modeling Strategy at Scale

How much data do I need?

- [Instructor] So now, we're going to cover some critical concepts that come into play whether your data set is large or small. So, how much data is just enough? Well, let's say you're trying to predict churn. A good rule of thumb is that you should have more than a thousand cases that churned that you can use in training the model. But don't forget, you also need non-churns to compare them to. A thousand of those would be helpful. And you also need that test data set. Remember, we'll have our train and test partitions. So, the minimum really starts to add up. It might surprise you how often you start to flirt with these minimums. Too little is more common than too much.

So, here you can see that to meet those very basic requirements, 1,000 churns and 1,000 non-churns in each of the train and test partitions, 4,000 is a nice, round number on the low end. As you can imagine, the 4,000 total isn't the hard part. Sometimes, it's the 2,000 churns, because they are rarer than the non-churns. We don't need to go deeply into the theory behind this, but many data scientists advocate having a third data set, a validation data set. It's a kind of double check. But while this can be helpful, it's tough if we're studying something like aircraft engine failures. We might not have 3,000 of those.

So if you've got a little bit less, first of all, you almost certainly won't worry about the validation data set if you don't have your 3,000. You can live without that. The next thing that you can sacrifice, if you're on the low end, is the test data: you can keep the training data at a thousand plus, but pull back a bit on the number in the test partition. This, within reason, will work just fine. But now what do you do if you still don't have enough? If you have fewer than 1,200 churns, things start to get a bit more complicated, and the data scientist has to start looking into options and dipping into their bag of tricks. Over the last two decades, this exact scenario has been present in at least 20% of my projects. It's not as rare as you think.
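The arithmetic behind these rules of thumb can be sketched in a few lines of Python. This is only an illustration, not the instructor's method. The function names `minimum_cases` and `feasible_plan` are hypothetical. It also rests on two assumptions: that each partition holds as many non-churns as churns, and that the 1,200 figure marks the floor for a "train plus reduced test" setup, which is one reading of the transcript.

```python
def minimum_cases(per_class_per_partition=1000, partitions=("train", "test")):
    """Return (rare-class minimum, total minimum) for the rule of thumb.

    Assumption: every partition gets the same number of churns and
    non-churns (1,000 of each by default).
    """
    churns_needed = per_class_per_partition * len(partitions)
    total_needed = churns_needed * 2  # churns plus an equal number of non-churns
    return churns_needed, total_needed


def feasible_plan(available_churns, per_partition=1000, reduced_test_floor=1200):
    """Hypothetical helper: map the number of churn cases on hand to a setup.

    The thresholds mirror the transcript: 3,000 churns for three partitions,
    2,000 for train and test, and roughly 1,200 as the point below which
    special techniques are needed (this last reading is an assumption).
    """
    if available_churns >= 3 * per_partition:
        return "train + test + validation"
    if available_churns >= 2 * per_partition:
        return "train + test"
    if available_churns >= reduced_test_floor:
        return "train + reduced test"
    return "too few: special techniques needed"


if __name__ == "__main__":
    print(minimum_cases())                 # train and test partitions only
    print(minimum_cases(partitions=("train", "test", "validation")))
    for n in (3500, 2000, 1500, 800):
        print(n, "->", feasible_plan(n))
```

With the defaults, `minimum_cases()` gives the 2,000 churns and 4,000 total cases mentioned above, and adding a validation partition raises the churn requirement to 3,000.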
