In this video, the instructor shows how to clean financial data to remove incorrect values and winsorize data.
- [Instructor] All right, now you've gathered your data and you've put it all in a spreadsheet; the question is, what comes next? Well, gathering the data alone is not enough. We need to clean it up. I like to say that data are a lot like kids: they often get dirty. What I mean by that is that all kinds of problems come up with data. For example, you might have various errors in your data. Transposition errors are very common, for instance. A transposition error simply means that one column of data gets transposed with another.
Perhaps stock price has gotten mixed up with gross margins, for example. We need to have a way to go through and clean that up and remove those kinds of errors. Data availability can also change. One classic example of this is SIC codes versus NAICS codes. Now you might not be familiar with either of these, but both types of codes represent the industry that a particular firm is in. SIC codes were used prior to the year 2000 and NAICS codes were used after 2000.
Both of them represent the industry that a particular company is in. Unfortunately, there's no way to go between the two types of codes. These are codes used by the US government for classifying firms, but if you have a company that was started after the year 2000, let's say Facebook, there's no way to go and match that with an SIC code from the 1990s. That kind of data availability issue can present a big problem. We've also got all kinds of unobservable effects that might come up.
For example, we might be looking at sales forecasts and historical sales data, but we don't have data on customer sentiment in the area or data on what our competitors are doing. Those kinds of unobservable effects can mar our ability to effectively analyze a particular set of data. As a result, we need to try to use large sample sizes; using lots of data is very helpful. But just as importantly, we need to clean up our data through testing.
We need to start by looking for outliers and winsorizing, that is, looking for extreme values in our data and then capping them at a less extreme level, flagging them, or deleting them. Once we've cleaned up our data set, we're frequently going to need to put multiple sets of data together. This sounds really easy, but it's not. We need to start by determining relationships. For example, what should a database that contains both GDP information and customer information look like? We're trying to match up data on the macroeconomy and gross domestic product with data on our individual customers.
How should we be dealing with that problem? Another issue that comes up deals with data frequency. Different kinds of data have different frequencies. For example, oil price data is available every day. Jobs data, that is, unemployment figures, come out once a week. Auto sales come out once a month. How are we going to put all of these things together into a single data set? One common approach is to aggregate everything to the coarsest frequency, as in the sketch below.
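The course itself works in Excel rather than code, but as a rough illustration of the idea, here's a minimal pandas (Python) sketch that rolls daily, weekly, and monthly series up to a common monthly frequency. The series names (oil_price, jobless_claims, auto_sales) and all of the values are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Three illustrative series at different frequencies.
days = pd.date_range("2024-01-01", periods=90, freq="D")
oil = pd.Series(rng.normal(80, 2, len(days)), index=days, name="oil_price")  # daily

weeks = pd.date_range("2024-01-05", periods=13, freq="W-FRI")
claims = pd.Series(rng.normal(220, 10, len(weeks)), index=weeks, name="jobless_claims")  # weekly

months = pd.date_range("2024-01-01", periods=3, freq="MS")
autos = pd.Series([1.30, 1.28, 1.35], index=months, name="auto_sales")  # monthly

# Aggregate the finer series up to month starts, then line everything up.
combined = pd.concat(
    [oil.resample("MS").mean(), claims.resample("MS").mean(), autos],
    axis=1,
)
print(combined)
```

Averaging within the month is just one choice; depending on the variable, you might instead take the last observation of the month or the sum.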
Recall the data that I showed you previously on employee theft. I'd gone through and cleaned up that data so that we had each employee and whether or not they worked on a particular day. If we're trying to look at different types of data, we're going to need to deal with all of these issues. We've also got different types of data that we might be dealing with. For example, we can have time-dependent data, data that changes over a period of time. Unemployment rates, for example, or prices. Similarly, we have what's called cross-sectional data.
That's data that does not change with time but changes based on some other factor. For example, different customers will have different levels of sales associated with them, or different characteristics, such as their addresses or the geographies where they're located. That doesn't change with time, but it does change by customer. We need a way to merge and clean up all of this data. How are we going to do this? Well, to begin with, we need to decide on a unit of analysis.
What is it that we're trying to analyze? Are we trying to figure out and project sales to a particular customer in the future? Then our unit of analysis should be the customer. Are we trying to figure out and project sales on a given day for our company? Then our unit of analysis should be the date. Whatever our unit of analysis is, we need to determine that and then merge our data together appropriately. We also need to find a common variable or feature to merge on.
For example, in data that changes with time, we want to merge on date; for cross-sectional data, which if you recall does not change with time, we might merge on something like zip code.
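Again purely as an illustration (the course does this in Excel, and the column names sales, gdp_growth, zip, and median_income are invented), here's a minimal pandas sketch of both kinds of merges:

```python
import pandas as pd

# Time-series data: merge on date.
sales = pd.DataFrame({"date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
                      "sales": [120.0, 135.0]})
gdp = pd.DataFrame({"date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
                    "gdp_growth": [0.6, 0.4]})
by_date = sales.merge(gdp, on="date", how="left")

# Cross-sectional data: merge on something like zip code.
customers = pd.DataFrame({"customer_id": [1, 2], "zip": ["10001", "94105"]})
region = pd.DataFrame({"zip": ["10001", "94105"], "median_income": [72000, 88000]})
by_zip = customers.merge(region, on="zip", how="left")
```

A left merge keeps every row of the unit-of-analysis table even when the other table has no match, which makes it easy to spot gaps afterward.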
Again, once we've merged our data set, we need to go back and clean it up again. Nearly all large data sets have some kind of issue that results from merging. For example, one possible issue is data entry errors: perhaps a particular data point has simply been entered incorrectly in one of our merged data sets. In order to deal with this, we want to review univariates, that is, the means and the medians for each variable. Look at what those means and medians are, and if something looks suspicious, go through and identify and isolate which data points are driving that change in the mean or median. We'd also want to look for fraudulent data. One of the best tools for identifying fraudulent data is what's called Benford's law. Benford's law was developed in the 1930s by a scientist at General Electric, and it says that in genuine, non-fraudulent data, we can predict with great certainty the frequency of each leading digit, one through nine.
In genuine data, the digit one should be the most common leading digit, followed by two, followed by three, and so on, all the way up through nine. In other words, genuine data should have many leading digits that are small and close to one, and few that are close to nine. That's just a good check to go through in order to determine whether or not there might be any issues with your data.
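Purely as a sketch of what such a check could look like, here's a Python snippet that compares observed leading-digit frequencies against Benford's expected frequency, log10(1 + 1/d) for leading digit d. The sample amounts are made up, and in practice you'd run this on thousands of values:

```python
import numpy as np
import pandas as pd

def leading_digit(values: pd.Series) -> pd.Series:
    """First significant digit of each nonzero value."""
    x = values.abs()
    x = x[x > 0]
    return (x / 10 ** np.floor(np.log10(x))).astype(int)

# Benford's expected frequency for each leading digit d is log10(1 + 1/d).
digits = np.arange(1, 10)
expected = pd.Series(np.log10(1 + 1 / digits), index=digits, name="expected")

amounts = pd.Series([123.4, 1890.0, 24.5, 3.1, 118.0, 1402.0, 19.9, 160.0])
observed = leading_digit(amounts).value_counts(normalize=True).sort_index()
comparison = pd.concat([observed.rename("observed"), expected], axis=1)
print(comparison)  # large gaps from the expected column can hint at manufactured numbers
```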
We'd also want to look for atypical data. Check for outliers, for example. Look at the 99th percentile and the 1st percentile in our data set. Do those look dramatically different from the mean for a particular variable? If so, maybe we want to throw out those data points that are outliers, or at least flag them so that we can run our regression with and without them, or winsorize them as described earlier. Again, there are no hard-and-fast answers for how to clean up our data. A lot of this is simply a process of going through and understanding what's going on in the underlying data itself and then making choices from there.
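Here's a minimal pandas sketch of that univariate review, percentile flagging, and winsorizing. The gross_margin column and its values are invented, the tiny sample is only for readability, and the 1st/99th-percentile cutoffs are just one common choice:

```python
import pandas as pd

df = pd.DataFrame({"gross_margin": [0.32, 0.35, 0.31, 0.34, 8.90, 0.33]})

# Univariate review: compare the mean against the median and the tails.
print(df.describe(percentiles=[0.01, 0.5, 0.99]))

# Flag anything outside the 1st-99th percentile band rather than deleting it,
# so the regression can be run with and without the outliers.
lo, hi = df["gross_margin"].quantile([0.01, 0.99])
df["flag_outlier"] = (df["gross_margin"] < lo) | (df["gross_margin"] > hi)

# Winsorize: cap extreme values at the percentile cutoffs instead of dropping them.
df["gross_margin_w"] = df["gross_margin"].clip(lower=lo, upper=hi)
print(df)
```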
Join Professor Michael McDonald and discover how to use predictive analytics to forecast key performance indicators of interest, such as quarterly sales, projected cash flow, or even optimized product pricing. All you need is Microsoft Excel. Michael uses the built-in formulas, functions, and calculations to perform regression analysis, calculate confidence intervals, and stress test your results. You'll walk away from the course able to immediately begin creating forecasts for your own business needs.
- Understanding big data and predictive analytics
- Gathering financial data
- Cleaning up your data
- Calculating key financial metrics
- Using regression analysis for business-specific forecasts
- Performing scenario analysis
- Calculating confidence intervals
- Stress testing