From the course: Advanced and Specialized Statistics with Stata

Data generating process (DGP) - Stata Tutorial


- [Instructor] In this session, we're going to explore how we can use repeated random number generation to influence something called the data generating process, often referred to as the DGP. The data generating process allows users to build custom datasets with known statistical properties from nothing. For example, we can create continuous and categorical variables from nothing and also specify what kind of relationships these variables might have with each other. This, in turn, allows us to examine the properties of statistical estimators. There are no new commands or functions we're going to introduce in this session. Instead, we're going to use the previously introduced rnormal and runiform functions to generate random variables out of thin air and then specify a relationship between these variables and other variables. Finally, we'll try to estimate this predefined relationship using Stata's most popular estimation command, regress.

So let's head over to Stata. Here we are with an empty dataset and an empty do-file. Let's begin by calling the set obs command to tell Stata to increase our sample size from zero to 1,000. Set obs 1000. Run that. Okay, next, let's generate three new random variables: two x variables using the uniform distribution and one e variable using the normal distribution. So let's type gen x1 equals runiform, gen x2 equals runiform, and finally, gen e1 equals rnormal. The idea behind the names of these variables is that the x variables will mimic observable explanatory variables in what we do next, and the e variable mimics an unknown, unobserved error term. Let's execute all three. Finally, let's generate another new variable called y, which is a function of the previous variables. So, for example, gen y equals one plus one times x1 minus two times x2 plus one times e1. Execute that. In this case, the relationship between y and the other variables is as follows: y equals one plus one times x1, minus two times x2, plus one times e1.
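Collected in one place, the commands described above look like this in a do-file. The set seed line is an addition for reproducibility and was not part of the spoken session; the seed value is arbitrary:

```stata
* Start from an empty dataset with 1,000 observations
clear
set obs 1000

* Optional: fix the random-number seed so results are reproducible
* (hypothetical seed value, not from the session)
set seed 12345

* Two observable explanatory variables, drawn from the uniform distribution
generate x1 = runiform()
generate x2 = runiform()

* One unobserved error term, drawn from the standard normal distribution
generate e1 = rnormal()

* The data generating process: y = 1 + 1*x1 - 2*x2 + 1*e1
generate y = 1 + 1*x1 - 2*x2 + 1*e1
```

Because we wrote the generate statement for y ourselves, we know the true coefficients (1, -2) and the true constant (1) exactly, which is what lets us judge the estimator in the next step.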
The full set of command lines that generated these variables is the data generating process, and the important thing is that we know the variables' properties and relationships with each other exactly. So now let's assume for a moment that we live in some sort of super reality where we can observe everything. In other words, we can see the error term, e1. So let's see what happens if we now run a regression of y against x1, x2, and e1. So let's run regress y against x1, x2, and e1. Execute that. In this case, applying ordinary least squares regression to this data will return an exact fit. Look at that, we have an R-squared of one. Our regression manages to hit every single data point exactly. In the real world, this is probably not very likely. So now let's assume that there are some things we can't see. Specifically, let's assume that the error term is hidden from us. So let's modify our regression, remove the error term, and re-estimate it. We have now estimated a relationship between y and x1 and x2, and we see that we recover roughly the right kind of numbers. We estimate a coefficient of approximately one for x1, a coefficient of approximately minus two for x2, and a constant of approximately one. Great. If our data really was shaped like this, the ordinary least squares estimator would appear to recover the underlying relationship in our data pretty well. So as you can see, generating random data and seeing how estimators perform on such data is very easy in Stata.
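The two regressions from this walkthrough can be sketched as follows. The first includes the error term (the "super reality" case) and should fit perfectly; the second omits it, which is the realistic case:

```stata
* Super reality: the error term e1 is observable, so include it.
* OLS returns an exact fit here: R-squared of 1, and the
* coefficients match the DGP exactly (1, -2, 1, constant 1).
regress y x1 x2 e1

* Realistic case: e1 is hidden, so we leave it out.
* The estimates should now be only approximately right:
* roughly 1 for x1, roughly -2 for x2, and a constant near 1.
regress y x1 x2
```

Because x1 and x2 were drawn independently of e1, omitting the error term adds noise to the estimates but does not bias them, which is why the second regression still recovers roughly the true coefficients.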
