From the course: Advanced and Specialized Statistics with Stata

Features of count data - Stata Tutorial

Features of count data

- In this chapter, we're going to learn how to analyze and model count data. Count data is data generated by a process that results in only non-negative integers. In other words, data of this type can only be zero or positive and must increment in whole numbers: zero, one, two, three, four, and so on. There are many examples of count data in real life: how many hurricanes hit a region per month, how many crimes occur in a block per week, or how many children a family has. These are all count data.

Here's a famous historical example of count data and its analysis. This data comes from Ladislaus von Bortkiewicz, who in 1898 was interested in analyzing small numbers, in this example the number of deaths caused by horse kicks in the Prussian Army in the late 1800s. Apparently, this was a real issue back then. You can see that most of the time nobody died; however, one death per corps per year was observed 65 times, and two deaths per corps per year were observed 22 times. This is a classic example of count data. We're going to come back to these numbers a little bit later.

Do we need special models for count data regressions? The short answer is yes. Many count data distributions are positively skewed rather than normally distributed, as ordinary least squares assumes. They also often contain many zeros, which makes transforming the data impractical. This is because a zero stays zero when it is multiplied by anything, and the logarithm of zero is undefined, so the zeros cannot be transformed away. Finally, ordinary least squares regression may predict negative counts, which is something that cannot happen in reality. Therefore we need a different type of model to accommodate these factors.

To explore some of these issues a little bit further, we're going to use Stata commands and functions from the simulation chapter. A new function we'll use in this session is the rpoisson() function, which generates a random count variable that is Poisson distributed.
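To make the point about zeros concrete, here is a minimal sketch rather than anything shown in the session itself. It assumes a simulated count variable (the names y and ln_y, the seed, and the mean of 0.61 are illustrative choices) and applies the log transformation commonly used for skewed data. Stata returns a missing value for ln(0), so every zero count drops out of the transformed variable.

```stata
* Illustrative sketch only: variable names, seed and mean are assumptions
clear
set seed 12345
set obs 200

* A count variable with many zeros
generate y = rpoisson(0.61)

* Log transform: ln(0) is missing in Stata, so zeros cannot be transformed
generate ln_y = ln(y)

* These two counts agree: each zero became a missing value
count if y == 0
count if missing(ln_y)
```

Any model fitted to ln_y would therefore silently discard the zero counts, which are usually the largest group in data like this.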
We'll also use ordinary least squares regression on a count dependent variable and then use the predict command to see what happens. Let's head off to Stata.

Okay, here we are with an empty data set. Do you remember the horse-kick data from earlier? This data from the 1800s was shown to be Poisson distributed by von Bortkiewicz. He calculated that this data had a mean of 0.61. There were roughly 200 observations in that data. So let's generate 200 random observations that are Poisson distributed with a mean of 0.61. We can do this by setting the number of observations to 200 with set obs 200, generating the variable count equal to rpoisson(0.61), and then tabulating the variable count.

This is random data, but with a little bit of luck, it should look very similar to the original data from 1898. There are many zero counts and around sixty-ish one counts, and there should be around twenty to twenty-five two counts. This data mimics the real data we saw earlier. This is randomly generated count data, but you can see that the Poisson distribution played an important part in generating this data. We'll explore this more in the next session.

Let's have a look at some descriptives of the count variable: summarize count, and let's invoke the detail option. If we summarize the count data, we see that the Poisson-generated data has a mean of approximately 0.62. Interestingly, the variance of this count data is about the same as the mean, 0.62. This is called equidispersion, and it's a principle that may be violated with real data. We'll come back to this in another session.

Next, let me show you what happens if we run an ordinary least squares regression model on count data. Let's load up the smoking data set that comes attached to this session: clear, use smoking, describe. In this data set we have various variables that relate to smoking, how many people die, and the age category, agecat.
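The simulation and summary steps described above can be sketched as follows. The commands follow the spoken walkthrough; the set seed line and the final loop are additions that are not part of the session. The seed is an arbitrary value used only to make the draw repeatable, and the loop uses Stata's poissonp() function to show the frequencies a Poisson distribution with mean 0.61 would imply for 200 observations. Exact tabulated counts will differ from run to run and from the session's output.

```stata
* Start from an empty data set with 200 observations
clear
* Assumption: arbitrary seed, added only for reproducibility
set seed 1898
set obs 200

* Draw Poisson counts with the mean reported for the horse-kick data
generate count = rpoisson(0.61)
tabulate count

* Mean and variance should be close to each other (equidispersion)
summarize count, detail
display "mean = " r(mean) "   variance = " r(Var)

* Addition: frequencies implied by Poisson(0.61) for 0, 1 and 2 deaths
forvalues k = 0/2 {
    display "deaths = `k': expected frequency " %5.1f 200*poissonp(0.61, `k')
}
```

The expected frequencies for one and two deaths can be set against the 65 and 22 observed in the historical data.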
A natural analysis would be to ask how smoking and age are related to the number of deaths, and we might want to use ordinary least squares to estimate this. So, let's type the following: regress the number of deaths on smoking and age category. Okay, so our model tells us that smoking is a significant predictor of dying. As for the age category, hm, the results are so-so. But next, let's obtain the predictions from this particular model. We can do that by typing predict count. And lastly, let's tabulate the predictions. And here is the problem: we predicted a negative count. That doesn't make sense; we can't have a negative number of deaths. That is one of the reasons why we want to use count-specific models, such as the Poisson regression model.
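The smoking data set is not reproduced here, so the sketch below illustrates the same problem on simulated data instead. Everything in it is an assumption made for illustration: the variable names x, y, yhat_ols and yhat_pois, the seed, the sample size, and the coefficients used to generate the counts. It fits an ordinary least squares model to a count outcome, counts the negative predictions, and then, as a brief preview of the count-specific alternative named above, does the same after Stata's poisson command.

```stata
* Illustrative simulation only: names, seed and coefficients are assumptions
clear
set seed 2024
set obs 500

* One covariate and a Poisson count whose mean depends on it
generate x = rnormal()
generate y = rpoisson(exp(-1 + x))

* Ordinary least squares on the count outcome
regress y x
predict yhat_ols
* Some linear predictions fall below zero
count if yhat_ols < 0

* Poisson regression: predicted counts are exp(xb), so always positive
poisson y x
predict yhat_pois
count if yhat_pois < 0
```

The linear model has no way of knowing that the outcome is bounded at zero, whereas the Poisson model predicts on the exponential scale and so cannot return a negative count.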
