From the course: 11 Useful Tips for Regression Analysis

Weighted regression

- [Instructor] The European Commission conducts a Standard Eurobarometer survey multiple times a year. Basically, it's a face-to-face interview regarding citizens' perceptions of ongoing issues. It's given to approximately 1,000 citizens of each European country, with a few exceptions. However, 1,000 citizens represent a different percentage of the population in each country. For example, that's only 0.0015% of France's population, but almost two and a half percent of the population of Monaco. And as you can imagine, this type of sampling method could lead to oversampling or undersampling of particular groups. What a perfect use case for the concept of weighting!
Most weights in data come in two forms: frequency weights or probability weights. Frequency weights are integers and tell your software how many times an observation should be counted. A weight of 10 means the observation should count as 10 observations. Basically, it's a 10-times multiplier. Probability weights, or sampling weights, represent the probability that an observation will be selected into the sample from a population. They're often transformed to inverse probabilities. For example, a person who has a probability of 0.1 of being selected for a survey would have a sampling weight of 10. This observation is then representative of 10 similar subjects. Higher inverse probabilities indicate a lower probability of selection and more weight.
Now, you might be wondering, do regression models need weights? Well, yes and no. The argument is that if you have a well-specified regression model, then the inclusion of weights becomes less important, since our explanatory variables control for all important differences anyway. For example, in the Eurobarometer survey, we might be able to use population size as a control variable in our regression. However, in that case, we must be confident that we have the right variables available and the right functional form.
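The two kinds of weights described above can be sketched in a few lines of Python. This is only an illustration and not part of the course materials: the helper `weighted_mean` and every number in it are made up for the example. It checks that a frequency weight of 10 behaves like writing the same observation out 10 times, and it turns selection probabilities into sampling weights by taking their inverse.

```python
def weighted_mean(values, weights):
    """Weighted average: sum(w * x) / sum(w)."""
    total_weight = sum(weights)
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical data: three survey answers on a 1-5 scale.
answers = [2, 4, 5]

# Frequency weights: how many identical observations each row stands for.
freq_weights = [10, 1, 1]
expanded = [2] * 10 + [4] + [5]      # the same data written out row by row
freq_mean = weighted_mean(answers, freq_weights)
expanded_mean = sum(expanded) / len(expanded)

# Sampling weights: the inverse of each respondent's selection probability.
selection_probs = [0.1, 0.5, 0.25]   # assumed probabilities, for illustration
sampling_weights = [1 / p for p in selection_probs]
sampling_mean = weighted_mean(answers, sampling_weights)

print("frequency-weighted mean:", freq_mean)
print("mean of expanded data:  ", expanded_mean)
print("sampling weights:       ", sampling_weights)
print("sampling-weighted mean: ", sampling_mean)
```

The respondent with the lowest chance of selection gets the largest weight, so their answer pulls the weighted mean furthest away from the plain average of the three answers.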
In addition, weights are often generated via complex statistical models and can include variables that you won't have access to. For example, many survey weights include detailed geographic information that will remain hidden from users so that they can't identify personal information. In such cases, you can't add these variables to your regression, so the only way to account for them is to use weights. What this means in practice is that even in regression models, you should generally use weights if they are available. For univariate statistics such as averages or standard deviations, you should always use the appropriate weights.
Let's go and have a look at an example. Here we are in Stata, and I've already loaded a training data set called census. Let's take a look at what's in the data by using the describe command. This data set contains census information on US states from 1980. It records the number of deaths, births, marriages, and divorces, and the median age, by state. And because of that, it only has 50 observations. So this is quite a small data set. We also have information on each state's population.
Now, let's go ahead and run a simple regression. Let's say we want to explain the number of births a state has by the median age of that state, and also by how many marriages and divorces take place in that state. Our regression model would look something like this. The results indicate that the marriage and divorce variables are statistically significant in explaining births, whereas median age is not. But let's ignore statistical significance for a moment and focus on the coefficient of median age. Our model says that a one-year increase in a state's median age decreases the number of births in that state by 569, holding the other variables constant. But of course, different states have different population sizes. We're currently treating every state as an equal, and that is not correct. We need to weight this regression using population data. So let's go ahead and do that.
In Stata, we can weight regressions by adding the weight type and weight variable in square brackets to the regression command, like so. Our results have changed dramatically. Our observation count has exploded to 225 million observations, which reflects the US population in 1980. All the variables are highly statistically significant, and the effect of median age is now larger. Each one-year increase in a state's median age decreases births by 1,114. So that's a big change, and I'd argue that this is a more accurate estimate because it takes the population size of each state into account.
Weights are not just useful for summary statistics or regression statistics; they can also be used to visualize data. For example, we can plot the relationship between median age and births on a scatter plot and weight each state by its population. And here we can see why our regression results changed. Some states are significantly larger than other states, and those large states would have pulled the regression slope in a different direction. So, as you can see, the case for using weights actually carries some weight. Use them if they are available, and always compare your weighted and unweighted regression results carefully.
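To see how a few large states can pull a regression slope, here is a minimal Python sketch of weighted least squares with a single predictor. It is not the Stata session from the video: the function `fit_line` is a hypothetical helper, and the five "states" are invented numbers, not the census data. The sketch also checks that weighting by population gives the same line as repeating each state once per unit of population, which is how frequency weights are interpreted.

```python
def fit_line(x, y, w=None):
    """Fit y = intercept + slope * x by (weighted) least squares."""
    if w is None:
        w = [1.0] * len(x)           # equal weights give ordinary least squares
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical "states": median age, births (thousands), population (millions).
median_age = [27.0, 29.0, 30.0, 31.0, 33.0]
births = [40.0, 300.0, 60.0, 250.0, 30.0]
population = [1, 20, 2, 15, 1]

ols_intercept, ols_slope = fit_line(median_age, births)
wls_intercept, wls_slope = fit_line(median_age, births, population)

# Frequency-weight interpretation: repeat each row `population` times.
rep_x = [a for a, n in zip(median_age, population) for _ in range(n)]
rep_y = [b for b, n in zip(births, population) for _ in range(n)]
rep_intercept, rep_slope = fit_line(rep_x, rep_y)

print(f"unweighted slope: {ols_slope:.2f}")
print(f"weighted slope:   {wls_slope:.2f}")
print(f"replicated slope: {rep_slope:.2f}")
```

In this made-up data the two large states dominate the weighted fit, so the weighted slope differs noticeably from the unweighted one. Note that the sketch only reproduces point estimates. With frequency weights the software also treats the sample as if it had one observation per person, which is why the observation count and the significance levels change so sharply.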
