From the course: Introduction to Stata 15

Distributional analysis (numerical) - Stata Tutorial

Distributional analysis (numerical)

- [Narrator] In this session, we're going to explore initial distributional analysis. When we encounter a continuous variable such as price, income, or earnings, we often want to understand more about its distributional properties. Does it have a bell-shaped distribution, or is it skewed? Does it have short tails or long tails? Such questions are often explored graphically, and we will do that in another session, but we can also explore them numerically, and that's what we'll focus on here. The first command I will introduce is inspect, which produces a rough histogram and gives you a little more distributional information about a variable. We'll also look at summarize with the detail option, which presents additional distributional statistics on a variable. Finally, we'll use the skewness and kurtosis test, sktest, to formally test whether a variable is normally distributed. So let's head off to Stata and practice these commands on the auto training data.
Here we are in Stata with auto.dta already loaded. Let's assume for a moment that there is a continuous variable that we are really interested in. Let's say price, for example. Let's summarize price. A summary of price reveals that it has a mean of around 6,000, a standard deviation of around 3,000, a minimum value of around 3,000, and a relatively high maximum value of nearly 16,000. If you want to explore the variable in more detail, you can first inspect it, so let's type inspect price. This reveals to us that all values in the variable price are positive. That's good: a price of zero or a negative price would be a bit worrying for something like a car. We also see that there are seventy-four unique values, which clearly indicates that this is a continuous variable, as each observation has a unique and distinct value.
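
The first two steps described above can be typed as follows. This is a minimal sketch that assumes the auto training data is loaded with sysuse, since the transcript does not show how the narrator loaded it.

```stata
* Load the auto training data that ships with Stata (assumed loading step)
sysuse auto, clear

* Basic summary: number of observations, mean, standard deviation, min, max
summarize price

* Rough histogram plus counts of negative, zero, positive, and unique values
inspect price
```
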
We can also observe a rough histogram that tells us that a lot of the data is bunched up on the left, and there appears to be a long tail going to the right. This confirms our initial suspicion that we have a non-normally distributed variable. We can obtain more detailed statistics on price by invoking the detail option of summarize. So, summarize price, and add the option detail. There's a lot of information here, so let's take a few moments to make sure we understand it. The percentiles tell us the value of the variable at each percentile. In this case, at the 50th percentile, the value of price is approximately $5,000. These values give us an idea of how the variable is distributed. Stata reports the values for a wide range of percentiles. The columns Smallest and Largest show the four smallest and four largest observations of this variable. The number of observations, mean, and standard deviation are not new, so we'll skip these. But here in the bottom right, we find additional information on the variance, the skewness, and the kurtosis of price.
Skewness is a measure of the lack of symmetry of a distribution. If the distribution is symmetric, the coefficient of skewness is zero. If the coefficient is negative, the distribution is said to be skewed to the left. If the coefficient is positive, the distribution is said to be skewed to the right. Kurtosis is a measure of the peakedness, or tailedness, of a distribution. The smaller the coefficient of kurtosis, the flatter the distribution and the thinner its tails; the larger the coefficient, the heavier the tails. The normal distribution has a coefficient of kurtosis of three, and that provides a convenient benchmark. In our example, we can see that price is right-skewed and has a kurtosis above three, suggesting heavier tails than a normal distribution. We can ask Stata to formally test whether the variable price deviates from a normal distribution by calling the sktest command.
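
The detailed summary can be reproduced with the commands below. As a sketch, it also displays the median, skewness, and kurtosis from the results that summarize stores in r() when the detail option is used; the sysuse loading step is an assumption, as the transcript starts with the data already loaded.

```stata
* Load the auto training data (assumed loading step)
sysuse auto, clear

* Percentiles, smallest and largest values, variance, skewness, kurtosis
summarize price, detail

* summarize with detail leaves these results behind in r()
display "Median:   " r(p50)
display "Skewness: " r(skewness)
display "Kurtosis: " r(kurtosis)
```

A positive skewness together with a kurtosis above three is the pattern described for price: a long right tail that is heavier than a normal distribution's.
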
So we type sktest price. sktest performs the skewness and kurtosis test for normality, testing skewness and kurtosis both separately and jointly. P-values below 0.05 indicate that we reject the normality assumption at the 5% level. And we can see that for price, we do reject the hypothesis that it is normally distributed. So there we are: the three commands outlined in this session are very easy to use and will help you quickly get a better understanding of the continuous variables in your data.
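
The normality test above can be run as follows. This is a minimal sketch, again assuming the data is loaded with sysuse. The Shapiro-Wilk test is a separate Stata command, swilk, included here only for comparison and not part of the narrated session.

```stata
* Load the auto training data (assumed loading step)
sysuse auto, clear

* Skewness and kurtosis test for normality:
* reports p-values for skewness, kurtosis, and the joint test
sktest price

* Shapiro-Wilk test for normality, a different test (for comparison only)
swilk price
```
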
