From the course: Advanced and Specialized Statistics with Stata

What is survival data? - Stata Tutorial

What is survival data?

- [Instructor] In this chapter we're going to explore the concept of survival analysis. Sometimes this is also called event history analysis or duration analysis. Survival analysis is used when the dependent variable in a regression is time, t. Time can be measured in seconds, minutes, hours, or any other unit, and the unit may even be unknown. The key thing to be aware of is that we do not usually use ordinary least squares estimation when time is the dependent variable. There are many reasons for this, but a key one is that ordinary least squares inference relies on normally distributed errors. Time to an event is typically skewed rather than normal, and it can't take negative values. So we need a different methodology to ensure that we don't get silly predictions like negative time.

Another important point is that survival analysis often involves graphics. It doesn't have to, but generally there's a strong visual element to this kind of analysis. Here's an example of survival analysis. This study started with 100 people, and after a certain amount of time only 40 people were left in the study. After even more time only one person was left. The research questions that we are interested in are: What is the shape of this function? How do we estimate this function? How many people survive after a particular time? And what is the probability of surviving beyond a certain time point? We'll seek to answer these questions using a variety of different methodologies.

What kind of methodologies exist in survival analysis? There tend to be three main methodologies: nonparametric, semi-parametric, and parametric. Each imposes a different amount of structure on the data, and we'll explore these in detail in separate sessions later. But in layman's terms they can be explained as follows. Nonparametric analysis lets the data speak for itself. It makes no assumptions about the shape of the survivor function. You see what is there.
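To make the nonparametric idea concrete, here is a minimal sketch of one common way to estimate a survivor function directly from raw durations, the Kaplan-Meier (product-limit) estimator. The transcript does not name this estimator, so treat the choice as an assumption. The function name `kaplan_meier`, the `events` flag for censored observations, and the ten data points are all hypothetical and exist only for illustration.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: observed times, in any consistent unit
    events:    1 if the event was observed at that time,
               0 if the subject left the study without the event (censored)
    """
    # Distinct times at which at least one event was actually observed
    event_times = sorted({t for t, e in zip(durations, events) if e == 1})

    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t
        at_risk = sum(1 for d in durations if d >= t)
        # Subjects who experienced the event exactly at time t
        failed = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        # Multiply by the share of at-risk subjects who survive past t
        survival *= 1 - failed / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical data: 10 subjects, times in arbitrary units
durations = [2, 3, 3, 5, 6, 6, 8, 9, 9, 12]
events = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

curve = kaplan_meier(durations, events)
for time, prob in curve:
    print(f"t = {time:>2}: estimated P(survive beyond t) = {prob:.3f}")
```

Each pair in `curve` answers the question "what is the probability of surviving beyond this time point?" using nothing but the observed data. When every subject's event is observed (all `events` equal to 1), the estimate reduces to the plain fraction of the sample still in the study.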
On this graph, you can see the survivor function for two types of patients, and this is pretty much just a reflection of the underlying raw data. We might think this is a good approach. But a disadvantage is that if something is not in the data, we don't see it, and that can be an issue if we want to predict outside our sample range or if our sample is very small. And we will quickly end up with very small samples if we want to break our results down by many variables. So that can be a problem.

One solution is to mix nonparametric analysis with parametric analysis, and we call this semi-parametric analysis. Here we let the raw data determine one core survivor, or hazard, function, where all variables are set to zero. A variable such as patient type or gender will then simply move this underlying function proportionally up or down. In other words, variables do not change the shape of the function itself; they simply shift what's called the baseline function up or down. So the effects of all variables are proportional to each other. This is shown in this graph, where both types of patients follow roughly the same kind of survival pattern, but at different levels.

The final type of analysis is called parametric analysis, and this specifies the kind of shapes the survivor or hazard functions can take by drawing them from various distributions. This kind of model imposes a lot of form on the data and your results. The effect of variables is similar to the semi-parametric model in that they move the survivor function up or down, but this time as a function of the underlying distribution. As before, the effects of all variables remain proportional to each other.

So which should you choose? If you're uncertain of what to do, or you're exploring your data, then it's usually best to go with a nonparametric or semi-parametric model. 
But if you know exactly what you want, or you have a lot of covariates to analyze, and you understand the assumptions that different distributions impose on your data, then parametric methodologies can offer a significant efficiency advantage.
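The "proportionally up or down" idea described for the semi-parametric and parametric models can be sketched numerically. In the sketch below a covariate x scales a baseline hazard by exp(beta * x), which is the usual proportional hazards form. The Weibull baseline, the parameter values, and every function name are assumptions chosen only to make the example concrete. In a semi-parametric model the baseline would be left unspecified and taken from the data, while in a parametric model it would come from a chosen distribution, as it does here.

```python
import math


def baseline_hazard(t, scale=0.1, shape=1.5):
    """Assumed Weibull baseline hazard: h0(t) = scale * shape * t**(shape - 1)."""
    return scale * shape * t ** (shape - 1)


def baseline_survivor(t, scale=0.1, shape=1.5):
    """Matching baseline survivor function: S0(t) = exp(-scale * t**shape)."""
    return math.exp(-scale * t ** shape)


def hazard(t, x, beta):
    """Hazard for covariate value x: the baseline scaled by exp(beta * x)."""
    return baseline_hazard(t) * math.exp(beta * x)


def survivor(t, x, beta):
    """Survivor function for covariate value x: S0(t) ** exp(beta * x)."""
    return baseline_survivor(t) ** math.exp(beta * x)


beta = 0.7  # hypothetical effect of a 0/1 covariate such as patient type
times = [1.0, 2.0, 5.0, 10.0]

# The ratio of the two groups' hazards is the same at every time point
hazard_ratios = [hazard(t, 1, beta) / hazard(t, 0, beta) for t in times]

for t, ratio in zip(times, hazard_ratios):
    print(f"t = {t:>4}: hazard ratio = {ratio:.4f}, "
          f"S(t | x=0) = {survivor(t, 0, beta):.3f}, "
          f"S(t | x=1) = {survivor(t, 1, beta):.3f}")
```

The hazard ratio equals exp(beta) at every time point, which is what "proportional" means here: the covariate shifts the level of the curve without changing the shape of the baseline. Setting x to zero returns the baseline function itself.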
