When to use Survival Analysis instead of Regression?

“And what is a Survival function?”, I heard you asking

Tarek Amr
5 min read · Mar 14, 2024


Regression is an important item in every Data Scientist’s toolbox. You probably use it all the time; and you should. But sometimes it isn’t the right tool for the job.

Here is an example to show you why:

You want to predict the height of a golf ball after 5 meters, given the player’s technique, the strength of their shot and, say, the wind speed. How would you estimate it?

Regression, you say?

Correct!

Now, what if I told you: for some reason, there was a fridge in the way of your golfers.

The fridge is blocking the golfers’ way — Image created by the author

You wish that nasty fridge wasn’t there. It has nothing to do with your experiment.

Too bad, you have to deal with it.

And sorry, you won’t get the chance to re-do this experiment. Those colorful golfers are too busy now to help you again.

Now you have to think outside your toolbox.

Option One: I am still using Regression anyway

Alright, you sure can use regression, but you have to exclude the purple and the pink golfers. The fridge was in their way. Only the green one is left.

Too few samples to base your regression model on.

Not a good idea!

Option Two: Adjust the problem to fit my experiment

Rather than predicting the height of the ball, how about predicting the trajectory of the ball?

Rather than a point estimate (height after 5 meters), you are now going to predict a function h(d) that represents the height of the ball, h, versus all possible distances, d.

This way, not only will the green golfer give us data points to train on, all the way up to 5 meters, but the purple and the pink golfers will also provide us with additional data points. Not up to 5 meters, but maybe up to 2 or 3 meters. Not bad, if you combine all this information to estimate your h(d).

The good thing about this approach is that once you know h(d), you can give it any value of d and get back the expected height at that distance.
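To make the idea concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that all three golfers hit the same kind of shot and that the trajectory is a parabola through the origin, h(d) = a·d + b·d². The data points are made up, and fit_trajectory is a hypothetical helper name, not something from a library.

```python
# Hypothetical observations as (distance in meters, height in meters).
# The green golfer is observed up to 5 m; the other two stop at the fridge.
green = [(1, 0.9), (2, 1.6), (3, 2.1), (4, 2.4), (5, 2.5)]
purple = [(1, 1.0), (2, 1.7), (3, 2.0)]
pink = [(1, 0.8), (2, 1.5)]


def fit_trajectory(points):
    """Least-squares fit of h(d) = a*d + b*d**2 (assumed shape)."""
    # Sums that appear in the 2x2 normal equations.
    s_d2 = sum(d**2 for d, _ in points)
    s_d3 = sum(d**3 for d, _ in points)
    s_d4 = sum(d**4 for d, _ in points)
    s_dh = sum(d * h for d, h in points)
    s_d2h = sum(d**2 * h for d, h in points)
    # Solve the normal equations with Cramer's rule.
    det = s_d2 * s_d4 - s_d3 * s_d3
    a = (s_dh * s_d4 - s_d3 * s_d2h) / det
    b = (s_d2 * s_d2h - s_d3 * s_dh) / det
    return lambda d: a * d + b * d**2


# Pool the complete and the interrupted trajectories together.
h = fit_trajectory(green + purple + pink)
print(f"Estimated height at 5 m: {h(5):.2f}")
```

The interrupted shots still help pin down the early part of the curve, even though they never reach 5 meters.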

This, my friend, is the key concept behind survival analysis, but instead of the nasty fridge you have the impatient ticking of time.

A classic use of survival analysis is estimating the effect of a medicine on patients.

You give a group of patients a medicine, and you want to know how long they will live (survive) afterwards.

I am not sure about you, but I am not patient enough to wait for 80 years for all the patients to die to conclude my experiment.

Thus, rather than a point estimate (time to death), we are now going to predict a Survival function S(t) that represents the probability that a patient survives until time t.

How are Survival functions estimated?

Alright, all is fine and dandy so far, but how do we estimate this survival function over time, S(t)?

There are multiple approaches to this, and we can divide them into parametric and non-parametric approaches.

Parametric Survival Functions

In this approach, we assume that the survival function takes a specific shape.

We know all patients were alive at the beginning of the experiment, i.e. they are all still surviving at t=0, and thus S(0) = 1. We also know that more patients die over time, so S(t) decreases over time until it reaches zero, when no patients are left alive. One function that looks like this is the following exponential decay function:

Exponential survival function: S(t) = exp(−λt), where λ > 0 is the rate of decay

Now, all you need to do is use the data points you collected to find the value of λ that makes this function fit your data, i.e. curve fitting.
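One common way to do this fitting when some patients are still alive at the end of the study is maximum likelihood, rather than plain least-squares curve fitting. For the exponential survival function S(t) = exp(−λt), the maximum-likelihood estimate of λ has a simple closed form: the number of observed deaths divided by the total time all patients were observed. The sketch below assumes made-up study data, and the function names are hypothetical.

```python
import math

# Hypothetical study data: how long each patient was observed (in years), and
# whether death was observed (True) or the patient was still alive when the
# study ended (False).
durations = [2.0, 3.5, 1.0, 5.0, 5.0, 4.0]
died = [True, True, True, False, False, True]


def fit_exponential_rate(durations, died):
    """Maximum-likelihood estimate of lambda for S(t) = exp(-lambda * t)."""
    # Observed deaths divided by the total observation time of all patients.
    return sum(died) / sum(durations)


def survival(t, lam):
    """Probability of still being alive at time t under the fitted model."""
    return math.exp(-lam * t)


lam = fit_exponential_rate(durations, died)
print(f"lambda = {lam:.3f}")
print(f"S(5) = {survival(5, lam):.3f}")
print(f"S(20) = {survival(20, lam):.3f}")  # extrapolates beyond the study
```

Note that the two patients who were still alive contribute their observation time to the denominator, so they are not thrown away the way they would be in Option One.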

There are multiple functions that are typically used in survival analysis other than the exponential function.

If you think of probability distributions, their CDF (cumulative distribution function) goes from 0 to 1. This is the opposite of survival functions, which go from 1 to 0. That’s why survival functions are defined as follows: S(t) = 1 − CDF(t). The following distributions are commonly used to derive survival functions from their CDFs: Weibull, Gamma, Log-Normal, and Log-Logistic.
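As a small illustration of the 1 − CDF relationship, here is a sketch using the Weibull distribution, written with the standard library only. The shape and scale values are arbitrary picks for the example, and the function names are hypothetical.

```python
import math


def weibull_cdf(t, shape, scale):
    """CDF of the Weibull distribution for t >= 0: goes from 0 to 1."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def weibull_survival(t, shape, scale):
    """Survival function derived from the CDF: goes from 1 to 0."""
    return 1.0 - weibull_cdf(t, shape, scale)


# Arbitrary parameters, chosen only to show the curve decreasing over time.
for t in [0, 1, 2, 5, 10]:
    print(t, round(weibull_survival(t, shape=1.5, scale=4.0), 3))
```

With shape = 1 the Weibull survival function reduces to the exponential one, with λ = 1 / scale, so the exponential model is a special case of this family.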

The main problem with this approach is that we make assumptions about the shape of the survival function. You already assumed that the survival function will look like an exponential decay, and you want to force that function to fit your data. What if it doesn’t fit?

Non-Parametric Survival Function

This other approach doesn’t make assumptions about the shape of the survival function. For example, in the Kaplan-Meier approach, we basically build a step function from the collected data points. But because it builds the function empirically, it cannot simply extrapolate beyond the collected data points. This is something parametric approaches can do.
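Here is a minimal sketch of the Kaplan-Meier idea in plain Python. At every time where a death is observed, the current survival estimate is multiplied by the fraction of at-risk patients who made it past that time. The data is made up, and kaplan_meier is a hypothetical helper written for this example rather than a library function.

```python
def kaplan_meier(durations, observed):
    """Return the steps [(time, S(time)), ...] of the Kaplan-Meier estimate.

    durations: how long each patient was followed.
    observed:  True if the death was seen, False if the patient was censored.
    """
    # The curve only steps down at times where a death was actually observed.
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Patients still being followed just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical study: two patients were still alive when it ended at 5 years.
durations = [2.0, 3.5, 1.0, 5.0, 5.0, 4.0]
observed = [True, True, True, False, False, True]

for t, s in kaplan_meier(durations, observed):
    print(f"t = {t}: S(t) = {s:.3f}")
```

The last step is at the last observed death. After the final follow-up time the estimate says nothing, which is the extrapolation limit described above.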

Besides those two statistical approaches, you can also think of machine learning as an alternative method. Any model that predicts a curve instead of a point estimate will do the trick for you.

Finally, what is this Survival Analysis good for?

Beyond Golfing and Dying Patients

You don’t have any patients to give meds to, and you are lucky enough not to have any fridges on your golf course. Why do you still need to learn about Survival Analysis?

The same concept can be used to deal with a lot of problems where the data is incomplete because you cannot wait long enough to observe the outcome for everyone. This problem is usually known as right censoring. For example, you want to measure the churn rate of your customers, the rate of returns of your items sold, or the lifetime value of your shoppers. In all these examples, you may wait ages to collect your target value, but the business won’t wait for you to come to your conclusions, and thus Survival Analysis is here to rescue you.
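For the churn example, the first practical step is usually to turn raw customer records into the two columns survival methods expect: a duration, and a flag saying whether the event was actually observed. The sketch below uses made-up customers and an assumed analysis date; to_survival_rows is a hypothetical helper name.

```python
from datetime import date

# Hypothetical customer records: (signup date, cancellation date or None).
customers = [
    (date(2023, 1, 10), date(2023, 6, 10)),
    (date(2023, 3, 1), None),
    (date(2023, 9, 15), date(2023, 12, 1)),
    (date(2024, 1, 5), None),
]
cutoff = date(2024, 3, 1)  # assumed day on which the analysis is run


def to_survival_rows(customers, cutoff):
    """Turn raw records into (duration_in_days, churn_observed) pairs."""
    rows = []
    for signup, cancelled in customers:
        # Customers who have not cancelled are only observed up to the cutoff.
        end = cancelled if cancelled is not None else cutoff
        rows.append(((end - signup).days, cancelled is not None))
    return rows


rows = to_survival_rows(customers, cutoff)
for duration, churned in rows:
    print(duration, "churned" if churned else "still active (right-censored)")
```

The customers who are still active are kept with their duration so far and flagged as censored, instead of being dropped or treated as if they had churned on the cutoff date.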
