Introduction to Survival Analysis

Survival analysis examines the time until an event of interest occurs. In bioinformatics and medical research, this typically involves analyzing patient survival, disease progression, or treatment response over time.

What is Survival Analysis?

Survival analysis addresses questions like:

How long do patients survive after diagnosis?
Does a specific gene expression level predict survival?
Which treatment leads to better outcomes?
What factors influence disease-free survival?

The term “survival analysis” comes from medical research, but the methods apply to any time-to-event data, such as time to tumor recurrence, time to achieve remission, or time to disease progression.

Key Concepts

Survival Time

The duration from a defined starting point to an event:

Start: Diagnosis, treatment initiation, birth
Event: Death, relapse, progression, cure
Time: Days, months, years

Censoring

A critical concept when the event hasn’t occurred for all subjects:


import pandas as pd
import numpy as np
 
# Example survival data
data = {
    'patient_id': ['P1', 'P2', 'P3', 'P4', 'P5'],
    'time': [24, 18, 36, 12, 30],
    'event': [1, 1, 0, 1, 0]  # 1 = event occurred, 0 = censored
}
survival_df = pd.DataFrame(data)
 
print(survival_df)

Types of Censoring:

Right Censoring (most common)
- Event hasn’t occurred by end of study
- Patient lost to follow-up
- Patient withdrew from study
Left Censoring (rare)
- Event occurred before study began
Interval Censoring
- Event occurred between two observation times

Survival Function S(t)

Probability of surviving beyond time t:

$$S(t) = P(T > t)$$

Where:

S(0) = 1 (everyone alive at start)
S(∞) = 0 (eventually everyone experiences event)
S(t) is non-increasing

Hazard Function h(t)

Instantaneous rate of event occurrence at time t:

The hazard represents the risk of event at each time point.

Setting Up Survival Analysis

Install Required Libraries


pip install lifelines pandas numpy matplotlib seaborn

Import Libraries


from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test, multivariate_logrank_test
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

Prepare Data

Survival data needs at minimum:

Duration: Time from start to event or censoring
Event: Binary indicator (1 = event, 0 = censored)


# Create sample clinical data
np.random.seed(42)
 
n_patients = 100
 
clinical_data = pd.DataFrame({
    'patient_id': [f'P{i:03d}' for i in range(n_patients)],
    'time': np.random.exponential(scale=20, size=n_patients),
    'event': np.random.binomial(1, 0.7, size=n_patients),
    'age': np.random.normal(60, 15, size=n_patients),
    'stage': np.random.choice(['I', 'II', 'III', 'IV'], size=n_patients),
    'treatment': np.random.choice(['A', 'B'], size=n_patients)
})
 
# Round time to integers (months)
clinical_data['time'] = clinical_data['time'].round().astype(int)
clinical_data['age'] = clinical_data['age'].round().astype(int)
 
print(clinical_data.head())
print(f"\nTotal patients: {len(clinical_data)}")
print(f"Events: {clinical_data['event'].sum()}")
print(f"Censored: {(1 - clinical_data['event']).sum()}")

Basic Survival Analysis

Kaplan-Meier Estimator

The most common non-parametric method for estimating survival function:


from lifelines import KaplanMeierFitter
 
# Initialize fitter
kmf = KaplanMeierFitter()
 
# Fit the data
kmf.fit(
    durations=clinical_data['time'],
    event_observed=clinical_data['event'],
    label='Overall Survival'
)
 
# Plot survival curve
plt.figure(figsize=(10, 6))
kmf.plot_survival_function()
plt.title('Kaplan-Meier Survival Curve')
plt.ylabel('Survival Probability')
plt.xlabel('Time (months)')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
 
# Print survival statistics
print(kmf.survival_function_.head(10))
print(f"\nMedian survival time: {kmf.median_survival_time_:.1f} months")

Comparing Groups

Compare survival between different groups (e.g., treatments):


# Separate by treatment
treatment_a = clinical_data[clinical_data['treatment'] == 'A']
treatment_b = clinical_data[clinical_data['treatment'] == 'B']
 
# Fit KM for each group
kmf_a = KaplanMeierFitter()
kmf_a.fit(treatment_a['time'], treatment_a['event'], label='Treatment A')
 
kmf_b = KaplanMeierFitter()
kmf_b.fit(treatment_b['time'], treatment_b['event'], label='Treatment B')
 
# Plot both
plt.figure(figsize=(10, 6))
kmf_a.plot_survival_function()
kmf_b.plot_survival_function()
plt.title('Survival by Treatment Group')
plt.ylabel('Survival Probability')
plt.xlabel('Time (months)')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
 
# Statistical test (Log-rank test)
from lifelines.statistics import logrank_test
 
results = logrank_test(
    treatment_a['time'], treatment_b['time'],
    treatment_a['event'], treatment_b['event']
)
 
print(f"\nLog-rank test p-value: {results.p_value:.4f}")
if results.p_value < 0.05:
    print("Significant difference between groups")
else:
    print("No significant difference between groups")

Gene Expression and Survival

A common bioinformatics application is linking gene expression to survival:


# Simulate gene expression data
clinical_data['gene_expr'] = np.random.normal(5, 2, size=n_patients)
 
# Stratify by gene expression (high vs low)
median_expr = clinical_data['gene_expr'].median()
clinical_data['gene_group'] = clinical_data['gene_expr'].apply(
    lambda x: 'High' if x > median_expr else 'Low'
)
 
# Compare survival between high and low expression
high_expr = clinical_data[clinical_data['gene_group'] == 'High']
low_expr = clinical_data[clinical_data['gene_group'] == 'Low']
 
kmf_high = KaplanMeierFitter()
kmf_high.fit(high_expr['time'], high_expr['event'], label='High Expression')
 
kmf_low = KaplanMeierFitter()
kmf_low.fit(low_expr['time'], low_expr['event'], label='Low Expression')
 
# Plot
plt.figure(figsize=(10, 6))
kmf_high.plot_survival_function()
kmf_low.plot_survival_function()
plt.title('Survival by Gene Expression Level')
plt.ylabel('Survival Probability')
plt.xlabel('Time (months)')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
 
# Test difference
results = logrank_test(
    high_expr['time'], low_expr['time'],
    high_expr['event'], low_expr['event']
)
print(f"Gene expression p-value: {results.p_value:.4f}")

Understanding Results

Survival Probability

At any time point, you can extract the survival probability:


# Survival at specific time points
time_points = [12, 24, 36, 48, 60]  # months
 
for t in time_points:
    survival_prob = kmf.predict(t)
    print(f"Survival probability at {t} months: {survival_prob:.2%}")

Median Survival Time

The time at which 50% of subjects have experienced the event:


print(f"Median survival: {kmf.median_survival_time_:.1f} months")
 
# Confidence interval
confidence_interval = kmf.confidence_interval_survival_function_
print(f"95% CI: [{confidence_interval.iloc[0, 0]:.1f}, "
      f"{confidence_interval.iloc[0, 1]:.1f}]")

Hazard Ratios

Relative risk between groups (covered in Cox Regression):


from lifelines import CoxPHFitter
 
# Prepare data for Cox regression
cox_data = clinical_data[['time', 'event', 'age', 'gene_expr']].copy()
 
# Fit Cox model
cph = CoxPHFitter()
cph.fit(cox_data, duration_col='time', event_col='event')
 
# Print hazard ratios
print("\nHazard Ratios:")
print(cph.summary[['exp(coef)', 'exp(coef) lower 95%', 
                   'exp(coef) upper 95%', 'p']])

Important Considerations:

Ensure adequate follow-up time
Account for competing risks if applicable
Verify proportional hazards assumption for Cox models
Consider sample size for statistical power

Common Applications

Clinical Research

Patient survival after cancer diagnosis
Time to disease recurrence
Treatment efficacy comparison

Genomics

Gene expression signatures predicting survival
Biomarker discovery
Risk stratification

Drug Development

Time to treatment failure
Duration of response
Adverse event timing

Next Steps

Kaplan-Meier Curves - Detailed guide to KM estimation
Cox Regression - Multivariate survival analysis
Survival Visualization - Advanced plotting techniques