Introduction to Survival Analysis
Survival analysis examines the time until an event of interest occurs. In bioinformatics and medical research, this typically involves analyzing patient survival, disease progression, or treatment response over time.
What is Survival Analysis?
Survival analysis addresses questions like:
- How long do patients survive after diagnosis?
- Does a specific gene expression level predict survival?
- Which treatment leads to better outcomes?
- What factors influence disease-free survival?
The term “survival analysis” comes from medical research, but the methods apply to any time-to-event data, such as time to tumor recurrence, time to achieve remission, or time to disease progression.
Key Concepts
Survival Time
The duration from a defined starting point to an event:
- Start: Diagnosis, treatment initiation, birth
- Event: Death, relapse, progression, cure
- Time: Days, months, years
Censoring
A critical concept when the event hasn’t occurred for all subjects:
import pandas as pd
import numpy as np
# Example survival data
data = {
'patient_id': ['P1', 'P2', 'P3', 'P4', 'P5'],
'time': [24, 18, 36, 12, 30],
'event': [1, 1, 0, 1, 0] # 1 = event occurred, 0 = censored
}
survival_df = pd.DataFrame(data)
print(survival_df)Types of Censoring:
-
Right Censoring (most common)
- Event hasn’t occurred by end of study
- Patient lost to follow-up
- Patient withdrew from study
-
Left Censoring (rare)
- Event occurred before study began
-
Interval Censoring
- Event occurred between two observation times
Survival Function S(t)
Probability of surviving beyond time t:
$$S(t) = P(T > t)$$
Where:
- S(0) = 1 (everyone alive at start)
- S(∞) = 0 (eventually everyone experiences event)
- S(t) is non-increasing
Hazard Function h(t)
Instantaneous rate of event occurrence at time t:
The hazard represents the risk of event at each time point.
Setting Up Survival Analysis
Install Required Libraries
pip install lifelines pandas numpy matplotlib seabornImport Libraries
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test, multivariate_logrank_test
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)Prepare Data
Survival data needs at minimum:
- Duration: Time from start to event or censoring
- Event: Binary indicator (1 = event, 0 = censored)
# Create sample clinical data
np.random.seed(42)
n_patients = 100
clinical_data = pd.DataFrame({
'patient_id': [f'P{i:03d}' for i in range(n_patients)],
'time': np.random.exponential(scale=20, size=n_patients),
'event': np.random.binomial(1, 0.7, size=n_patients),
'age': np.random.normal(60, 15, size=n_patients),
'stage': np.random.choice(['I', 'II', 'III', 'IV'], size=n_patients),
'treatment': np.random.choice(['A', 'B'], size=n_patients)
})
# Round time to integers (months)
clinical_data['time'] = clinical_data['time'].round().astype(int)
clinical_data['age'] = clinical_data['age'].round().astype(int)
print(clinical_data.head())
print(f"\nTotal patients: {len(clinical_data)}")
print(f"Events: {clinical_data['event'].sum()}")
print(f"Censored: {(1 - clinical_data['event']).sum()}")Basic Survival Analysis
Kaplan-Meier Estimator
The most common non-parametric method for estimating survival function:
from lifelines import KaplanMeierFitter
# Initialize fitter
kmf = KaplanMeierFitter()
# Fit the data
kmf.fit(
durations=clinical_data['time'],
event_observed=clinical_data['event'],
label='Overall Survival'
)
# Plot survival curve
plt.figure(figsize=(10, 6))
kmf.plot_survival_function()
plt.title('Kaplan-Meier Survival Curve')
plt.ylabel('Survival Probability')
plt.xlabel('Time (months)')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# Print survival statistics
print(kmf.survival_function_.head(10))
print(f"\nMedian survival time: {kmf.median_survival_time_:.1f} months")Comparing Groups
Compare survival between different groups (e.g., treatments):
# Separate by treatment
treatment_a = clinical_data[clinical_data['treatment'] == 'A']
treatment_b = clinical_data[clinical_data['treatment'] == 'B']
# Fit KM for each group
kmf_a = KaplanMeierFitter()
kmf_a.fit(treatment_a['time'], treatment_a['event'], label='Treatment A')
kmf_b = KaplanMeierFitter()
kmf_b.fit(treatment_b['time'], treatment_b['event'], label='Treatment B')
# Plot both
plt.figure(figsize=(10, 6))
kmf_a.plot_survival_function()
kmf_b.plot_survival_function()
plt.title('Survival by Treatment Group')
plt.ylabel('Survival Probability')
plt.xlabel('Time (months)')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# Statistical test (Log-rank test)
from lifelines.statistics import logrank_test
results = logrank_test(
treatment_a['time'], treatment_b['time'],
treatment_a['event'], treatment_b['event']
)
print(f"\nLog-rank test p-value: {results.p_value:.4f}")
if results.p_value < 0.05:
print("Significant difference between groups")
else:
print("No significant difference between groups")Gene Expression and Survival
A common bioinformatics application is linking gene expression to survival:
# Simulate gene expression data
clinical_data['gene_expr'] = np.random.normal(5, 2, size=n_patients)
# Stratify by gene expression (high vs low)
median_expr = clinical_data['gene_expr'].median()
clinical_data['gene_group'] = clinical_data['gene_expr'].apply(
lambda x: 'High' if x > median_expr else 'Low'
)
# Compare survival between high and low expression
high_expr = clinical_data[clinical_data['gene_group'] == 'High']
low_expr = clinical_data[clinical_data['gene_group'] == 'Low']
kmf_high = KaplanMeierFitter()
kmf_high.fit(high_expr['time'], high_expr['event'], label='High Expression')
kmf_low = KaplanMeierFitter()
kmf_low.fit(low_expr['time'], low_expr['event'], label='Low Expression')
# Plot
plt.figure(figsize=(10, 6))
kmf_high.plot_survival_function()
kmf_low.plot_survival_function()
plt.title('Survival by Gene Expression Level')
plt.ylabel('Survival Probability')
plt.xlabel('Time (months)')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# Test difference
results = logrank_test(
high_expr['time'], low_expr['time'],
high_expr['event'], low_expr['event']
)
print(f"Gene expression p-value: {results.p_value:.4f}")Understanding Results
Survival Probability
At any time point, you can extract the survival probability:
# Survival at specific time points
time_points = [12, 24, 36, 48, 60] # months
for t in time_points:
survival_prob = kmf.predict(t)
print(f"Survival probability at {t} months: {survival_prob:.2%}")Median Survival Time
The time at which 50% of subjects have experienced the event:
print(f"Median survival: {kmf.median_survival_time_:.1f} months")
# Confidence interval
confidence_interval = kmf.confidence_interval_survival_function_
print(f"95% CI: [{confidence_interval.iloc[0, 0]:.1f}, "
f"{confidence_interval.iloc[0, 1]:.1f}]")Hazard Ratios
Relative risk between groups (covered in Cox Regression):
from lifelines import CoxPHFitter
# Prepare data for Cox regression
cox_data = clinical_data[['time', 'event', 'age', 'gene_expr']].copy()
# Fit Cox model
cph = CoxPHFitter()
cph.fit(cox_data, duration_col='time', event_col='event')
# Print hazard ratios
print("\nHazard Ratios:")
print(cph.summary[['exp(coef)', 'exp(coef) lower 95%',
'exp(coef) upper 95%', 'p']])Important Considerations:
- Ensure adequate follow-up time
- Account for competing risks if applicable
- Verify proportional hazards assumption for Cox models
- Consider sample size for statistical power
Common Applications
Clinical Research
- Patient survival after cancer diagnosis
- Time to disease recurrence
- Treatment efficacy comparison
Genomics
- Gene expression signatures predicting survival
- Biomarker discovery
- Risk stratification
Drug Development
- Time to treatment failure
- Duration of response
- Adverse event timing
Next Steps
- Kaplan-Meier Curves - Detailed guide to KM estimation
- Cox Regression - Multivariate survival analysis
- Survival Visualization - Advanced plotting techniques