Introduction to Survival Analysis

Survival analysis examines the time until an event of interest occurs. In bioinformatics and medical research, this typically involves analyzing patient survival, disease progression, or treatment response over time.

What is Survival Analysis?

Survival analysis addresses questions like:

  • How long do patients survive after diagnosis?
  • Does a specific gene expression level predict survival?
  • Which treatment leads to better outcomes?
  • What factors influence disease-free survival?

The term “survival analysis” comes from medical research, but the methods apply to any time-to-event data, such as time to tumor recurrence, time to achieve remission, or time to disease progression.

Key Concepts

Survival Time

The duration from a defined starting point to an event:

  • Start: Diagnosis, treatment initiation, birth
  • Event: Death, relapse, progression, cure
  • Time: Days, months, years
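Survival time is rarely recorded directly; it is usually derived from two dates. A minimal sketch, assuming hypothetical start and last-contact dates and an average month length of 30.44 days (the helper `duration_in_months` is illustrative, not part of any library):

```python
from datetime import date

# Hypothetical records: (patient_id, start date, date of event or last contact)
records = [
    ("P1", date(2020, 1, 15), date(2022, 1, 15)),
    ("P2", date(2020, 3, 1), date(2021, 9, 1)),
]

DAYS_PER_MONTH = 30.44  # average month length; an assumption for unit conversion


def duration_in_months(start, end):
    """Return the elapsed time between two dates in approximate months."""
    if end < start:
        raise ValueError("end date precedes start date")
    return (end - start).days / DAYS_PER_MONTH


durations = {pid: round(duration_in_months(s, e), 1) for pid, s, e in records}
print(durations)
```

Whatever unit is chosen, it should be applied consistently to every subject.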

Censoring

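Censoring occurs when the exact event time is not observed for every subject. A censored subject is known to have been event-free up to the last follow-up, so the record still carries information and should not be dropped: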

    import pandas as pd
    import numpy as np

    # Example survival data
    data = {
        'patient_id': ['P1', 'P2', 'P3', 'P4', 'P5'],
        'time': [24, 18, 36, 12, 30],
        'event': [1, 1, 0, 1, 0]  # 1 = event occurred, 0 = censored
    }

    survival_df = pd.DataFrame(data)
    print(survival_df)

Types of Censoring:

  1. Right Censoring (most common)

    • Event hasn’t occurred by end of study
    • Patient lost to follow-up
    • Patient withdrew from study
  2. Left Censoring (rare)

    • Event occurred before study began
  3. Interval Censoring

    • Event occurred between two observation times
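Right censoring can be expressed as observing the smaller of the true event time and the censoring time. A minimal sketch with hypothetical values (the true event times of censored subjects are unknown in a real study, and `apply_right_censoring` is an illustrative helper):

```python
# Hypothetical true event times (months since enrollment); unknown in practice
true_event_times = [12, 40, 18, 55, 24]
# Hypothetical follow-up available for each patient before observation stopped
follow_up_limits = [36, 36, 36, 30, 36]


def apply_right_censoring(event_times, censor_times):
    """Return (observed_time, event_indicator) pairs.

    observed_time   = min(event time, censoring time)
    event_indicator = 1 if the event was observed, 0 if the record is censored
    """
    observed = []
    for t, c in zip(event_times, censor_times):
        observed.append((min(t, c), 1 if t <= c else 0))
    return observed


observed = apply_right_censoring(true_event_times, follow_up_limits)
print(observed)
```

The resulting pairs have the same shape as the `time` and `event` columns used throughout this page.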

Survival Function S(t)

Probability of surviving beyond time t:

$$S(t) = P(T > t)$$

Where:

  • S(0) = 1 (everyone is event-free at the start)
  • S(t) → 0 as t → ∞ (assuming every subject eventually experiences the event)
  • S(t) is non-increasing
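When no observations are censored, S(t) can be estimated as the fraction of subjects whose event time exceeds t. A minimal sketch with hypothetical, fully observed event times (censored data need the Kaplan-Meier estimator instead):

```python
def empirical_survival(times, t):
    """Fraction of subjects whose event time exceeds t (no censoring assumed)."""
    return sum(1 for x in times if x > t) / len(times)


event_times = [5, 8, 12, 12, 20]  # hypothetical, fully observed event times
curve = [empirical_survival(event_times, t) for t in (0, 5, 10, 12, 25)]
print(curve)
```

The values start at 1, never increase, and reach 0 once every subject has experienced the event.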

Hazard Function h(t)

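Instantaneous rate of event occurrence at time t, given survival up to t:

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$$

The hazard is the event rate at time t among subjects who are still at risk, that is, who have not yet experienced the event. It is a rate rather than a probability, so it can exceed 1.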
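The hazard can equivalently be written as h(t) = f(t) / S(t), where f(t) is the probability density of the event time. A minimal sketch, assuming exponential and Weibull event times with illustrative parameters, contrasts a constant hazard with an increasing one:

```python
import math


def exponential_hazard(t, rate):
    """h(t) = f(t) / S(t) for an exponential distribution."""
    density = rate * math.exp(-rate * t)
    survival = math.exp(-rate * t)
    return density / survival


def weibull_hazard(t, shape, scale):
    """h(t) = f(t) / S(t) for a Weibull distribution."""
    survival = math.exp(-((t / scale) ** shape))
    density = (shape / scale) * (t / scale) ** (shape - 1) * survival
    return density / survival


times = [1, 5, 10, 20]
exp_h = [exponential_hazard(t, rate=0.05) for t in times]
wei_h = [weibull_hazard(t, shape=2.0, scale=20.0) for t in times]
print(exp_h)  # the same value at every time point
print(wei_h)  # grows with time because shape > 1
```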

Setting Up Survival Analysis

Install Required Libraries

pip install lifelines pandas numpy matplotlib seaborn

Import Libraries

    from lifelines import KaplanMeierFitter, CoxPHFitter
    from lifelines.statistics import logrank_test, multivariate_logrank_test
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Set style
    sns.set_style('whitegrid')
    plt.rcParams['figure.figsize'] = (10, 6)

Prepare Data

Survival data needs at minimum:

  • Duration: Time from start to event or censoring
  • Event: Binary indicator (1 = event, 0 = censored)

    # Create sample clinical data
    np.random.seed(42)
    n_patients = 100

    clinical_data = pd.DataFrame({
        'patient_id': [f'P{i:03d}' for i in range(n_patients)],
        'time': np.random.exponential(scale=20, size=n_patients),
        'event': np.random.binomial(1, 0.7, size=n_patients),
        'age': np.random.normal(60, 15, size=n_patients),
        'stage': np.random.choice(['I', 'II', 'III', 'IV'], size=n_patients),
        'treatment': np.random.choice(['A', 'B'], size=n_patients)
    })

    # Round time to integers (months)
    clinical_data['time'] = clinical_data['time'].round().astype(int)
    clinical_data['age'] = clinical_data['age'].round().astype(int)

    print(clinical_data.head())
    print(f"\nTotal patients: {len(clinical_data)}")
    print(f"Events: {clinical_data['event'].sum()}")
    print(f"Censored: {(1 - clinical_data['event']).sum()}")
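Before fitting any model it is worth checking that the two required columns are well formed. A minimal sketch using only the standard library; `validate_survival_data` is a hypothetical helper, and the checks shown (missing or negative durations, non-binary event codes) are a starting point rather than a complete list:

```python
import math


def validate_survival_data(durations, events):
    """Return a list of problems found in the duration and event columns."""
    durations, events = list(durations), list(events)
    problems = []
    if len(durations) != len(events):
        problems.append("durations and events differ in length")
    for i, d in enumerate(durations):
        if d is None or (isinstance(d, float) and math.isnan(d)):
            problems.append(f"row {i}: missing duration")
        elif d < 0:
            problems.append(f"row {i}: negative duration {d}")
    for i, e in enumerate(events):
        if e not in (0, 1):
            problems.append(f"row {i}: event must be 0 or 1, got {e!r}")
    return problems


print(validate_survival_data([24, 18, -3], [1, 0, 2]))
```

With a DataFrame such as `clinical_data`, the call would be `validate_survival_data(clinical_data['time'], clinical_data['event'])`; an empty list means no problems were found.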

Basic Survival Analysis

Kaplan-Meier Estimator

The Kaplan-Meier estimator is the most common non-parametric method for estimating the survival function:

    from lifelines import KaplanMeierFitter

    # Initialize fitter
    kmf = KaplanMeierFitter()

    # Fit the data
    kmf.fit(
        durations=clinical_data['time'],
        event_observed=clinical_data['event'],
        label='Overall Survival'
    )

    # Plot survival curve
    plt.figure(figsize=(10, 6))
    kmf.plot_survival_function()
    plt.title('Kaplan-Meier Survival Curve')
    plt.ylabel('Survival Probability')
    plt.xlabel('Time (months)')
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()

    # Print survival statistics
    print(kmf.survival_function_.head(10))
    print(f"\nMedian survival time: {kmf.median_survival_time_:.1f} months")
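The estimator itself is the product-limit formula: at each distinct event time t_i, with n_i subjects still at risk and d_i events, the running estimate is multiplied by (1 − d_i / n_i). A minimal sketch in plain Python, applied to the five-patient example from the Censoring section (the function `kaplan_meier` is illustrative and is not how lifelines is implemented):

```python
def kaplan_meier(durations, events):
    """Return [(event_time, survival_probability), ...] via the product-limit formula."""
    survival = 1.0
    curve = []
    # Only times at which an event was observed change the estimate
    for t in sorted({d for d, e in zip(durations, events) if e == 1}):
        at_risk = sum(1 for d in durations if d >= t)             # n_i
        n_events = sum(1 for d, e in zip(durations, events)
                       if d == t and e == 1)                      # d_i
        survival *= 1 - n_events / at_risk
        curve.append((t, survival))
    return curve


# Five-patient example: 1 = event occurred, 0 = censored
times = [24, 18, 36, 12, 30]
events = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(times, events)
for t, s in km_curve:
    print(f"t = {t:>2}: S(t) = {s:.2f}")
```

The censored patients (30 and 36 months) never trigger a drop in the curve, but they do count in the at-risk denominators until their censoring times.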

Comparing Groups

Compare survival between different groups (e.g., treatments):

    # Separate by treatment
    treatment_a = clinical_data[clinical_data['treatment'] == 'A']
    treatment_b = clinical_data[clinical_data['treatment'] == 'B']

    # Fit KM for each group
    kmf_a = KaplanMeierFitter()
    kmf_a.fit(treatment_a['time'], treatment_a['event'], label='Treatment A')

    kmf_b = KaplanMeierFitter()
    kmf_b.fit(treatment_b['time'], treatment_b['event'], label='Treatment B')

    # Plot both curves on the same axes
    fig, ax = plt.subplots(figsize=(10, 6))
    kmf_a.plot_survival_function(ax=ax)
    kmf_b.plot_survival_function(ax=ax)
    plt.title('Survival by Treatment Group')
    plt.ylabel('Survival Probability')
    plt.xlabel('Time (months)')
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()

    # Statistical test (Log-rank test)
    from lifelines.statistics import logrank_test

    results = logrank_test(
        treatment_a['time'],
        treatment_b['time'],
        event_observed_A=treatment_a['event'],
        event_observed_B=treatment_b['event']
    )

    print(f"\nLog-rank test p-value: {results.p_value:.4f}")
    if results.p_value < 0.05:
        print("Significant difference between groups")
    else:
        print("No significant difference between groups")
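The log-rank test compares, at every event time, the number of events observed in one group with the number expected if both groups shared the same hazard. A minimal sketch of the standard two-group statistic in plain Python, with hypothetical data; `logrank_two_groups` is an illustrative helper and makes no claim to match the lifelines implementation detail for detail:

```python
import math


def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]

    observed_a = expected_a = variance = 0.0
    for t in sorted({t for t, e, _ in pooled if e == 1}):
        n = sum(1 for u, _, _ in pooled if u >= t)                # at risk, both groups
        n_a = sum(1 for u, _, g in pooled if u >= t and g == 0)   # at risk, group A
        d = sum(1 for u, e, _ in pooled if u == t and e == 1)     # events at t
        d_a = sum(1 for u, e, g in pooled if u == t and e == 1 and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:  # hypergeometric variance of d_a
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail probability, 1 df
    return chi2, p_value


# Hypothetical data: group B tends to fail later than group A
chi2, p = logrank_two_groups(
    [5, 8, 12, 15, 20], [1, 1, 1, 0, 1],
    [10, 18, 25, 30, 34], [1, 0, 1, 1, 0],
)
print(f"chi-square = {chi2:.3f}, p-value = {p:.4f}")
```

A p-value above the chosen threshold means the data do not show a difference; it does not demonstrate that the groups are equivalent.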

Gene Expression and Survival

A common bioinformatics application is linking gene expression to survival:

    # Simulate gene expression data
    clinical_data['gene_expr'] = np.random.normal(5, 2, size=n_patients)

    # Stratify by gene expression (high vs low)
    median_expr = clinical_data['gene_expr'].median()
    clinical_data['gene_group'] = clinical_data['gene_expr'].apply(
        lambda x: 'High' if x > median_expr else 'Low'
    )

    # Compare survival between high and low expression
    high_expr = clinical_data[clinical_data['gene_group'] == 'High']
    low_expr = clinical_data[clinical_data['gene_group'] == 'Low']

    kmf_high = KaplanMeierFitter()
    kmf_high.fit(high_expr['time'], high_expr['event'], label='High Expression')

    kmf_low = KaplanMeierFitter()
    kmf_low.fit(low_expr['time'], low_expr['event'], label='Low Expression')

    # Plot both curves on the same axes
    fig, ax = plt.subplots(figsize=(10, 6))
    kmf_high.plot_survival_function(ax=ax)
    kmf_low.plot_survival_function(ax=ax)
    plt.title('Survival by Gene Expression Level')
    plt.ylabel('Survival Probability')
    plt.xlabel('Time (months)')
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()

    # Test difference
    results = logrank_test(
        high_expr['time'],
        low_expr['time'],
        event_observed_A=high_expr['event'],
        event_observed_B=low_expr['event']
    )
    print(f"Gene expression p-value: {results.p_value:.4f}")

Understanding Results

Survival Probability

At any time point, you can extract the survival probability:

    # Survival at specific time points
    time_points = [12, 24, 36, 48, 60]  # months

    for t in time_points:
        survival_prob = kmf.predict(t)
        print(f"Survival probability at {t} months: {survival_prob:.2%}")
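A Kaplan-Meier curve is a step function: it only changes at observed event times, so the survival probability at an arbitrary time is the value at the most recent event time. A minimal sketch with hypothetical step values (they correspond to the five-patient example in the Censoring section); `survival_probability` is an illustrative helper:

```python
from bisect import bisect_right

# Hypothetical Kaplan-Meier step function, listed at its event times
event_times = [12, 18, 24]
survival_at_event = [0.8, 0.6, 0.4]


def survival_probability(t):
    """S(t) for a right-continuous step function; S(t) = 1 before the first event."""
    idx = bisect_right(event_times, t)  # number of event times <= t
    return 1.0 if idx == 0 else survival_at_event[idx - 1]


for t in (6, 12, 20, 30):
    print(f"S({t}) = {survival_probability(t):.2f}")
```

Queries beyond the last follow-up time are extrapolations and should be read with caution.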

Median Survival Time

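The median survival time is the earliest time at which the estimated survival probability drops to 50% or below. Its confidence interval is derived from the confidence bands of the survival function:

    from lifelines.utils import median_survival_times

    print(f"Median survival: {kmf.median_survival_time_:.1f} months")

    # Confidence interval for the median (not for S(t) at time 0)
    median_ci = median_survival_times(kmf.confidence_interval_)
    print(f"95% CI: [{median_ci.iloc[0, 0]:.1f}, "
          f"{median_ci.iloc[0, 1]:.1f}]")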
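The median can be read directly off a Kaplan-Meier step function. A minimal sketch with hypothetical curve values; returning infinity when the curve never reaches 0.5 is a choice made for this illustration:

```python
import math

# Hypothetical Kaplan-Meier step function: (event time, S(t)) pairs
km_steps = [(12, 0.8), (18, 0.6), (24, 0.4)]


def median_survival_time(curve):
    """Smallest event time at which S(t) <= 0.5; infinity if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return math.inf


median_reached = median_survival_time(km_steps)
median_not_reached = median_survival_time([(12, 0.8), (18, 0.6)])
print(median_reached)      # 24
print(median_not_reached)  # inf
```

When more than half of the subjects are still event-free at the end of follow-up, the median cannot be estimated from the data and is usually reported as "not reached".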

Hazard Ratios

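A hazard ratio compares the hazard between two groups, or per one-unit increase in a covariate; a value above 1 indicates a higher event rate (covered in Cox Regression):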

    from lifelines import CoxPHFitter

    # Prepare data for Cox regression
    cox_data = clinical_data[['time', 'event', 'age', 'gene_expr']].copy()

    # Fit Cox model
    cph = CoxPHFitter()
    cph.fit(cox_data, duration_col='time', event_col='event')

    # Print hazard ratios
    print("\nHazard Ratios:")
    print(cph.summary[['exp(coef)', 'exp(coef) lower 95%', 'exp(coef) upper 95%', 'p']])
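A Cox coefficient is on the log-hazard scale, so the hazard ratio is exp(coef), and the ratio for a k-unit increase in a covariate is exp(k × coef). A minimal sketch with a hypothetical coefficient (the value 0.03 is made up for illustration and is not a fitted result):

```python
import math

coef_age = 0.03  # hypothetical log-hazard coefficient per year of age


def hazard_ratio(coef, units=1.0):
    """Hazard ratio for a `units`-sized increase in the covariate."""
    return math.exp(coef * units)


hr_1yr = hazard_ratio(coef_age)
hr_10yr = hazard_ratio(coef_age, units=10)
print(f"HR per year: {hr_1yr:.3f}")
print(f"HR per 10 years: {hr_10yr:.3f}")
print(f"Change in hazard per 10 years: {(hr_10yr - 1):+.1%}")
```

Rescaling this way often makes a continuous covariate easier to interpret than the per-unit ratio, which can sit very close to 1.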

Important Considerations:

  • Ensure adequate follow-up time
  • Account for competing risks if applicable
  • Verify proportional hazards assumption for Cox models
  • Consider sample size for statistical power
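For time-to-event models, statistical power depends mainly on the number of observed events rather than the number of subjects. A minimal sketch that computes events per covariate from hypothetical event indicators; a commonly quoted (and debated) rule of thumb asks for roughly 10 events per covariate, which is a heuristic and not a formal power calculation:

```python
def events_per_variable(events, n_covariates):
    """Number of observed events divided by the number of model covariates."""
    if n_covariates <= 0:
        raise ValueError("n_covariates must be positive")
    return sum(events) / n_covariates


events = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # hypothetical event indicators
epv = events_per_variable(events, n_covariates=2)
print(f"Events per variable: {epv:.1f}")
```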

Common Applications

Clinical Research

  • Patient survival after cancer diagnosis
  • Time to disease recurrence
  • Treatment efficacy comparison

Genomics

  • Gene expression signatures predicting survival
  • Biomarker discovery
  • Risk stratification

Drug Development

  • Time to treatment failure
  • Duration of response
  • Adverse event timing
