# Essential Libraries Overview
Python’s strength in bioinformatics comes from its rich ecosystem of specialized libraries. This guide introduces the core tools you’ll use for biological data analysis.
## Core Bioinformatics

### Biopython

The foundational library for computational biology.

```python
from Bio import SeqIO, Entrez, Seq, AlignIO
from Bio.Blast import NCBIWWW, NCBIXML
```

Key Features:
- Sequence manipulation and analysis
- File format parsing (FASTA, GenBank, PDB)
- BLAST searches
- Multiple sequence alignment
- Phylogenetic analysis
- Accessing biological databases (NCBI, UniProt)
Example:

```python
from Bio.Seq import Seq
from Bio import SeqIO

# Create and manipulate sequences
dna_seq = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print(f"Original: {dna_seq}")
print(f"Complement: {dna_seq.complement()}")
print(f"Reverse complement: {dna_seq.reverse_complement()}")
print(f"Translation: {dna_seq.translate()}")

# Parse a FASTA file
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"ID: {record.id}")
    print(f"Sequence: {record.seq[:50]}...")
    print(f"Length: {len(record.seq)}")
```

Use Cases:
- Reading and writing sequence files
- Calculating GC content
- Finding open reading frames
- Sequence alignment
- Phylogenetic tree construction
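Two of these use cases, GC content and reading-frame translation, take only a few lines. This sketch assumes Biopython 1.80 or newer, where `gc_fraction` replaced the older `GC` helper (which returned a percentage rather than a fraction):

```python
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction  # Biopython >= 1.80; older releases use GC

seq = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# GC content as a fraction (multiply by 100 for percent)
gc = gc_fraction(seq)
print(f"GC content: {gc:.2%}")

# Translate the first reading frame up to the first stop codon
peptide = seq.translate(to_stop=True)
print(f"Peptide: {peptide}")
```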
## Data Manipulation

### Pandas

Essential for handling tabular biological data.

```python
import pandas as pd
```

Key Features:
- DataFrames for gene expression matrices
- Efficient data filtering and selection
- Group-by operations
- Merging and joining datasets
- Statistical operations
- Time series analysis
Example:

```python
import pandas as pd

# Load gene expression data (genes x samples)
expr_df = pd.read_csv('gene_expression.csv', index_col=0)

# Basic operations
print(f"Shape: {expr_df.shape}")
print(f"Genes: {expr_df.shape[0]}, Samples: {expr_df.shape[1]}")

# Filter highly expressed genes
mean_expr = expr_df.mean(axis=1)
high_expr_genes = expr_df[mean_expr > mean_expr.quantile(0.75)]

# Calculate correlation between samples
sample_corr = expr_df.corr()

# Group by metadata
metadata = pd.read_csv('metadata.csv')
expr_with_meta = expr_df.T.merge(metadata, left_index=True, right_on='sample')
# numeric_only avoids errors on the string columns in recent pandas
grouped = expr_with_meta.groupby('condition').mean(numeric_only=True)
```

Use Cases:
- Gene expression matrices
- Clinical data management
- Differential expression results
- Pathway enrichment tables
- Metadata organization
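For the differential-expression use case, a common post-processing step filters a results table by adjusted p-value and effect size. The column names below (`log2FC`, `padj`) are illustrative, not from any specific tool, and the values are simulated:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Illustrative differential-expression results table
results = pd.DataFrame({
    'gene': [f'GENE{i}' for i in range(1000)],
    'log2FC': rng.normal(0, 2, 1000),
    'padj': rng.uniform(0, 1, 1000),
}).set_index('gene')

# Keep significant, strongly changed genes
sig = results[(results['padj'] < 0.05) & (results['log2FC'].abs() > 1)]

# Split into up- and down-regulated sets
up = sig[sig['log2FC'] > 0].sort_values('log2FC', ascending=False)
down = sig[sig['log2FC'] < 0].sort_values('log2FC')
print(f"{len(sig)} significant genes ({len(up)} up, {len(down)} down)")
```

The same boolean-mask pattern works unchanged on real output from DESeq2, limma, or similar tools once loaded into a DataFrame.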
### NumPy

Fundamental numerical computing library.

```python
import numpy as np
```

Key Features:
- Fast array operations
- Mathematical functions
- Linear algebra
- Random number generation
- Broadcasting for efficient computation
Example:

```python
import numpy as np

# Simulate an expression matrix (genes x samples)
genes = 1000
samples = 50
expr_matrix = np.random.lognormal(mean=5, sigma=2, size=(genes, samples))

# Log transform
log_expr = np.log2(expr_matrix + 1)

# Per-gene z-scores on the log scale
z_scores = (log_expr - log_expr.mean(axis=1, keepdims=True)) / \
           log_expr.std(axis=1, keepdims=True)

# Flag outliers
outliers = np.abs(z_scores) > 3
print(f"Outlier values: {outliers.sum()}")

# Sample-sample correlation matrix
corr_matrix = np.corrcoef(expr_matrix.T)
```

## Statistical Analysis
### SciPy

Scientific computing and statistical tests.

```python
from scipy import stats
from scipy.cluster.hierarchy import linkage, dendrogram
```

Key Features:
- Statistical distributions
- Hypothesis testing
- Clustering algorithms
- Signal processing
- Optimization
- Interpolation
Example:

```python
import numpy as np
from scipy import stats
from scipy.stats import false_discovery_control  # SciPy >= 1.11

# Generate sample data
control = np.random.normal(100, 15, 30)
treatment = np.random.normal(110, 15, 30)

# T-test
t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Mann-Whitney U test (non-parametric)
u_stat, p_value_mw = stats.mannwhitneyu(treatment, control)
print(f"Mann-Whitney U: {u_stat:.4f}")
print(f"P-value: {p_value_mw:.4f}")

# Correlation
gene_a = np.random.randn(100)
gene_b = gene_a * 0.8 + np.random.randn(100) * 0.2
corr, p_val = stats.pearsonr(gene_a, gene_b)
print(f"Correlation: {corr:.4f} (p={p_val:.4f})")

# Multiple testing correction (Benjamini-Hochberg)
p_values = np.random.uniform(0, 0.1, 1000)
adjusted = false_discovery_control(p_values)  # returns adjusted p-values only
rejected = adjusted < 0.05
```

Use Cases:
- Differential expression testing
- Correlation analysis
- Distribution fitting
- Hierarchical clustering
- Multiple testing correction
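The clustering imports shown earlier (`linkage`, `dendrogram`) pair naturally with `fcluster`, which cuts a tree into flat clusters. The expression data below is synthetic, constructed as two well-separated sample groups:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Two synthetic sample groups with different mean expression (samples x genes)
group1 = rng.normal(5, 1, size=(10, 100))
group2 = rng.normal(8, 1, size=(10, 100))
samples = np.vstack([group1, group2])

# Average-linkage hierarchical clustering on sample-sample distances
Z = linkage(samples, method='average', metric='euclidean')

# Cut the tree into two flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```

Passing `Z` to `dendrogram` draws the same tree for visual inspection.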
### Statsmodels

Advanced statistical modeling.

```python
import statsmodels.api as sm
from statsmodels.stats.multitest import multipletests
```

Key Features:
- Linear regression models
- Generalized linear models
- Time series analysis
- Multiple testing correction
- Statistical tests
Example:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.multitest import multipletests

# Linear regression of gene expression on a phenotype (age)
n = 100
age = np.random.normal(50, 15, n)
gene_expr = 3 + 0.5 * age + np.random.normal(0, 5, n)

# Add a constant column for the intercept
X = sm.add_constant(age)
model = sm.OLS(gene_expr, X).fit()
print(model.summary())
print(f"R-squared: {model.rsquared:.4f}")
print(f"Age coefficient: {model.params[1]:.4f}")
print(f"P-value: {model.pvalues[1]:.4f}")

# Multiple testing correction (Benjamini-Hochberg FDR)
p_values = np.random.uniform(0, 0.1, 1000)
reject, pvals_corrected, _, _ = multipletests(p_values, method='fdr_bh')
print(f"Significant after correction: {reject.sum()}")
```

## Visualization
### Matplotlib

Foundation for plotting in Python.

```python
import matplotlib.pyplot as plt
```

Key Features:
- Publication-quality figures
- Multiple plot types
- Customizable styling
- Subplots and layouts
- Export to various formats
Example:

```python
import matplotlib.pyplot as plt
import numpy as np

# Create sample data
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Create a figure with two panels
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Line plot
ax1.plot(x, y1, label='sin(x)', linewidth=2)
ax1.plot(x, y2, label='cos(x)', linewidth=2)
ax1.set_xlabel('X')
ax1.set_ylabel('Y')
ax1.set_title('Trigonometric Functions')
ax1.legend()
ax1.grid(alpha=0.3)

# Scatter plot
gene_expr = np.random.lognormal(5, 2, 1000)
protein_levels = gene_expr * 0.8 + np.random.normal(0, 50, 1000)
ax2.scatter(gene_expr, protein_levels, alpha=0.5)
ax2.set_xlabel('Gene Expression')
ax2.set_ylabel('Protein Level')
ax2.set_title('Gene-Protein Correlation')

plt.tight_layout()
plt.savefig('plots.png', dpi=300, bbox_inches='tight')
plt.show()
```

### Seaborn
Statistical data visualization built on matplotlib.
```python
import seaborn as sns
```

Key Features:
- Beautiful default styles
- Statistical plots
- Heatmaps and clustermaps
- Distribution plots
- Categorical plots
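Clustermaps, which reorder rows and columns by hierarchical clustering before drawing the heatmap, deserve a quick sketch of their own. The matrix below is synthetic; `z_score=0` standardizes each row before clustering:

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)

# Synthetic expression matrix: 20 genes x 10 samples
data = pd.DataFrame(
    rng.normal(size=(20, 10)),
    index=[f"gene_{i}" for i in range(20)],
    columns=[f"sample_{j}" for j in range(10)],
)

# Heatmap with rows and columns reordered by hierarchical clustering
g = sns.clustermap(data, z_score=0, cmap='coolwarm', figsize=(6, 6))
g.savefig('clustermap.png', dpi=150)
```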
Example:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style('whitegrid')
sns.set_palette('husl')

# Create sample gene expression data
np.random.seed(42)
data = pd.DataFrame({
    'Gene_A': np.random.lognormal(5, 1, 100),
    'Gene_B': np.random.lognormal(4.5, 1.2, 100),
    'Gene_C': np.random.lognormal(6, 0.8, 100),
    'Condition': np.random.choice(['Control', 'Treatment'], 100)
})

# Violin plot of expression by condition (long format)
long_data = data.melt(id_vars='Condition', var_name='Gene', value_name='Expression')
plt.figure(figsize=(10, 6))
sns.violinplot(data=long_data, x='Gene', y='Expression', hue='Condition', split=True)
plt.title('Gene Expression by Condition')
plt.yscale('log')
plt.tight_layout()
plt.show()

# Heatmap of gene-gene correlations
corr_matrix = data[['Gene_A', 'Gene_B', 'Gene_C']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Gene Correlation Matrix')
plt.tight_layout()
plt.show()
```

## Machine Learning
### Scikit-learn

Comprehensive machine learning library.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
```

Key Features:
- Preprocessing and feature scaling
- Dimensionality reduction (PCA, t-SNE)
- Clustering algorithms
- Classification and regression
- Model evaluation
- Pipeline construction
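The classification imports above (`RandomForestClassifier`, `train_test_split`, `cross_val_score`) handle the supervised side. This sketch uses synthetic samples whose labels depend on two "genes", so treat the scores as illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(42)

# Synthetic samples x genes matrix with two separable classes
n_samples, n_genes = 100, 50
X = rng.normal(size=(n_samples, n_genes))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # label driven by two informative features

# Hold out a stratified test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")

# 5-fold cross-validation on the training set
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```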
Example:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Generate gene expression data (genes x samples)
np.random.seed(42)
n_genes = 1000
n_samples = 50
expr_data = np.random.lognormal(5, 2, (n_genes, n_samples))

# Standardize samples (transpose so rows are samples)
scaler = StandardScaler()
expr_scaled = scaler.fit_transform(expr_data.T)

# PCA
pca = PCA(n_components=2)
pca_coords = pca.fit_transform(expr_scaled)

# K-means clustering in PCA space
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(pca_coords)

# Plot
plt.figure(figsize=(10, 6))
scatter = plt.scatter(pca_coords[:, 0], pca_coords[:, 1],
                      c=clusters, cmap='viridis', s=100, alpha=0.6)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
plt.title('PCA with K-means Clustering')
plt.colorbar(scatter, label='Cluster')
plt.tight_layout()
plt.show()
```

## Specialized Bioinformatics
### GSEApy

Gene set enrichment analysis.

```python
import gseapy as gp
from gseapy import barplot, dotplot
```

Key Features:
- GSEA preranked analysis
- Over-representation analysis (Enrichr)
- Multiple gene set databases
- Visualization tools
Example:

```python
import numpy as np
import pandas as pd
import gseapy as gp

# Create a ranked gene list (e.g., by differential-expression statistic)
genes = [f'GENE{i}' for i in range(1000)]
ranks = np.random.randn(1000)
gene_rank = pd.Series(ranks, index=genes)

# Run preranked GSEA (fetches gene sets from Enrichr; requires internet access)
pre_res = gp.prerank(
    rnk=gene_rank,
    gene_sets='KEGG_2021_Human',
    outdir='gsea_output',
    permutation_num=100,
    seed=42
)

# View results
print(pre_res.res2d.head())
```

### Lifelines
Survival analysis.
```python
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test
```

Key Features:
- Kaplan-Meier estimation
- Cox proportional hazards
- Survival curves
- Statistical tests
Example:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# Generate survival data (1 = event observed, 0 = censored)
n = 100
data = pd.DataFrame({
    'time': np.random.exponential(10, n),
    'event': np.random.binomial(1, 0.7, n)
})

# Fit the Kaplan-Meier estimator
kmf = KaplanMeierFitter()
kmf.fit(data['time'], data['event'])

# Plot
kmf.plot_survival_function()
plt.title('Kaplan-Meier Survival Curve')
plt.xlabel('Time')
plt.ylabel('Survival Probability')
plt.show()

print(f"Median survival: {kmf.median_survival_time_:.2f}")
```

## Quick Reference
### Installation Commands

```bash
# Core bioinformatics
pip install biopython

# Data manipulation
pip install pandas numpy

# Statistics
pip install scipy statsmodels

# Visualization
pip install matplotlib seaborn plotly

# Machine learning
pip install scikit-learn

# Specialized
pip install gseapy lifelines

# All at once
pip install biopython pandas numpy scipy statsmodels \
    matplotlib seaborn scikit-learn gseapy lifelines
```

### Common Import Pattern
```python
# Standard imports for bioinformatics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from Bio import SeqIO
import gseapy as gp
from lifelines import KaplanMeierFitter

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.dpi'] = 100
```

### Library Comparison
| Library | Purpose | Best For | Learning Curve |
|---|---|---|---|
| Biopython | Sequences | File parsing, alignments | Medium |
| Pandas | Data tables | Gene matrices, metadata | Low |
| NumPy | Numerical | Math operations | Low |
| SciPy | Statistics | Hypothesis testing | Medium |
| Matplotlib | Plotting | Custom plots | Medium |
| Seaborn | Visualization | Quick statistical plots | Low |
| Scikit-learn | ML | Classification, clustering | Medium |
| GSEApy | Enrichment | Pathway analysis | Low |
| Lifelines | Survival | Time-to-event | Medium |
Pro Tip: Start with pandas, matplotlib, and scipy for basic analysis. Add specialized libraries as your projects require them.
## Next Steps
- Sequence Analysis - Start with Biopython
- Data Manipulation - Master pandas
- PCA - Learn dimensionality reduction with scikit-learn