
Essential Libraries Overview

Python’s strength in bioinformatics comes from its rich ecosystem of specialized libraries. This guide introduces the core tools you’ll use for biological data analysis.

Core Bioinformatics

Biopython

The foundational library for computational biology.

from Bio import SeqIO, Entrez, Seq, AlignIO
from Bio.Blast import NCBIWWW, NCBIXML

Key Features:

  • Sequence manipulation and analysis
  • File format parsing (FASTA, GenBank, PDB)
  • BLAST searches
  • Multiple sequence alignment
  • Phylogenetic analysis
  • Accessing biological databases (NCBI, UniProt)

Example:

from Bio.Seq import Seq
from Bio import SeqIO

# Create and manipulate sequences
dna_seq = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print(f"Original: {dna_seq}")
print(f"Complement: {dna_seq.complement()}")
print(f"Reverse complement: {dna_seq.reverse_complement()}")
print(f"Translation: {dna_seq.translate()}")

# Parse a FASTA file
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"ID: {record.id}")
    print(f"Sequence: {record.seq[:50]}...")
    print(f"Length: {len(record.seq)}")

Use Cases:

  • Reading and writing sequence files
  • Calculating GC content
  • Finding open reading frames
  • Sequence alignment
  • Phylogenetic tree construction
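
Two of the use cases above, GC content and open reading frames, can be sketched in a few lines. This is a minimal illustration, assuming Biopython 1.80+ (where `gc_fraction` in `Bio.SeqUtils` replaced the older `GC` helper); the ORF scan is a deliberately naive forward-frame search, not a production gene finder:

```python
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction

seq = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# GC content as a fraction (multiply by 100 for percent)
print(f"GC content: {gc_fraction(seq):.2%}")

# Naive ORF scan: first ATG...stop in each forward reading frame
stops = {"TAA", "TAG", "TGA"}
orfs = []
for frame in range(3):
    for start in range(frame, len(seq) - 2, 3):
        if seq[start:start + 3] == "ATG":
            for end in range(start + 3, len(seq) - 2, 3):
                if str(seq[end:end + 3]) in stops:
                    orfs.append((frame, start, end + 3))
                    break
            break
print(f"ORFs found: {orfs}")
```

For real annotation work you would typically lean on dedicated tools, but this shows how naturally `Seq` objects compose with plain Python string logic.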

Data Manipulation

Pandas

Essential for handling tabular biological data.

import pandas as pd

Key Features:

  • DataFrames for gene expression matrices
  • Efficient data filtering and selection
  • Group-by operations
  • Merging and joining datasets
  • Statistical operations
  • Time series analysis

Example:

import pandas as pd
import numpy as np

# Load gene expression data
expr_df = pd.read_csv('gene_expression.csv', index_col=0)

# Basic operations
print(f"Shape: {expr_df.shape}")
print(f"Genes: {expr_df.shape[0]}, Samples: {expr_df.shape[1]}")

# Filter highly expressed genes
mean_expr = expr_df.mean(axis=1)
high_expr_genes = expr_df[mean_expr > mean_expr.quantile(0.75)]

# Calculate correlation between samples
sample_corr = expr_df.corr()

# Group by metadata (numeric_only avoids errors on the string columns
# in pandas 2.x)
metadata = pd.read_csv('metadata.csv')
expr_with_meta = expr_df.T.merge(metadata, left_index=True, right_on='sample')
grouped = expr_with_meta.groupby('condition').mean(numeric_only=True)

Use Cases:

  • Gene expression matrices
  • Clinical data management
  • Differential expression results
  • Pathway enrichment tables
  • Metadata organization

NumPy

Fundamental numerical computing library.

import numpy as np

Key Features:

  • Fast array operations
  • Mathematical functions
  • Linear algebra
  • Random number generation
  • Broadcasting for efficient computation

Example:

import numpy as np

# Create a synthetic expression matrix
genes = 1000
samples = 50
expr_matrix = np.random.lognormal(mean=5, sigma=2, size=(genes, samples))

# Log transform
log_expr = np.log2(expr_matrix + 1)

# Calculate per-gene z-scores
z_scores = (expr_matrix - expr_matrix.mean(axis=1, keepdims=True)) / \
    expr_matrix.std(axis=1, keepdims=True)

# Find outliers
outliers = np.abs(z_scores) > 3
print(f"Outlier values: {outliers.sum()}")

# Correlation matrix between samples
corr_matrix = np.corrcoef(expr_matrix.T)
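
Broadcasting, listed among the key features, deserves its own illustration. This sketch normalizes each sample (column) of a toy counts matrix to counts-per-million without an explicit loop, by dividing a 2-D array by a 1-D vector of column sums:

```python
import numpy as np

# Toy counts matrix: genes x samples
counts = np.array([[120, 300,  80],
                   [ 30,  50,  10],
                   [850, 650, 910]], dtype=float)

# Column sums have shape (3,); broadcasting stretches them across rows,
# so every column is divided by its own library size
cpm = counts / counts.sum(axis=0) * 1e6
print(cpm.sum(axis=0))  # each column now sums to 1,000,000
```

The same one-liner works for a 20,000 x 500 matrix, which is why broadcasting is worth internalizing early.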

Statistical Analysis

SciPy

Scientific computing and statistical tests.

from scipy import stats
from scipy.cluster.hierarchy import linkage, dendrogram

Key Features:

  • Statistical distributions
  • Hypothesis testing
  • Clustering algorithms
  • Signal processing
  • Optimization
  • Interpolation

Example:

from scipy import stats
import numpy as np

# Generate sample data
control = np.random.normal(100, 15, 30)
treatment = np.random.normal(110, 15, 30)

# T-test
t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Mann-Whitney U test (non-parametric)
u_stat, p_value_mw = stats.mannwhitneyu(treatment, control)
print(f"Mann-Whitney U: {u_stat:.4f}")
print(f"P-value: {p_value_mw:.4f}")

# Correlation
gene_a = np.random.randn(100)
gene_b = gene_a * 0.8 + np.random.randn(100) * 0.2
corr, p_val = stats.pearsonr(gene_a, gene_b)
print(f"Correlation: {corr:.4f} (p={p_val:.4f})")

# Multiple testing correction (Benjamini-Hochberg; requires SciPy >= 1.11).
# Note: false_discovery_control returns only the adjusted p-values.
from scipy.stats import false_discovery_control
p_values = np.random.uniform(0, 0.1, 1000)
adjusted = false_discovery_control(p_values)
print(f"Significant after correction: {(adjusted < 0.05).sum()}")

Use Cases:

  • Differential expression testing
  • Correlation analysis
  • Distribution fitting
  • Hierarchical clustering
  • Multiple testing correction
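
The `linkage` helper imported at the top of this section covers the hierarchical-clustering use case; the example above does not exercise it, so here is a minimal sketch on simulated data with two well-separated groups (`fcluster` is used to cut the tree into flat clusters):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# 20 samples x 5 features: two groups with shifted means
data = np.vstack([rng.normal(0, 1, (10, 5)),
                  rng.normal(5, 1, (10, 5))])

# Ward linkage on Euclidean distances; Z has one row per merge
Z = linkage(data, method='ward')

# Cut the tree into two flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```

Passing `Z` to `dendrogram` (also imported above) draws the tree itself, which is the usual companion figure to a clustered heatmap.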

Statsmodels

Advanced statistical modeling.

import statsmodels.api as sm
from statsmodels.stats.multitest import multipletests

Key Features:

  • Linear regression models
  • Generalized linear models
  • Time series analysis
  • Multiple testing correction
  • Statistical tests

Example:

import statsmodels.api as sm
import numpy as np
import pandas as pd

# Linear regression for gene expression vs. phenotype
n = 100
age = np.random.normal(50, 15, n)
gene_expr = 3 + 0.5 * age + np.random.normal(0, 5, n)

# Add constant for intercept
X = sm.add_constant(age)
model = sm.OLS(gene_expr, X).fit()

print(model.summary())
print(f"R-squared: {model.rsquared:.4f}")
print(f"Age coefficient: {model.params[1]:.4f}")
print(f"P-value: {model.pvalues[1]:.4f}")

# Multiple testing correction
from statsmodels.stats.multitest import multipletests
p_values = np.random.uniform(0, 0.1, 1000)
reject, pvals_corrected, _, _ = multipletests(p_values, method='fdr_bh')
print(f"Significant after correction: {reject.sum()}")

Visualization

Matplotlib

Foundation for plotting in Python.

import matplotlib.pyplot as plt

Key Features:

  • Publication-quality figures
  • Multiple plot types
  • Customizable styling
  • Subplots and layouts
  • Export to various formats

Example:

import matplotlib.pyplot as plt
import numpy as np

# Create sample data
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Create plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Line plot
ax1.plot(x, y1, label='sin(x)', linewidth=2)
ax1.plot(x, y2, label='cos(x)', linewidth=2)
ax1.set_xlabel('X')
ax1.set_ylabel('Y')
ax1.set_title('Trigonometric Functions')
ax1.legend()
ax1.grid(alpha=0.3)

# Scatter plot
gene_expr = np.random.lognormal(5, 2, 1000)
protein_levels = gene_expr * 0.8 + np.random.normal(0, 50, 1000)
ax2.scatter(gene_expr, protein_levels, alpha=0.5)
ax2.set_xlabel('Gene Expression')
ax2.set_ylabel('Protein Level')
ax2.set_title('Gene-Protein Correlation')

plt.tight_layout()
plt.savefig('plots.png', dpi=300, bbox_inches='tight')
plt.show()

Seaborn

Statistical data visualization built on matplotlib.

import seaborn as sns

Key Features:

  • Beautiful default styles
  • Statistical plots
  • Heatmaps and clustermaps
  • Distribution plots
  • Categorical plots

Example:

import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set style
sns.set_style('whitegrid')
sns.set_palette('husl')

# Create sample gene expression data
np.random.seed(42)
data = pd.DataFrame({
    'Gene_A': np.random.lognormal(5, 1, 100),
    'Gene_B': np.random.lognormal(4.5, 1.2, 100),
    'Gene_C': np.random.lognormal(6, 0.8, 100),
    'Condition': np.random.choice(['Control', 'Treatment'], 100)
})

# Violin plot
plt.figure(figsize=(10, 6))
sns.violinplot(data=data.melt(id_vars='Condition', var_name='Gene',
                              value_name='Expression'),
               x='Gene', y='Expression', hue='Condition', split=True)
plt.title('Gene Expression by Condition')
plt.yscale('log')
plt.tight_layout()
plt.show()

# Heatmap
corr_matrix = data[['Gene_A', 'Gene_B', 'Gene_C']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Gene Correlation Matrix')
plt.tight_layout()
plt.show()
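
The clustermaps listed among the key features combine a heatmap with hierarchical clustering of rows and columns in one call. A minimal sketch on a synthetic expression matrix (shape, labels, and color map are arbitrary choices):

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
# Small synthetic expression matrix: 20 genes x 8 samples
expr = pd.DataFrame(rng.lognormal(3, 1, (20, 8)),
                    index=[f'Gene_{i}' for i in range(20)],
                    columns=[f'S{i}' for i in range(8)])

# Cluster both axes, z-scoring each gene (row) before plotting
g = sns.clustermap(expr, z_score=0, cmap='vlag', figsize=(6, 6))
g.savefig('clustermap.png', dpi=150)
```

`clustermap` returns a `ClusterGrid`, whose `dendrogram_row.reordered_ind` attribute exposes the clustered row order if you need it downstream.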

Machine Learning

Scikit-learn

Comprehensive machine learning library.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

Key Features:

  • Preprocessing and feature scaling
  • Dimensionality reduction (PCA, t-SNE)
  • Clustering algorithms
  • Classification and regression
  • Model evaluation
  • Pipeline construction

Example:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

# Generate gene expression data
np.random.seed(42)
n_genes = 1000
n_samples = 50
expr_data = np.random.lognormal(5, 2, (n_genes, n_samples))

# Standardize (samples as rows)
scaler = StandardScaler()
expr_scaled = scaler.fit_transform(expr_data.T)

# PCA
pca = PCA(n_components=2)
pca_coords = pca.fit_transform(expr_scaled)

# K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(pca_coords)

# Plot
plt.figure(figsize=(10, 6))
scatter = plt.scatter(pca_coords[:, 0], pca_coords[:, 1],
                      c=clusters, cmap='viridis', s=100, alpha=0.6)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
plt.title('PCA with K-means Clustering')
plt.colorbar(scatter, label='Cluster')
plt.tight_layout()
plt.show()
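
The classifier and model-selection imports at the top of this section are not used in the PCA example, so here is a sketch of the supervised side: a random forest on a synthetic cohort whose labels are driven by a single informative gene (the simulation is purely illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(42)
# Synthetic cohort: 200 samples x 50 genes; labels depend on gene 0
X = rng.normal(0, 1, (200, 50))
y = (X[:, 0] + rng.normal(0, 0.5, 200) > 0).astype(int)

# Held-out split, stratified to preserve class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")

# 5-fold cross-validation on the full data
scores = cross_val_score(clf, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

With real expression data you would standardize features and guard against leakage by fitting any scaler inside a `Pipeline`, not on the full dataset.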

Specialized Bioinformatics

GSEApy

Gene set enrichment analysis.

import gseapy as gp
from gseapy import barplot, dotplot

Key Features:

  • GSEA preranked analysis
  • Over-representation analysis (Enrichr)
  • Multiple gene set databases
  • Visualization tools

Example:

import gseapy as gp
import pandas as pd
import numpy as np

# Create a ranked gene list (prerank expects a pd.Series, DataFrame,
# or .rnk file, sorted by the ranking metric)
genes = [f'GENE{i}' for i in range(1000)]
ranks = np.random.randn(1000)
gene_rank = pd.Series(ranks, index=genes).sort_values(ascending=False)

# Run preranked GSEA
pre_res = gp.prerank(
    rnk=gene_rank,
    gene_sets='KEGG_2021_Human',
    outdir='gsea_output',
    permutation_num=100,
    seed=42
)

# View results
print(pre_res.res2d.head())

Lifelines

Survival analysis.

from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

Key Features:

  • Kaplan-Meier estimation
  • Cox proportional hazards
  • Survival curves
  • Statistical tests

Example:

from lifelines import KaplanMeierFitter
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate survival data
n = 100
data = pd.DataFrame({
    'time': np.random.exponential(10, n),
    'event': np.random.binomial(1, 0.7, n)
})

# Fit Kaplan-Meier
kmf = KaplanMeierFitter()
kmf.fit(data['time'], data['event'])

# Plot
kmf.plot_survival_function()
plt.title('Kaplan-Meier Survival Curve')
plt.xlabel('Time')
plt.ylabel('Survival Probability')
plt.show()

print(f"Median survival: {kmf.median_survival_time_:.2f}")

Quick Reference

Installation Commands

# Core bioinformatics
pip install biopython

# Data manipulation
pip install pandas numpy

# Statistics
pip install scipy statsmodels

# Visualization
pip install matplotlib seaborn plotly

# Machine learning
pip install scikit-learn

# Specialized
pip install gseapy lifelines

# All at once
pip install biopython pandas numpy scipy statsmodels \
    matplotlib seaborn scikit-learn gseapy lifelines

Common Import Pattern

# Standard imports for bioinformatics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from Bio import SeqIO
import gseapy as gp
from lifelines import KaplanMeierFitter

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.dpi'] = 100

Library Comparison

Library       Purpose        Best For                    Learning Curve
Biopython     Sequences      File parsing, alignments    Medium
Pandas        Data tables    Gene matrices, metadata     Low
NumPy         Numerical      Math operations             Low
SciPy         Statistics     Hypothesis testing          Medium
Matplotlib    Plotting       Custom plots                Medium
Seaborn       Visualization  Quick statistical plots     Low
Scikit-learn  ML             Classification, clustering  Medium
GSEApy        Enrichment     Pathway analysis            Low
Lifelines     Survival       Time-to-event analysis      Medium

Pro Tip: Start with pandas, matplotlib, and scipy for basic analysis. Add specialized libraries as your projects require them.
