Skip to content

Latest commit

 

History

History
682 lines (520 loc) · 20.7 KB

File metadata and controls

682 lines (520 loc) · 20.7 KB

RNA-seq Analysis Pipeline Script Documentation

Table of Contents

  1. 01_run_fastqc.sh
  2. 02_build_index_minimal_memory.sh
  3. 03_check_read_length.sh
  4. 04_trim_reads_linux_fs.sh
  5. 05_run_fastqc_trimmed.sh
  6. 06_align_reads_hisat2_parallel_verbose.sh
  7. 07_count_features.sh
  8. 08_run_deseq2.sh / 08_deseq2_analysis.R
  9. 09_run_gene_categories.sh / 09_extract_gene_categories.R
  10. 10_run_separate_plots.sh / 10_plot_gene_categories_separately.R
  11. 11_run_gene_list_prep.sh / 11_prepare_gene_lists_for_enrichment.R
  12. 12_run_enrichment_analysis.sh / 12_functional_enrichment_analysis.R

01_run_fastqc.sh

Description

Performs quality control analysis on raw sequencing reads using FastQC and generates an aggregate report with MultiQC.

Usage

bash scripts/01_run_fastqc.sh

Arguments

None

Input Files

  • raw_data/*.fastq.gz - Raw paired-end FASTQ files (symlinks to original data)

Output Files

  • results/qc/fastqc_raw/*_fastqc.html - Individual FastQC HTML reports
  • results/qc/fastqc_raw/*_fastqc.zip - FastQC data files
  • results/qc/multiqc_raw/multiqc_report.html - Aggregated MultiQC report

Dependencies

  • FastQC v0.12.1
  • MultiQC v1.14
  • conda environment: rnaseq-zf

Details

The script processes all FASTQ files in the raw_data directory using 8 threads per file. FastQC analyzes sequence quality, GC content, adapter contamination, overrepresented sequences, and other quality metrics. MultiQC aggregates all FastQC reports into a single interactive HTML report.

Examples

# Run quality control on raw reads
bash scripts/01_run_fastqc.sh
# View results in web browser
firefox results/qc/multiqc_raw/multiqc_report.html

Notes

  • Processing time: ~5-10 minutes for 6 files
  • Memory usage: ~2GB per file
  • Output files total ~100MB

02_build_index_minimal_memory.sh

Description

Builds HISAT2 genome index with minimal memory usage, optimized for systems with limited RAM.

Usage

bash scripts/02_build_index_minimal_memory.sh

Arguments

None

Input Files

  • ~/rnaseq_workspace/reference/bTaeGut7.mat.fa - Reference genome FASTA
  • ~/rnaseq_workspace/reference/bTaeGut7.mat.gtf - Gene annotations GTF

Output Files

  • ~/rnaseq_workspace/reference/hisat2_index/*.ht2 - 8 HISAT2 index files
  • ~/rnaseq_workspace/reference/hisat2_index/splice_sites.txt - Splice site coordinates
  • ~/rnaseq_workspace/reference/hisat2_index/exons.txt - Exon coordinates
  • reference/hisat2_index/ - Copy in project directory

Dependencies

  • HISAT2 v2.2.1
  • Python (for GTF parsing scripts)
  • ~8GB RAM minimum

Details

Uses 2 threads to minimize memory footprint while building a splice-aware index. Extracts splice sites and exons from GTF file for accurate RNA-seq alignment. Works entirely on Linux filesystem to avoid I/O bottlenecks.

Environment Variables

  • THREADS=2 - Number of threads (hardcoded for memory efficiency)

Examples

# Build index with minimal memory
bash scripts/02_build_index_minimal_memory.sh
# Check index files
ls -lh ~/rnaseq_workspace/reference/hisat2_index/

Notes

  • Processing time: 30-60 minutes
  • Memory usage: ~4-6GB peak
  • Index size: ~2.2GB total

03_check_read_length.sh

Description

Verifies the actual read length of FASTQ files by sampling sequences.

Usage

bash scripts/03_check_read_length.sh

Arguments

None

Input Files

  • raw_data/*.fastq.gz - Raw FASTQ files to check

Output

Console output showing read length distribution for each file

Dependencies

  • Standard Unix tools (zcat, awk, sort, uniq)

Details

Samples first 1000 reads from each FASTQ file and reports the length distribution. Essential for verifying expected read length before alignment parameter optimization.

Examples

bash scripts/03_check_read_length.sh
# Expected output:
# Checking read length for 12590-KS-1_S1_L005_R1_001.fastq.gz:
#    1000 151

Notes

  • Quick check: <1 minute total
  • Confirms 151bp read length for parameter settings

04_trim_reads_linux_fs.sh

Description

Performs adapter trimming and quality filtering using fastp with all I/O on Linux filesystem for maximum performance.

Usage

bash scripts/04_trim_reads_linux_fs.sh

Arguments

None

Input Files

  • ~/rnaseq_workspace/raw_data/*_R1_001.fastq.gz - Forward reads
  • ~/rnaseq_workspace/raw_data/*_R2_001.fastq.gz - Reverse reads

Output Files

  • ~/rnaseq_workspace/trimmed_data/*_R[12]_trimmed.fastq.gz - Trimmed reads
  • ~/rnaseq_workspace/qc/fastp_reports/*.html - Fastp QC reports
  • ~/rnaseq_workspace/qc/fastp_reports/*.json - Fastp statistics
  • trimmed_data/ - Copy in project directory

Dependencies

  • fastp v0.23.4
  • ~10GB free disk space

Parameters

  • --qualified_quality_phred 20 - Minimum base quality
  • --length_required 36 - Minimum read length
  • --detect_adapter_for_pe - Auto-detect adapters
  • --thread 8 - Processing threads

Details

Performs quality-based trimming, adapter removal, and filtering. Automatically detects and removes Illumina adapters. Generates detailed QC reports including insert size distribution.

Examples

# Run trimming on Linux filesystem
bash scripts/04_trim_reads_linux_fs.sh
# Check trimming statistics
grep "reads passed filter" ~/rnaseq_workspace/logs/*_fastp.log

Notes

  • Processing time: ~6.5 minutes total
  • ~98% read retention rate
  • Output size: ~34GB

05_run_fastqc_trimmed.sh

Description

Performs quality control analysis on trimmed reads to assess improvement after processing.

Usage

bash scripts/05_run_fastqc_trimmed.sh

Arguments

None

Input Files

  • trimmed_data/*_trimmed.fastq.gz - Trimmed FASTQ files

Output Files

  • results/qc/fastqc_trimmed/*_fastqc.html - FastQC reports
  • results/qc/fastqc_trimmed/*_fastqc.zip - FastQC data
  • results/qc/multiqc_trimmed/*_multiqc_report.html - MultiQC summary

Dependencies

  • FastQC v0.12.1
  • MultiQC v1.14

Details

Identical analysis to 01_run_fastqc.sh but on trimmed reads. Enables direct comparison of quality metrics before and after trimming.

Examples

bash scripts/05_run_fastqc_trimmed.sh
# Compare with raw QC
diff results/qc/multiqc_raw/multiqc_data/multiqc_general_stats.txt \
     results/qc/multiqc_trimmed/*/multiqc_general_stats.txt

Notes

  • Verifies adapter removal success
  • Confirms quality score improvements

06_align_reads_hisat2_parallel_verbose.sh

Description

Performs splice-aware alignment of RNA-seq reads using HISAT2 with parallel processing and progress monitoring.

Usage

bash scripts/06_align_reads_hisat2_parallel_verbose.sh

Arguments

None

Input Files

  • ~/rnaseq_workspace/trimmed_data/*_R[12]_trimmed.fastq.gz - Trimmed reads
  • ~/rnaseq_workspace/reference/hisat2_index/bTaeGut7.mat - HISAT2 index

Output Files

  • ~/rnaseq_workspace/alignments/*_aligned_sorted.bam - Sorted alignments
  • ~/rnaseq_workspace/alignments/*_aligned_sorted.bam.bai - BAM indices
  • ~/rnaseq_workspace/logs/alignment/*_alignment_summary.txt - HISAT2 statistics
  • ~/rnaseq_workspace/logs/alignment/*_flagstat.txt - Samtools statistics
  • alignments/ and logs/alignment/ - Copies in project directory

Dependencies

  • HISAT2 v2.2.1
  • Samtools v1.20
  • ~40GB free disk space

Parameters

  • -p 8 - Threads per sample
  • --dta - Downstream transcript assembly mode
  • --rna-strandness RF - Reverse stranded library
  • --maxins 800 - Maximum insert size
  • --no-softclip - Disable soft clipping

Details

Processes all three samples in parallel with real-time progress monitoring. Pipes HISAT2 output directly to samtools for sorting, avoiding large intermediate SAM files. Includes automatic indexing and comprehensive statistics generation.

Environment Variables

  • INDEX_PATH - Location of HISAT2 index (auto-detected)

Examples

# Run parallel alignment with progress monitoring
bash scripts/06_align_reads_hisat2_parallel_verbose.sh
# Check alignment rates
grep "overall alignment rate" logs/alignment/*_alignment_summary.txt

Notes

  • Processing time: ~42 minutes (parallel)
  • Memory usage: ~27GB total for 3 samples
  • Alignment rates: 79-81%

07_count_features.sh

Description

Quantifies gene expression by counting reads overlapping genomic features using featureCounts.

Usage

bash scripts/07_count_features.sh

Arguments

None

Input Files

  • alignments/*_aligned_sorted.bam - Sorted BAM files
  • reference/bTaeGut7.mat.gtf - Gene annotations

Output Files

  • counts/gene_counts.txt - Full count matrix with annotations
  • counts/gene_counts_matrix_named.txt - Simplified count matrix
  • counts/gene_counts.txt.summary - Assignment statistics
  • counts/library_statistics.txt - Sample-level statistics
  • counts/top_expressed_genes.txt - Highest expression genes
  • counts/count_summary_report.txt - Complete summary

Dependencies

  • featureCounts v2.0.6 (Subread)
  • R with base packages

Parameters

  • -t exon - Count exon features
  • -g gene_id - Group by gene
  • -p - Count fragments (paired-end)
  • -B - Both ends mapped
  • -C - Exclude chimeric reads
  • -s 2 - Reverse stranded
  • -T 8 - Number of threads

Details

Performs strand-specific quantification at gene level. Includes automatic calculation of library sizes, gene detection rates, and identification of highly expressed genes. Generates both full annotation matrix and simplified count matrix for downstream analysis.

Examples

bash scripts/07_count_features.sh
# Check assignment rate
grep "Assigned" counts/gene_counts.txt.summary
# View top genes
head counts/top_expressed_genes.txt

Notes

  • Processing time: 5-10 minutes
  • Assignment rate: 60-65%
  • Output matrix: 20,040 genes × 3 samples

08_run_deseq2.sh / 08_deseq2_analysis.R

Description

Performs differential expression analysis between brain regions using DESeq2, adapted for single-replicate design.

Usage

bash scripts/08_run_deseq2.sh

Arguments

None (R script arguments handled by wrapper)

Input Files

  • counts/gene_counts_matrix_named.txt - Gene count matrix
  • Gene annotations (internal to R)

Output Files

  • results/differential_expression/normalized_counts.txt - Size-factor normalized counts
  • results/differential_expression/DE_results_*.txt - Full comparison results
  • results/differential_expression/DE_2fold_*.txt - Genes with >2-fold change
  • results/differential_expression/analysis_summary.txt - Summary statistics
  • results/plots/PCA_brain_regions.pdf - PCA plot
  • results/plots/volcano_*.pdf - Volcano plots
  • results/plots/sample_distance_heatmap.pdf - Sample clustering
  • results/plots/top50_variable_genes_heatmap.pdf - Variable gene heatmap

Dependencies

  • R v4.3.3
  • DESeq2 v1.42.0
  • ggplot2, pheatmap, tidyverse, EnhancedVolcano

Parameters

  • Filter: ≥10 reads in ≥2 samples
  • Fold change threshold: |log2FC| > 1
  • No p-values (single replicates)

Details

Modified DESeq2 workflow for unreplicated data. Uses size factor normalization and variance stabilizing transformation for visualization. Performs all pairwise comparisons and identifies differentially expressed genes based on fold change criteria.

R Functions

  • perform_comparison() - Pairwise fold change analysis
  • create_single_scale_plot() - Visualization functions

Examples

# Run differential expression analysis
bash scripts/08_run_deseq2.sh
# View summary
cat results/differential_expression/analysis_summary.txt
# Check specific comparison
head results/differential_expression/DE_2fold_Area_X_vs_RA.txt

Notes

  • No statistical significance testing
  • Fold change threshold: 2-fold
  • ~3,500 genes differ between regions

09_run_gene_categories.sh / 09_extract_gene_categories.R

Description

Extracts and analyzes expression patterns of categorized genes (housekeeping and song system genes).

Usage

bash scripts/09_run_gene_categories.sh

Arguments

None

Input Files

  • results/differential_expression/gene_categories_for_plotting.csv - Gene categories
  • results/differential_expression/normalized_counts.txt - Expression data

Output Files

  • results/differential_expression/categorized_gene_expression.csv - Combined data
  • results/plots/housekeeping_gene_stability.pdf - Stability analysis
  • results/plots/gene_categories_heatmap.pdf - Expression heatmap
  • results/plots/category_expression_comparison.pdf - Category comparison
  • results/plots/key_genes_expression.pdf - Selected gene plots

Dependencies

  • R with tidyverse, ggplot2, pheatmap

Gene Categories

  • Housekeeping: Olias2014 (PGK1, RPS7, TFRC, YWHAZ) and ZinzowKramer2014 (HPRT1, RPL4, PPIA)
  • Song System: Transcription factors, immediate early genes, neurotrophic factors, receptors, etc.

Details

Analyzes stability of housekeeping genes across brain regions and expression patterns of song system genes. Calculates coefficient of variation for stability assessment and generates multiple visualization formats.

Examples

bash scripts/09_run_gene_categories.sh
# Most stable housekeeping gene will be shown in console output

Notes

  • Identifies region-specific marker genes
  • Validates housekeeping gene assumptions

10_run_separate_plots.sh / 10_plot_gene_categories_separately.R

Description

Creates publication-quality expression plots for gene categories with multiple visualization options.

Usage

bash scripts/10_run_separate_plots.sh

Arguments

None

Input Files

  • results/differential_expression/gene_categories_for_plotting.csv - Gene list
  • results/differential_expression/normalized_counts.txt - Expression values

Output Files

Multi-panel plots (alphabetical):

  • results/plots/song_system_genes_expression_separated.pdf
  • results/plots/housekeeping_genes_expression.pdf

Single-scale plots (by expression):

  • results/plots/song_system_genes_single_scale.pdf
  • results/plots/housekeeping_genes_single_scale.pdf
  • results/plots/song_system_genes_log_scale.pdf
  • results/plots/housekeeping_genes_log_scale.pdf

Subcategory plots:

  • results/plots/song_system_*_genes.pdf

Dependencies

  • R with ggplot2, tidyverse

Plot Types

  1. Multi-panel: Each gene in separate panel, alphabetical order
  2. Single-scale: All genes on one plot, ordered by expression
  3. Log-scale: Same as single-scale with log10 y-axis
  4. Subcategory: Grouped by functional annotation

Color Scheme

  • Area X: #E69F00 (orange)
  • RA: #56B4E9 (sky blue)
  • HVC: #009E73 (teal)

Details

Generates multiple visualization formats optimized for different purposes. Multi-panel plots show individual patterns, single-scale enables direct comparison, log-scale handles wide expression ranges. All plots use colorblind-friendly palettes.

Examples

bash scripts/10_run_separate_plots.sh
# View expression ordered by level
evince results/plots/song_system_genes_single_scale.pdf

Notes

  • Horizontal bar orientation for readability
  • Dynamic plot sizing based on gene count
  • Consistent color scheme across all plots


11_run_gene_list_prep.sh / 11_prepare_gene_lists_for_enrichment.R

Description

Prepares gene lists for functional enrichment analysis by identifying region-enriched genes based on fold-change criteria.

Usage

bash scripts/11_run_gene_list_prep.sh

Arguments

None

Input Files

  • results/differential_expression/DE_results_Area_X_vs_RA.txt
  • results/differential_expression/DE_results_Area_X_vs_HVC.txt
  • results/differential_expression/DE_results_RA_vs_HVC.txt

Output Files

Region-enriched gene lists:

  • results/enrichment_analysis/area_x_enriched_genes.txt - Genes higher in Area X vs both RA and HVC
  • results/enrichment_analysis/ra_enriched_genes.txt - Genes higher in RA vs both Area X and HVC
  • results/enrichment_analysis/hvc_enriched_genes.txt - Genes higher in HVC vs both Area X and RA
  • results/enrichment_analysis/background_genes.txt - All tested genes (for statistical background)

Full tables with fold changes:

  • results/enrichment_analysis/*_enriched_genes_full.txt - Includes fold change values

Pairwise comparisons:

  • results/enrichment_analysis/area_x_up_vs_ra_genes.txt
  • results/enrichment_analysis/area_x_up_vs_hvc_genes.txt
  • results/enrichment_analysis/ra_up_vs_hvc_genes.txt

Dependencies

  • R with tidyverse

Parameters

  • FC_THRESHOLD = 1 - Log2 fold change threshold (2-fold linear)

Details

Identifies genes specifically enriched in each brain region using a stringent criterion: a gene must be >2-fold higher in one region compared to BOTH other regions. Also extracts genes from pairwise comparisons for broader analysis. Outputs include both simple gene lists and detailed tables with fold change information.

Examples

bash scripts/11_run_gene_list_prep.sh
# Check number of enriched genes
wc -l results/enrichment_analysis/*_enriched_genes.txt

Notes

  • Region-enriched lists are more stringent than pairwise comparisons
  • Background gene list used for enrichment statistics
  • Summary statistics printed to console

12_run_enrichment_analysis.sh / 12_functional_enrichment_analysis.R

Description

Maps zebra finch genes to human orthologs and performs functional enrichment analysis using GO terms and KEGG pathways.

Usage

bash scripts/12_run_enrichment_analysis.sh

Arguments

None

Input Files

  • results/enrichment_analysis/area_x_enriched_genes.txt
  • results/enrichment_analysis/ra_enriched_genes.txt
  • results/enrichment_analysis/hvc_enriched_genes.txt
  • results/enrichment_analysis/background_genes.txt

Output Files

Ortholog mappings:

  • results/enrichment_analysis/*_human_orthologs.txt - Zebra finch to human gene mappings

GO enrichment results:

  • results/enrichment_analysis/GO_results/*_GO_BP.txt - Biological Process
  • results/enrichment_analysis/GO_results/*_GO_CC.txt - Cellular Component
  • results/enrichment_analysis/GO_results/*_GO_MF.txt - Molecular Function

KEGG results:

  • results/enrichment_analysis/KEGG_results/*_KEGG.txt - Pathway enrichment

Visualizations:

  • results/plots/enrichment/*_GO_BP_barplot.pdf - Top 20 GO terms
  • results/plots/enrichment/*_GO_BP_dotplot.pdf - GO enrichment dot plots
  • results/plots/enrichment/*_KEGG_dotplot.pdf - KEGG pathway plots

Summary:

  • results/enrichment_analysis/enrichment_summary.txt - Top enriched terms for all regions

Dependencies

  • R packages: biomaRt, clusterProfiler, org.Hs.eg.db, enrichplot, ggplot2, tidyverse
  • Internet connection (for BioMart queries)

Parameters

  • pvalueCutoff = 0.05 - Enrichment p-value threshold
  • qvalueCutoff = 0.2 - FDR threshold
  • Ortholog filters: one-to-one and one-to-many only

Details

Connects to Ensembl BioMart to retrieve human orthologs for zebra finch genes. Only high-confidence orthologs are retained. Performs hypergeometric testing for GO term and KEGG pathway enrichment using the clusterProfiler package. Background gene set is used for proper statistical testing.

R Functions

  • map_to_human_orthologs() - BioMart ortholog mapping
  • perform_enrichment() - GO and KEGG analysis
  • create_enrichment_plots() - Visualization generation

Examples

# Check if required packages are installed first
bash scripts/12_run_enrichment_analysis.sh
# If packages missing, install with:
Rscript scripts/install_enrichment_packages.R
# Then run analysis
bash scripts/12_run_enrichment_analysis.sh

Notes

  • BioMart queries can take several minutes
  • Not all zebra finch genes have human orthologs
  • Enrichment p-values are valid despite single replicates
  • Internet required for Ensembl database access

General Pipeline Notes

Environment Setup

All scripts require activation of the conda environment:

conda activate rnaseq-zf

File Organization

  • Scripts: scripts/
  • Raw data: raw_data/ (symlinks)
  • Linux FS workspace: ~/rnaseq_workspace/
  • Results: results/, counts/, alignments/

Performance Optimization

  • Linux filesystem used for I/O intensive operations
  • Parallel processing where applicable
  • Memory-efficient parameter choices

Error Handling

All scripts use set -euo pipefail for robust error handling and will exit on first error.

Logging

Most scripts generate detailed logs in logs/ subdirectories for troubleshooting.