RNA-seq Analysis Pipeline Script Documentation

01_run_fastqc.sh
02_build_index_minimal_memory.sh
03_check_read_length.sh
04_trim_reads_linux_fs.sh
05_run_fastqc_trimmed.sh
06_align_reads_hisat2_parallel_verbose.sh
07_count_features.sh
08_run_deseq2.sh / 08_deseq2_analysis.R
09_run_gene_categories.sh / 09_extract_gene_categories.R
10_run_separate_plots.sh / 10_plot_gene_categories_separately.R
11_run_gene_list_prep.sh / 11_prepare_gene_lists_for_enrichment.R
12_run_enrichment_analysis.sh / 12_functional_enrichment_analysis.R

01_run_fastqc.sh

Description

Performs quality control analysis on raw sequencing reads using FastQC and generates an aggregate report with MultiQC.

Usage

bash scripts/01_run_fastqc.sh

Arguments

None

Input Files

raw_data/*.fastq.gz - Raw paired-end FASTQ files (symlinks to original data)

Output Files

results/qc/fastqc_raw/*_fastqc.html - Individual FastQC HTML reports
results/qc/fastqc_raw/*_fastqc.zip - FastQC data files
results/qc/multiqc_raw/multiqc_report.html - Aggregated MultiQC report

Dependencies

FastQC v0.12.1
MultiQC v1.14
conda environment: rnaseq-zf

Details

The script processes all FASTQ files in the raw_data directory using 8 threads per file. FastQC analyzes sequence quality, GC content, adapter contamination, overrepresented sequences, and other quality metrics. MultiQC aggregates all FastQC reports into a single interactive HTML report.

Examples

# Run quality control on raw reads
bash scripts/01_run_fastqc.sh
# View results in web browser
firefox results/qc/multiqc_raw/multiqc_report.html

Notes

Processing time: ~5-10 minutes for 6 files
Memory usage: ~2GB per file
Output files total ~100MB

02_build_index_minimal_memory.sh

Description

Builds HISAT2 genome index with minimal memory usage, optimized for systems with limited RAM.

Usage

bash scripts/02_build_index_minimal_memory.sh

Arguments

None

Input Files

~/rnaseq_workspace/reference/bTaeGut7.mat.fa - Reference genome FASTA
~/rnaseq_workspace/reference/bTaeGut7.mat.gtf - Gene annotations GTF

Output Files

~/rnaseq_workspace/reference/hisat2_index/*.ht2 - 8 HISAT2 index files
~/rnaseq_workspace/reference/hisat2_index/splice_sites.txt - Splice site coordinates
~/rnaseq_workspace/reference/hisat2_index/exons.txt - Exon coordinates
reference/hisat2_index/ - Copy in project directory

Dependencies

HISAT2 v2.2.1
Python (for GTF parsing scripts)
~8GB RAM minimum

Details

Uses 2 threads to minimize memory footprint while building a splice-aware index. Extracts splice sites and exons from GTF file for accurate RNA-seq alignment. Works entirely on Linux filesystem to avoid I/O bottlenecks.

Environment Variables

THREADS=2 - Number of threads (hardcoded for memory efficiency)

Examples

# Build index with minimal memory
bash scripts/02_build_index_minimal_memory.sh
# Check index files
ls -lh ~/rnaseq_workspace/reference/hisat2_index/

Notes

Processing time: 30-60 minutes
Memory usage: ~4-6GB peak
Index size: ~2.2GB total

03_check_read_length.sh

Description

Verifies the actual read length of FASTQ files by sampling sequences.

Usage

bash scripts/03_check_read_length.sh

Arguments

None

Input Files

raw_data/*.fastq.gz - Raw FASTQ files to check

Output

Console output showing read length distribution for each file

Dependencies

Standard Unix tools (zcat, awk, sort, uniq)

Details

Samples first 1000 reads from each FASTQ file and reports the length distribution. Essential for verifying expected read length before alignment parameter optimization.

Examples

bash scripts/03_check_read_length.sh
# Expected output:
# Checking read length for 12590-KS-1_S1_L005_R1_001.fastq.gz:
#    1000 151

Notes

Quick check: <1 minute total
Confirms 151bp read length for parameter settings

04_trim_reads_linux_fs.sh

Description

Performs adapter trimming and quality filtering using fastp with all I/O on Linux filesystem for maximum performance.

Usage

bash scripts/04_trim_reads_linux_fs.sh

Arguments

None

Input Files

~/rnaseq_workspace/raw_data/*_R1_001.fastq.gz - Forward reads
~/rnaseq_workspace/raw_data/*_R2_001.fastq.gz - Reverse reads

Output Files

~/rnaseq_workspace/trimmed_data/*_R[12]_trimmed.fastq.gz - Trimmed reads
~/rnaseq_workspace/qc/fastp_reports/*.html - Fastp QC reports
~/rnaseq_workspace/qc/fastp_reports/*.json - Fastp statistics
trimmed_data/ - Copy in project directory

Dependencies

fastp v0.23.4
~10GB free disk space

Parameters

--qualified_quality_phred 20 - Minimum base quality
--length_required 36 - Minimum read length
--detect_adapter_for_pe - Auto-detect adapters
--thread 8 - Processing threads

Details

Performs quality-based trimming, adapter removal, and filtering. Automatically detects and removes Illumina adapters. Generates detailed QC reports including insert size distribution.

Examples

# Run trimming on Linux filesystem
bash scripts/04_trim_reads_linux_fs.sh
# Check trimming statistics
grep "reads passed filter" ~/rnaseq_workspace/logs/*_fastp.log

Notes

Processing time: ~6.5 minutes total
~98% read retention rate
Output size: ~34GB

05_run_fastqc_trimmed.sh

Description

Performs quality control analysis on trimmed reads to assess improvement after processing.

Usage

bash scripts/05_run_fastqc_trimmed.sh

Arguments

None

Input Files

trimmed_data/*_trimmed.fastq.gz - Trimmed FASTQ files

Output Files

results/qc/fastqc_trimmed/*_fastqc.html - FastQC reports
results/qc/fastqc_trimmed/*_fastqc.zip - FastQC data
results/qc/multiqc_trimmed/*_multiqc_report.html - MultiQC summary

Dependencies

FastQC v0.12.1
MultiQC v1.14

Details

Identical analysis to 01_run_fastqc.sh but on trimmed reads. Enables direct comparison of quality metrics before and after trimming.

Examples

bash scripts/05_run_fastqc_trimmed.sh
# Compare with raw QC
diff results/qc/multiqc_raw/multiqc_data/multiqc_general_stats.txt \
     results/qc/multiqc_trimmed/*/multiqc_general_stats.txt

Notes

Verifies adapter removal success
Confirms quality score improvements

06_align_reads_hisat2_parallel_verbose.sh

Description

Performs splice-aware alignment of RNA-seq reads using HISAT2 with parallel processing and progress monitoring.

Usage

bash scripts/06_align_reads_hisat2_parallel_verbose.sh

Arguments

None

Input Files

~/rnaseq_workspace/trimmed_data/*_R[12]_trimmed.fastq.gz - Trimmed reads
~/rnaseq_workspace/reference/hisat2_index/bTaeGut7.mat - HISAT2 index

Output Files

~/rnaseq_workspace/alignments/*_aligned_sorted.bam - Sorted alignments
~/rnaseq_workspace/alignments/*_aligned_sorted.bam.bai - BAM indices
~/rnaseq_workspace/logs/alignment/*_alignment_summary.txt - HISAT2 statistics
~/rnaseq_workspace/logs/alignment/*_flagstat.txt - Samtools statistics
alignments/ and logs/alignment/ - Copies in project directory

Dependencies

HISAT2 v2.2.1
Samtools v1.20
~40GB free disk space

Parameters

-p 8 - Threads per sample
--dta - Downstream transcript assembly mode
--rna-strandness RF - Reverse stranded library
--maxins 800 - Maximum insert size
--no-softclip - Disable soft clipping

Details

Processes all three samples in parallel with real-time progress monitoring. Pipes HISAT2 output directly to samtools for sorting, avoiding large intermediate SAM files. Includes automatic indexing and comprehensive statistics generation.

Environment Variables

INDEX_PATH - Location of HISAT2 index (auto-detected)

Examples

# Run parallel alignment with progress monitoring
bash scripts/06_align_reads_hisat2_parallel_verbose.sh
# Check alignment rates
grep "overall alignment rate" logs/alignment/*_alignment_summary.txt

Notes

Processing time: ~42 minutes (parallel)
Memory usage: ~27GB total for 3 samples
Alignment rates: 79-81%

07_count_features.sh

Description

Quantifies gene expression by counting reads overlapping genomic features using featureCounts.

Usage

bash scripts/07_count_features.sh

Arguments

None

Input Files

alignments/*_aligned_sorted.bam - Sorted BAM files
reference/bTaeGut7.mat.gtf - Gene annotations

Output Files

counts/gene_counts.txt - Full count matrix with annotations
counts/gene_counts_matrix_named.txt - Simplified count matrix
counts/gene_counts.txt.summary - Assignment statistics
counts/library_statistics.txt - Sample-level statistics
counts/top_expressed_genes.txt - Highest expression genes
counts/count_summary_report.txt - Complete summary

Dependencies

featureCounts v2.0.6 (Subread)
R with base packages

Parameters

-t exon - Count exon features
-g gene_id - Group by gene
-p - Count fragments (paired-end)
-B - Both ends mapped
-C - Exclude chimeric reads
-s 2 - Reverse stranded
-T 8 - Number of threads

Details

Performs strand-specific quantification at gene level. Includes automatic calculation of library sizes, gene detection rates, and identification of highly expressed genes. Generates both full annotation matrix and simplified count matrix for downstream analysis.

Examples

bash scripts/07_count_features.sh
# Check assignment rate
grep "Assigned" counts/gene_counts.txt.summary
# View top genes
head counts/top_expressed_genes.txt

Notes

Processing time: 5-10 minutes
Assignment rate: 60-65%
Output matrix: 20,040 genes × 3 samples

08_run_deseq2.sh / 08_deseq2_analysis.R

Description

Performs differential expression analysis between brain regions using DESeq2, adapted for single-replicate design.

Usage

bash scripts/08_run_deseq2.sh

Arguments

None (R script arguments handled by wrapper)

Input Files

counts/gene_counts_matrix_named.txt - Gene count matrix
Gene annotations (internal to R)

Output Files

results/differential_expression/normalized_counts.txt - Size-factor normalized counts
results/differential_expression/DE_results_*.txt - Full comparison results
results/differential_expression/DE_2fold_*.txt - Genes with >2-fold change
results/differential_expression/analysis_summary.txt - Summary statistics
results/plots/PCA_brain_regions.pdf - PCA plot
results/plots/volcano_*.pdf - Volcano plots
results/plots/sample_distance_heatmap.pdf - Sample clustering
results/plots/top50_variable_genes_heatmap.pdf - Variable gene heatmap

Dependencies

R v4.3.3
DESeq2 v1.42.0
ggplot2, pheatmap, tidyverse, EnhancedVolcano

Parameters

Filter: ≥10 reads in ≥2 samples
Fold change threshold: |log2FC| > 1
No p-values (single replicates)

Details

Modified DESeq2 workflow for unreplicated data. Uses size factor normalization and variance stabilizing transformation for visualization. Performs all pairwise comparisons and identifies differentially expressed genes based on fold change criteria.

R Functions

perform_comparison() - Pairwise fold change analysis
create_single_scale_plot() - Visualization functions

Examples

# Run differential expression analysis
bash scripts/08_run_deseq2.sh
# View summary
cat results/differential_expression/analysis_summary.txt
# Check specific comparison
head results/differential_expression/DE_2fold_Area_X_vs_RA.txt

Notes

No statistical significance testing
Fold change threshold: 2-fold
~3,500 genes differ between regions

09_run_gene_categories.sh / 09_extract_gene_categories.R

Description

Extracts and analyzes expression patterns of categorized genes (housekeeping and song system genes).

Usage

bash scripts/09_run_gene_categories.sh

Arguments

None

Input Files

results/differential_expression/gene_categories_for_plotting.csv - Gene categories
results/differential_expression/normalized_counts.txt - Expression data

Output Files

results/differential_expression/categorized_gene_expression.csv - Combined data
results/plots/housekeeping_gene_stability.pdf - Stability analysis
results/plots/gene_categories_heatmap.pdf - Expression heatmap
results/plots/category_expression_comparison.pdf - Category comparison
results/plots/key_genes_expression.pdf - Selected gene plots

Dependencies

R with tidyverse, ggplot2, pheatmap

Gene Categories

Housekeeping: Olias2014 (PGK1, RPS7, TFRC, YWHAZ) and ZinzowKramer2014 (HPRT1, RPL4, PPIA)
Song System: Transcription factors, immediate early genes, neurotrophic factors, receptors, etc.

Details

Analyzes stability of housekeeping genes across brain regions and expression patterns of song system genes. Calculates coefficient of variation for stability assessment and generates multiple visualization formats.

Examples

bash scripts/09_run_gene_categories.sh
# Most stable housekeeping gene will be shown in console output

Notes

Identifies region-specific marker genes
Validates housekeeping gene assumptions

10_run_separate_plots.sh / 10_plot_gene_categories_separately.R

Description

Creates publication-quality expression plots for gene categories with multiple visualization options.

Usage

bash scripts/10_run_separate_plots.sh

Arguments

None

Input Files

results/differential_expression/gene_categories_for_plotting.csv - Gene list
results/differential_expression/normalized_counts.txt - Expression values

Output Files

Multi-panel plots (alphabetical):

results/plots/song_system_genes_expression_separated.pdf
results/plots/housekeeping_genes_expression.pdf

Single-scale plots (by expression):

results/plots/song_system_genes_single_scale.pdf
results/plots/housekeeping_genes_single_scale.pdf
results/plots/song_system_genes_log_scale.pdf
results/plots/housekeeping_genes_log_scale.pdf

Subcategory plots:

results/plots/song_system_*_genes.pdf

Dependencies

R with ggplot2, tidyverse

Plot Types

Multi-panel: Each gene in separate panel, alphabetical order
Single-scale: All genes on one plot, ordered by expression
Log-scale: Same as single-scale with log10 y-axis
Subcategory: Grouped by functional annotation

Color Scheme

Area X: #E69F00 (orange)
RA: #56B4E9 (sky blue)
HVC: #009E73 (teal)

Details

Generates multiple visualization formats optimized for different purposes. Multi-panel plots show individual patterns, single-scale enables direct comparison, log-scale handles wide expression ranges. All plots use colorblind-friendly palettes.

Examples

bash scripts/10_run_separate_plots.sh
# View expression ordered by level
evince results/plots/song_system_genes_single_scale.pdf

Notes

Horizontal bar orientation for readability
Dynamic plot sizing based on gene count
Consistent color scheme across all plots

11_run_gene_list_prep.sh / 11_prepare_gene_lists_for_enrichment.R

Description

Prepares gene lists for functional enrichment analysis by identifying region-enriched genes based on fold-change criteria.

Usage

bash scripts/11_run_gene_list_prep.sh

Arguments

None

Input Files

results/differential_expression/DE_results_Area_X_vs_RA.txt
results/differential_expression/DE_results_Area_X_vs_HVC.txt
results/differential_expression/DE_results_RA_vs_HVC.txt

Output Files

Region-enriched gene lists:

results/enrichment_analysis/area_x_enriched_genes.txt - Genes higher in Area X vs both RA and HVC
results/enrichment_analysis/ra_enriched_genes.txt - Genes higher in RA vs both Area X and HVC
results/enrichment_analysis/hvc_enriched_genes.txt - Genes higher in HVC vs both Area X and RA
results/enrichment_analysis/background_genes.txt - All tested genes (for statistical background)

Full tables with fold changes:

results/enrichment_analysis/*_enriched_genes_full.txt - Includes fold change values

Pairwise comparisons:

results/enrichment_analysis/area_x_up_vs_ra_genes.txt
results/enrichment_analysis/area_x_up_vs_hvc_genes.txt
results/enrichment_analysis/ra_up_vs_hvc_genes.txt

Dependencies

R with tidyverse

Parameters

FC_THRESHOLD = 1 - Log2 fold change threshold (2-fold linear)

Details

Identifies genes specifically enriched in each brain region using a stringent criterion: a gene must be >2-fold higher in one region compared to BOTH other regions. Also extracts genes from pairwise comparisons for broader analysis. Outputs include both simple gene lists and detailed tables with fold change information.

Examples

bash scripts/11_run_gene_list_prep.sh
# Check number of enriched genes
wc -l results/enrichment_analysis/*_enriched_genes.txt

Notes

Region-enriched lists are more stringent than pairwise comparisons
Background gene list used for enrichment statistics
Summary statistics printed to console

12_run_enrichment_analysis.sh / 12_functional_enrichment_analysis.R

Description

Maps zebra finch genes to human orthologs and performs functional enrichment analysis using GO terms and KEGG pathways.

Usage

bash scripts/12_run_enrichment_analysis.sh

Arguments

None

Input Files

results/enrichment_analysis/area_x_enriched_genes.txt
results/enrichment_analysis/ra_enriched_genes.txt
results/enrichment_analysis/hvc_enriched_genes.txt
results/enrichment_analysis/background_genes.txt

Output Files

Ortholog mappings:

results/enrichment_analysis/*_human_orthologs.txt - Zebra finch to human gene mappings

GO enrichment results:

results/enrichment_analysis/GO_results/*_GO_BP.txt - Biological Process
results/enrichment_analysis/GO_results/*_GO_CC.txt - Cellular Component
results/enrichment_analysis/GO_results/*_GO_MF.txt - Molecular Function

KEGG results:

results/enrichment_analysis/KEGG_results/*_KEGG.txt - Pathway enrichment

Visualizations:

results/plots/enrichment/*_GO_BP_barplot.pdf - Top 20 GO terms
results/plots/enrichment/*_GO_BP_dotplot.pdf - GO enrichment dot plots
results/plots/enrichment/*_KEGG_dotplot.pdf - KEGG pathway plots

Summary:

results/enrichment_analysis/enrichment_summary.txt - Top enriched terms for all regions

Dependencies

R packages: biomaRt, clusterProfiler, org.Hs.eg.db, enrichplot, ggplot2, tidyverse
Internet connection (for BioMart queries)

Parameters

pvalueCutoff = 0.05 - Enrichment p-value threshold
qvalueCutoff = 0.2 - FDR threshold
Ortholog filters: one-to-one and one-to-many only

Details

Connects to Ensembl BioMart to retrieve human orthologs for zebra finch genes. Only high-confidence orthologs are retained. Performs hypergeometric testing for GO term and KEGG pathway enrichment using the clusterProfiler package. Background gene set is used for proper statistical testing.

R Functions

map_to_human_orthologs() - BioMart ortholog mapping
perform_enrichment() - GO and KEGG analysis
create_enrichment_plots() - Visualization generation

Examples

# Check if required packages are installed first
bash scripts/12_run_enrichment_analysis.sh
# If packages missing, install with:
Rscript scripts/install_enrichment_packages.R
# Then run analysis
bash scripts/12_run_enrichment_analysis.sh

Notes

BioMart queries can take several minutes
Not all zebra finch genes have human orthologs
Enrichment p-values are valid despite single replicates
Internet required for Ensembl database access

General Pipeline Notes

Environment Setup

All scripts require activation of the conda environment:

conda activate rnaseq-zf

File Organization

Scripts: scripts/
Raw data: raw_data/ (symlinks)
Linux FS workspace: ~/rnaseq_workspace/
Results: results/, counts/, alignments/

Performance Optimization

Linux filesystem used for I/O intensive operations
Parallel processing where applicable
Memory-efficient parameter choices

Error Handling

All scripts use set -euo pipefail for robust error handling and will exit on first error.

Logging

Most scripts generate detailed logs in logs/ subdirectories for troubleshooting.

Uh oh!

FilesExpand file tree

SCRIPT_DOCUMENTATION.md

Latest commit

History

SCRIPT_DOCUMENTATION.md

File metadata and controls

RNA-seq Analysis Pipeline Script Documentation

Table of Contents

01_run_fastqc.sh

Description

Usage

Arguments

Input Files

Output Files

Dependencies

Details

Examples

Notes

02_build_index_minimal_memory.sh

Description

Usage

Arguments

Input Files

Output Files

Dependencies

Details

Environment Variables

Examples

Notes

03_check_read_length.sh

Description

Usage

Arguments

Input Files

Output

Dependencies

Details

Examples

Notes

04_trim_reads_linux_fs.sh

Description

Usage

Arguments

Input Files

Output Files

Dependencies

Parameters

Details

Examples

Notes

05_run_fastqc_trimmed.sh

Description

Usage

Arguments

Input Files

Output Files

Dependencies

Details

Examples

Notes

06_align_reads_hisat2_parallel_verbose.sh

Description

Usage

Arguments

Input Files

Output Files

Dependencies

Parameters

Details

Environment Variables

Examples

Notes

07_count_features.sh

Description

Usage

Arguments

Input Files

Output Files

Dependencies