- 01_run_fastqc.sh
- 02_build_index_minimal_memory.sh
- 03_check_read_length.sh
- 04_trim_reads_linux_fs.sh
- 05_run_fastqc_trimmed.sh
- 06_align_reads_hisat2_parallel_verbose.sh
- 07_count_features.sh
- 08_run_deseq2.sh / 08_deseq2_analysis.R
- 09_run_gene_categories.sh / 09_extract_gene_categories.R
- 10_run_separate_plots.sh / 10_plot_gene_categories_separately.R
- 11_run_gene_list_prep.sh / 11_prepare_gene_lists_for_enrichment.R
- 12_run_enrichment_analysis.sh / 12_functional_enrichment_analysis.R
Performs quality control analysis on raw sequencing reads using FastQC and generates an aggregate report with MultiQC.
bash scripts/01_run_fastqc.shNone
raw_data/*.fastq.gz- Raw paired-end FASTQ files (symlinks to original data)
results/qc/fastqc_raw/*_fastqc.html- Individual FastQC HTML reportsresults/qc/fastqc_raw/*_fastqc.zip- FastQC data filesresults/qc/multiqc_raw/multiqc_report.html- Aggregated MultiQC report
- FastQC v0.12.1
- MultiQC v1.14
- conda environment: rnaseq-zf
The script processes all FASTQ files in the raw_data directory using 8 threads per file. FastQC analyzes sequence quality, GC content, adapter contamination, overrepresented sequences, and other quality metrics. MultiQC aggregates all FastQC reports into a single interactive HTML report.
# Run quality control on raw reads
bash scripts/01_run_fastqc.sh
# View results in web browser
firefox results/qc/multiqc_raw/multiqc_report.html- Processing time: ~5-10 minutes for 6 files
- Memory usage: ~2GB per file
- Output files total ~100MB
Builds HISAT2 genome index with minimal memory usage, optimized for systems with limited RAM.
bash scripts/02_build_index_minimal_memory.shNone
~/rnaseq_workspace/reference/bTaeGut7.mat.fa- Reference genome FASTA~/rnaseq_workspace/reference/bTaeGut7.mat.gtf- Gene annotations GTF
~/rnaseq_workspace/reference/hisat2_index/*.ht2- 8 HISAT2 index files~/rnaseq_workspace/reference/hisat2_index/splice_sites.txt- Splice site coordinates~/rnaseq_workspace/reference/hisat2_index/exons.txt- Exon coordinatesreference/hisat2_index/- Copy in project directory
- HISAT2 v2.2.1
- Python (for GTF parsing scripts)
- ~8GB RAM minimum
Uses 2 threads to minimize memory footprint while building a splice-aware index. Extracts splice sites and exons from GTF file for accurate RNA-seq alignment. Works entirely on Linux filesystem to avoid I/O bottlenecks.
THREADS=2- Number of threads (hardcoded for memory efficiency)
# Build index with minimal memory
bash scripts/02_build_index_minimal_memory.sh
# Check index files
ls -lh ~/rnaseq_workspace/reference/hisat2_index/- Processing time: 30-60 minutes
- Memory usage: ~4-6GB peak
- Index size: ~2.2GB total
Verifies the actual read length of FASTQ files by sampling sequences.
bash scripts/03_check_read_length.shNone
raw_data/*.fastq.gz- Raw FASTQ files to check
Console output showing read length distribution for each file
- Standard Unix tools (zcat, awk, sort, uniq)
Samples first 1000 reads from each FASTQ file and reports the length distribution. Essential for verifying expected read length before alignment parameter optimization.
bash scripts/03_check_read_length.sh
# Expected output:
# Checking read length for 12590-KS-1_S1_L005_R1_001.fastq.gz:
# 1000 151- Quick check: <1 minute total
- Confirms 151bp read length for parameter settings
Performs adapter trimming and quality filtering using fastp with all I/O on Linux filesystem for maximum performance.
bash scripts/04_trim_reads_linux_fs.shNone
~/rnaseq_workspace/raw_data/*_R1_001.fastq.gz- Forward reads~/rnaseq_workspace/raw_data/*_R2_001.fastq.gz- Reverse reads
~/rnaseq_workspace/trimmed_data/*_R[12]_trimmed.fastq.gz- Trimmed reads~/rnaseq_workspace/qc/fastp_reports/*.html- Fastp QC reports~/rnaseq_workspace/qc/fastp_reports/*.json- Fastp statisticstrimmed_data/- Copy in project directory
- fastp v0.23.4
- ~10GB free disk space
--qualified_quality_phred 20- Minimum base quality--length_required 36- Minimum read length--detect_adapter_for_pe- Auto-detect adapters--thread 8- Processing threads
Performs quality-based trimming, adapter removal, and filtering. Automatically detects and removes Illumina adapters. Generates detailed QC reports including insert size distribution.
# Run trimming on Linux filesystem
bash scripts/04_trim_reads_linux_fs.sh
# Check trimming statistics
grep "reads passed filter" ~/rnaseq_workspace/logs/*_fastp.log- Processing time: ~6.5 minutes total
- ~98% read retention rate
- Output size: ~34GB
Performs quality control analysis on trimmed reads to assess improvement after processing.
bash scripts/05_run_fastqc_trimmed.shNone
trimmed_data/*_trimmed.fastq.gz- Trimmed FASTQ files
results/qc/fastqc_trimmed/*_fastqc.html- FastQC reportsresults/qc/fastqc_trimmed/*_fastqc.zip- FastQC dataresults/qc/multiqc_trimmed/*_multiqc_report.html- MultiQC summary
- FastQC v0.12.1
- MultiQC v1.14
Identical analysis to 01_run_fastqc.sh but on trimmed reads. Enables direct comparison of quality metrics before and after trimming.
bash scripts/05_run_fastqc_trimmed.sh
# Compare with raw QC
diff results/qc/multiqc_raw/multiqc_data/multiqc_general_stats.txt \
results/qc/multiqc_trimmed/*/multiqc_general_stats.txt- Verifies adapter removal success
- Confirms quality score improvements
Performs splice-aware alignment of RNA-seq reads using HISAT2 with parallel processing and progress monitoring.
bash scripts/06_align_reads_hisat2_parallel_verbose.shNone
~/rnaseq_workspace/trimmed_data/*_R[12]_trimmed.fastq.gz- Trimmed reads~/rnaseq_workspace/reference/hisat2_index/bTaeGut7.mat- HISAT2 index
~/rnaseq_workspace/alignments/*_aligned_sorted.bam- Sorted alignments~/rnaseq_workspace/alignments/*_aligned_sorted.bam.bai- BAM indices~/rnaseq_workspace/logs/alignment/*_alignment_summary.txt- HISAT2 statistics~/rnaseq_workspace/logs/alignment/*_flagstat.txt- Samtools statisticsalignments/andlogs/alignment/- Copies in project directory
- HISAT2 v2.2.1
- Samtools v1.20
- ~40GB free disk space
-p 8- Threads per sample--dta- Downstream transcript assembly mode--rna-strandness RF- Reverse stranded library--maxins 800- Maximum insert size--no-softclip- Disable soft clipping
Processes all three samples in parallel with real-time progress monitoring. Pipes HISAT2 output directly to samtools for sorting, avoiding large intermediate SAM files. Includes automatic indexing and comprehensive statistics generation.
INDEX_PATH- Location of HISAT2 index (auto-detected)
# Run parallel alignment with progress monitoring
bash scripts/06_align_reads_hisat2_parallel_verbose.sh
# Check alignment rates
grep "overall alignment rate" logs/alignment/*_alignment_summary.txt- Processing time: ~42 minutes (parallel)
- Memory usage: ~27GB total for 3 samples
- Alignment rates: 79-81%
Quantifies gene expression by counting reads overlapping genomic features using featureCounts.
bash scripts/07_count_features.shNone
alignments/*_aligned_sorted.bam- Sorted BAM filesreference/bTaeGut7.mat.gtf- Gene annotations
counts/gene_counts.txt- Full count matrix with annotationscounts/gene_counts_matrix_named.txt- Simplified count matrixcounts/gene_counts.txt.summary- Assignment statisticscounts/library_statistics.txt- Sample-level statisticscounts/top_expressed_genes.txt- Highest expression genescounts/count_summary_report.txt- Complete summary
- featureCounts v2.0.6 (Subread)
- R with base packages
-t exon- Count exon features-g gene_id- Group by gene-p- Count fragments (paired-end)-B- Both ends mapped-C- Exclude chimeric reads-s 2- Reverse stranded-T 8- Number of threads
Performs strand-specific quantification at gene level. Includes automatic calculation of library sizes, gene detection rates, and identification of highly expressed genes. Generates both full annotation matrix and simplified count matrix for downstream analysis.
bash scripts/07_count_features.sh
# Check assignment rate
grep "Assigned" counts/gene_counts.txt.summary
# View top genes
head counts/top_expressed_genes.txt- Processing time: 5-10 minutes
- Assignment rate: 60-65%
- Output matrix: 20,040 genes × 3 samples
Performs differential expression analysis between brain regions using DESeq2, adapted for single-replicate design.
bash scripts/08_run_deseq2.shNone (R script arguments handled by wrapper)
counts/gene_counts_matrix_named.txt- Gene count matrix- Gene annotations (internal to R)
results/differential_expression/normalized_counts.txt- Size-factor normalized countsresults/differential_expression/DE_results_*.txt- Full comparison resultsresults/differential_expression/DE_2fold_*.txt- Genes with >2-fold changeresults/differential_expression/analysis_summary.txt- Summary statisticsresults/plots/PCA_brain_regions.pdf- PCA plotresults/plots/volcano_*.pdf- Volcano plotsresults/plots/sample_distance_heatmap.pdf- Sample clusteringresults/plots/top50_variable_genes_heatmap.pdf- Variable gene heatmap
- R v4.3.3
- DESeq2 v1.42.0
- ggplot2, pheatmap, tidyverse, EnhancedVolcano
- Filter: ≥10 reads in ≥2 samples
- Fold change threshold: |log2FC| > 1
- No p-values (single replicates)
Modified DESeq2 workflow for unreplicated data. Uses size factor normalization and variance stabilizing transformation for visualization. Performs all pairwise comparisons and identifies differentially expressed genes based on fold change criteria.
perform_comparison()- Pairwise fold change analysiscreate_single_scale_plot()- Visualization functions
# Run differential expression analysis
bash scripts/08_run_deseq2.sh
# View summary
cat results/differential_expression/analysis_summary.txt
# Check specific comparison
head results/differential_expression/DE_2fold_Area_X_vs_RA.txt- No statistical significance testing
- Fold change threshold: 2-fold
- ~3,500 genes differ between regions
Extracts and analyzes expression patterns of categorized genes (housekeeping and song system genes).
bash scripts/09_run_gene_categories.shNone
results/differential_expression/gene_categories_for_plotting.csv- Gene categoriesresults/differential_expression/normalized_counts.txt- Expression data
results/differential_expression/categorized_gene_expression.csv- Combined dataresults/plots/housekeeping_gene_stability.pdf- Stability analysisresults/plots/gene_categories_heatmap.pdf- Expression heatmapresults/plots/category_expression_comparison.pdf- Category comparisonresults/plots/key_genes_expression.pdf- Selected gene plots
- R with tidyverse, ggplot2, pheatmap
- Housekeeping: Olias2014 (PGK1, RPS7, TFRC, YWHAZ) and ZinzowKramer2014 (HPRT1, RPL4, PPIA)
- Song System: Transcription factors, immediate early genes, neurotrophic factors, receptors, etc.
Analyzes stability of housekeeping genes across brain regions and expression patterns of song system genes. Calculates coefficient of variation for stability assessment and generates multiple visualization formats.
bash scripts/09_run_gene_categories.sh
# Most stable housekeeping gene will be shown in console output- Identifies region-specific marker genes
- Validates housekeeping gene assumptions
Creates publication-quality expression plots for gene categories with multiple visualization options.
bash scripts/10_run_separate_plots.shNone
results/differential_expression/gene_categories_for_plotting.csv- Gene listresults/differential_expression/normalized_counts.txt- Expression values
Multi-panel plots (alphabetical):
results/plots/song_system_genes_expression_separated.pdfresults/plots/housekeeping_genes_expression.pdf
Single-scale plots (by expression):
results/plots/song_system_genes_single_scale.pdfresults/plots/housekeeping_genes_single_scale.pdfresults/plots/song_system_genes_log_scale.pdfresults/plots/housekeeping_genes_log_scale.pdf
Subcategory plots:
results/plots/song_system_*_genes.pdf
- R with ggplot2, tidyverse
- Multi-panel: Each gene in separate panel, alphabetical order
- Single-scale: All genes on one plot, ordered by expression
- Log-scale: Same as single-scale with log10 y-axis
- Subcategory: Grouped by functional annotation
- Area X: #E69F00 (orange)
- RA: #56B4E9 (sky blue)
- HVC: #009E73 (teal)
Generates multiple visualization formats optimized for different purposes. Multi-panel plots show individual patterns, single-scale enables direct comparison, log-scale handles wide expression ranges. All plots use colorblind-friendly palettes.
bash scripts/10_run_separate_plots.sh
# View expression ordered by level
evince results/plots/song_system_genes_single_scale.pdf- Horizontal bar orientation for readability
- Dynamic plot sizing based on gene count
- Consistent color scheme across all plots
Prepares gene lists for functional enrichment analysis by identifying region-enriched genes based on fold-change criteria.
bash scripts/11_run_gene_list_prep.shNone
results/differential_expression/DE_results_Area_X_vs_RA.txtresults/differential_expression/DE_results_Area_X_vs_HVC.txtresults/differential_expression/DE_results_RA_vs_HVC.txt
Region-enriched gene lists:
results/enrichment_analysis/area_x_enriched_genes.txt- Genes higher in Area X vs both RA and HVCresults/enrichment_analysis/ra_enriched_genes.txt- Genes higher in RA vs both Area X and HVCresults/enrichment_analysis/hvc_enriched_genes.txt- Genes higher in HVC vs both Area X and RAresults/enrichment_analysis/background_genes.txt- All tested genes (for statistical background)
Full tables with fold changes:
results/enrichment_analysis/*_enriched_genes_full.txt- Includes fold change values
Pairwise comparisons:
results/enrichment_analysis/area_x_up_vs_ra_genes.txtresults/enrichment_analysis/area_x_up_vs_hvc_genes.txtresults/enrichment_analysis/ra_up_vs_hvc_genes.txt
- R with tidyverse
FC_THRESHOLD = 1- Log2 fold change threshold (2-fold linear)
Identifies genes specifically enriched in each brain region using a stringent criterion: a gene must be >2-fold higher in one region compared to BOTH other regions. Also extracts genes from pairwise comparisons for broader analysis. Outputs include both simple gene lists and detailed tables with fold change information.
bash scripts/11_run_gene_list_prep.sh
# Check number of enriched genes
wc -l results/enrichment_analysis/*_enriched_genes.txt- Region-enriched lists are more stringent than pairwise comparisons
- Background gene list used for enrichment statistics
- Summary statistics printed to console
Maps zebra finch genes to human orthologs and performs functional enrichment analysis using GO terms and KEGG pathways.
bash scripts/12_run_enrichment_analysis.shNone
results/enrichment_analysis/area_x_enriched_genes.txtresults/enrichment_analysis/ra_enriched_genes.txtresults/enrichment_analysis/hvc_enriched_genes.txtresults/enrichment_analysis/background_genes.txt
Ortholog mappings:
results/enrichment_analysis/*_human_orthologs.txt- Zebra finch to human gene mappings
GO enrichment results:
results/enrichment_analysis/GO_results/*_GO_BP.txt- Biological Processresults/enrichment_analysis/GO_results/*_GO_CC.txt- Cellular Componentresults/enrichment_analysis/GO_results/*_GO_MF.txt- Molecular Function
KEGG results:
results/enrichment_analysis/KEGG_results/*_KEGG.txt- Pathway enrichment
Visualizations:
results/plots/enrichment/*_GO_BP_barplot.pdf- Top 20 GO termsresults/plots/enrichment/*_GO_BP_dotplot.pdf- GO enrichment dot plotsresults/plots/enrichment/*_KEGG_dotplot.pdf- KEGG pathway plots
Summary:
results/enrichment_analysis/enrichment_summary.txt- Top enriched terms for all regions
- R packages: biomaRt, clusterProfiler, org.Hs.eg.db, enrichplot, ggplot2, tidyverse
- Internet connection (for BioMart queries)
pvalueCutoff = 0.05- Enrichment p-value thresholdqvalueCutoff = 0.2- FDR threshold- Ortholog filters: one-to-one and one-to-many only
Connects to Ensembl BioMart to retrieve human orthologs for zebra finch genes. Only high-confidence orthologs are retained. Performs hypergeometric testing for GO term and KEGG pathway enrichment using the clusterProfiler package. Background gene set is used for proper statistical testing.
map_to_human_orthologs()- BioMart ortholog mappingperform_enrichment()- GO and KEGG analysiscreate_enrichment_plots()- Visualization generation
# Check if required packages are installed first
bash scripts/12_run_enrichment_analysis.sh
# If packages missing, install with:
Rscript scripts/install_enrichment_packages.R
# Then run analysis
bash scripts/12_run_enrichment_analysis.sh- BioMart queries can take several minutes
- Not all zebra finch genes have human orthologs
- Enrichment p-values are valid despite single replicates
- Internet required for Ensembl database access
All scripts require activation of the conda environment:
conda activate rnaseq-zf- Scripts:
scripts/ - Raw data:
raw_data/(symlinks) - Linux FS workspace:
~/rnaseq_workspace/ - Results:
results/,counts/,alignments/
- Linux filesystem used for I/O intensive operations
- Parallel processing where applicable
- Memory-efficient parameter choices
All scripts use set -euo pipefail for robust error handling and will exit on first error.
Most scripts generate detailed logs in logs/ subdirectories for troubleshooting.