Skip to content

CreanzaLab/RNAseq_ZF_pipeline

Repository files navigation

Zebra Finch Song System RNA-seq Analysis

Overview

This repository contains the complete analysis pipeline for RNA-sequencing data from three song system nuclei of adult male zebra finch (Taeniopygia guttata):

  • Area X - Basal ganglia nucleus involved in song learning
  • RA (Robust nucleus of the arcopallium) - Motor control of song
  • HVC - Sensorimotor integration

Current Data Description

  • Organism: Zebra finch (ID: or163), adult male
  • Sequencing: Illumina paired-end, 150bp reads
  • Samples:
    • 12590-KS-1: Area X (94.7M reads)
    • 12590-KS-2: RA (102.3M reads)
    • 12590-KS-3: HVC (91.7M reads)
  • Reference Genome: bTaeGut7.mat (maternal genome assembly)

Pipeline Overview

1. Quality Control

  • FastQC analysis of raw reads (01_run_fastqc.sh)
  • Adapter trimming and quality filtering with fastp (04_trim_reads_linux_fs.sh)
  • Post-trimming quality assessment (05_run_fastqc_trimmed.sh)
  • Outputs: results/qc/fastqc_raw/, results/qc/fastqc_trimmed/, results/qc/fastp_reports/

2. Alignment

  • HISAT2 splice-aware alignment to reference genome (06_align_reads_hisat2_parallel_verbose.sh)
  • Optimized for parallel processing on WSL2
  • Outputs: alignments/*_aligned_sorted.bam, logs/alignment/

3. Quantification

  • featureCounts for gene-level expression (07_count_features.sh)
  • Generation of count matrix with 60-65% assignment rate
  • Outputs: counts/gene_counts_matrix_named.txt, counts/library_statistics.txt

4. Differential Expression

  • DESeq2 analysis for pairwise comparisons (08_run_deseq2.sh)
  • Fold-change based analysis (single replicates)
  • Gene category analysis (09_run_gene_categories.sh, 10_run_separate_plots.sh)
  • Outputs: results/differential_expression/, results/plots/

Repository Structure

.
├── scripts/                 # Analysis scripts (see SCRIPT_DOCUMENTATION.md for details)
│   ├── 01_run_fastqc.sh
│   ├── 02_build_index_minimal_memory.sh
│   ├── 03_check_read_length.sh
│   ├── 04_trim_reads_linux_fs.sh
│   ├── 05_run_fastqc_trimmed.sh
│   ├── 06_align_reads_hisat2_parallel_verbose.sh
│   ├── 07_count_features.sh
│   ├── 08_run_deseq2.sh & 08_deseq2_analysis.R
│   ├── 09_run_gene_categories.sh & 09_extract_gene_categories.R
│   └── 10_run_separate_plots.sh & 10_plot_gene_categories_separately.R
├── reference/              # Reference genome and annotations
│   ├── bTaeGut7.mat.gtf
│   └── hisat2_index/
├── results/
│   ├── qc/                # Quality control reports
│   ├── differential_expression/  # DE results and gene lists
│   └── plots/             # All visualization outputs
├── alignments/            # BAM files from HISAT2
├── counts/                # Gene expression matrices
├── logs/                  # Processing logs
├── SCRIPT_DOCUMENTATION.md  # Detailed script usage guide
└── environment.yml        # Conda environment specification

Software Requirements

  • Conda/Mamba package manager
  • WSL2 (for Windows users)
  • 48GB RAM recommended
  • 100GB free disk space

Key Tools

  • FastQC v0.12.1
  • fastp v0.23.4
  • HISAT2 v2.2.1
  • Samtools v1.20
  • featureCounts (Subread v2.0.6)
  • DESeq2 (R/Bioconductor)

Quick Start

  1. Clone the repository
git clone [repository-url]
cd RNAseq_data
  1. Set up conda environment
conda env create -f environment_fixed.yml
conda activate rnaseq-zf
  1. Download reference genome (if not included)
# Download from NCBI
# GCF_048771995.1 (bTaeGut7.mat)
  1. Run the pipeline
# Check read lengths
bash scripts/03_check_read_length.sh

# Build genome index (memory-efficient version)
bash scripts/02_build_index_minimal_memory.sh

# Quality control on raw reads
bash scripts/01_run_fastqc.sh

# Trim adapters and filter reads
bash scripts/04_trim_reads_linux_fs.sh

# Quality control on trimmed reads
bash scripts/05_run_fastqc_trimmed.sh

# Align reads (parallel processing)
bash scripts/06_align_reads_hisat2_parallel_verbose.sh

# Count features
bash scripts/07_count_features.sh

# Differential expression analysis
bash scripts/08_run_deseq2.sh

# Analyze gene categories
bash scripts/09_run_gene_categories.sh
bash scripts/10_run_separate_plots.sh

See SCRIPT_DOCUMENTATION.md for detailed usage instructions and parameters for each script.

Performance Notes

  • Optimized for WSL2 with Linux filesystem operations
  • Parallel processing reduces alignment time by ~3x
  • Memory-efficient HISAT2 indexing for systems with <30GB RAM

Results Summary

Quality Control & Trimming (04_trim_reads_linux_fs.sh)

  • Read retention: 98%+ reads passed quality filters
  • Adapter removal: 100% of adapter contamination removed
  • Quality improvement: Q30 scores increased from ~93% to ~94.5%
  • Output location: trimmed_data/, QC reports in results/qc/fastp_reports/

Alignment (06_align_reads_hisat2_parallel_verbose.sh)

  • Alignment rates: 79-81% overall alignment across samples
  • Processing time: 42 minutes using parallel processing
  • Output location: alignments/*_aligned_sorted.bam, statistics in logs/alignment/

Gene Expression Quantification (07_count_features.sh)

  • Assignment rate: 60-65% of reads assigned to genes
  • Genes detected: 17,700-18,600 out of 20,040 total genes
  • Output location: counts/gene_counts_matrix_named.txt, counts/library_statistics.txt

Differential Expression (08_run_deseq2.sh)

  • Area X vs RA: 3,580 genes with >2-fold change
  • Area X vs HVC: 3,503 genes with >2-fold change
  • RA vs HVC: 2,241 genes with >2-fold change
  • Key findings: FOXP2 enriched in Area X, PVALB in RA
  • Output location: results/differential_expression/, plots in results/plots/

Gene Category Analysis (09_run_gene_categories.sh, 10_run_separate_plots.sh)

  • Song system genes: Region-specific expression patterns confirmed
  • Housekeeping genes: Variable stability across brain regions
  • Output location: Expression plots in results/plots/

Documentation

For detailed information about each script including parameters, usage examples, and troubleshooting, see SCRIPT_DOCUMENTATION.md.

Citation

If you use this pipeline, please cite:

  • HISAT2: Kim et al. (2019) Nature Biotechnology
  • featureCounts: Liao et al. (2014) Bioinformatics
  • DESeq2: Love et al. (2014) Genome Biology
  • fastp: Chen et al. (2018) Bioinformatics
  • Zebra finch genome: NCBI Assembly. Taeniopygia guttata isolate bTaeGut7 (zebra finch) genome assembly, bTaeGut7.mat. GCF_048771995.1. BioSample SAMN47142318. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2024 [cited 2025 Jun 04]. Available from: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_048771995.1/

Contact

Kate T Snyder Developed with assistance from Claude Sonnet 4.0

License

[KTS update when this is made public]

About

This repository contains the complete analysis pipeline for RNA-sequencing data from male zebra finch (*Taeniopygia guttata*) brain tissue from projects performed in the Creanza Lab at Vanderbilt University

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages