This repository contains the complete analysis pipeline for RNA-sequencing data from three song system nuclei of adult male zebra finch (Taeniopygia guttata):
- Area X - Basal ganglia nucleus involved in song learning
- RA (Robust nucleus of the arcopallium) - Motor control of song
- HVC - Sensorimotor integration
- Organism: Zebra finch (ID: or163), adult male
- Sequencing: Illumina paired-end, 150bp reads
- Samples:
- 12590-KS-1: Area X (94.7M reads)
- 12590-KS-2: RA (102.3M reads)
- 12590-KS-3: HVC (91.7M reads)
- Reference Genome: bTaeGut7.mat (maternal genome assembly)
- FastQC analysis of raw reads (
01_run_fastqc.sh) - Adapter trimming and quality filtering with fastp (
04_trim_reads_linux_fs.sh) - Post-trimming quality assessment (
05_run_fastqc_trimmed.sh) - Outputs:
results/qc/fastqc_raw/,results/qc/fastqc_trimmed/,results/qc/fastp_reports/
- HISAT2 splice-aware alignment to reference genome (
06_align_reads_hisat2_parallel_verbose.sh) - Optimized for parallel processing on WSL2
- Outputs:
alignments/*_aligned_sorted.bam,logs/alignment/
- featureCounts for gene-level expression (
07_count_features.sh) - Generation of count matrix with 60-65% assignment rate
- Outputs:
counts/gene_counts_matrix_named.txt,counts/library_statistics.txt
- DESeq2 analysis for pairwise comparisons (
08_run_deseq2.sh) - Fold-change based analysis (single replicates)
- Gene category analysis (
09_run_gene_categories.sh,10_run_separate_plots.sh) - Outputs:
results/differential_expression/,results/plots/
.
├── scripts/ # Analysis scripts (see SCRIPT_DOCUMENTATION.md for details)
│ ├── 01_run_fastqc.sh
│ ├── 02_build_index_minimal_memory.sh
│ ├── 03_check_read_length.sh
│ ├── 04_trim_reads_linux_fs.sh
│ ├── 05_run_fastqc_trimmed.sh
│ ├── 06_align_reads_hisat2_parallel_verbose.sh
│ ├── 07_count_features.sh
│ ├── 08_run_deseq2.sh & 08_deseq2_analysis.R
│ ├── 09_run_gene_categories.sh & 09_extract_gene_categories.R
│ └── 10_run_separate_plots.sh & 10_plot_gene_categories_separately.R
├── reference/ # Reference genome and annotations
│ ├── bTaeGut7.mat.gtf
│ └── hisat2_index/
├── results/
│ ├── qc/ # Quality control reports
│ ├── differential_expression/ # DE results and gene lists
│ └── plots/ # All visualization outputs
├── alignments/ # BAM files from HISAT2
├── counts/ # Gene expression matrices
├── logs/ # Processing logs
├── SCRIPT_DOCUMENTATION.md # Detailed script usage guide
└── environment.yml # Conda environment specification
- Conda/Mamba package manager
- WSL2 (for Windows users)
- 48GB RAM recommended
- 100GB free disk space
- FastQC v0.12.1
- fastp v0.23.4
- HISAT2 v2.2.1
- Samtools v1.20
- featureCounts (Subread v2.0.6)
- DESeq2 (R/Bioconductor)
- Clone the repository
git clone [repository-url]
cd RNAseq_data- Set up conda environment
conda env create -f environment_fixed.yml
conda activate rnaseq-zf- Download reference genome (if not included)
# Download from NCBI
# GCF_048771995.1 (bTaeGut7.mat)- Run the pipeline
# Check read lengths
bash scripts/03_check_read_length.sh
# Build genome index (memory-efficient version)
bash scripts/02_build_index_minimal_memory.sh
# Quality control on raw reads
bash scripts/01_run_fastqc.sh
# Trim adapters and filter reads
bash scripts/04_trim_reads_linux_fs.sh
# Quality control on trimmed reads
bash scripts/05_run_fastqc_trimmed.sh
# Align reads (parallel processing)
bash scripts/06_align_reads_hisat2_parallel_verbose.sh
# Count features
bash scripts/07_count_features.sh
# Differential expression analysis
bash scripts/08_run_deseq2.sh
# Analyze gene categories
bash scripts/09_run_gene_categories.sh
bash scripts/10_run_separate_plots.shSee SCRIPT_DOCUMENTATION.md for detailed usage instructions and parameters for each script.
- Optimized for WSL2 with Linux filesystem operations
- Parallel processing reduces alignment time by ~3x
- Memory-efficient HISAT2 indexing for systems with <30GB RAM
- Read retention: 98%+ reads passed quality filters
- Adapter removal: 100% of adapter contamination removed
- Quality improvement: Q30 scores increased from ~93% to ~94.5%
- Output location:
trimmed_data/, QC reports inresults/qc/fastp_reports/
- Alignment rates: 79-81% overall alignment across samples
- Processing time: 42 minutes using parallel processing
- Output location:
alignments/*_aligned_sorted.bam, statistics inlogs/alignment/
- Assignment rate: 60-65% of reads assigned to genes
- Genes detected: 17,700-18,600 out of 20,040 total genes
- Output location:
counts/gene_counts_matrix_named.txt,counts/library_statistics.txt
- Area X vs RA: 3,580 genes with >2-fold change
- Area X vs HVC: 3,503 genes with >2-fold change
- RA vs HVC: 2,241 genes with >2-fold change
- Key findings: FOXP2 enriched in Area X, PVALB in RA
- Output location:
results/differential_expression/, plots inresults/plots/
- Song system genes: Region-specific expression patterns confirmed
- Housekeeping genes: Variable stability across brain regions
- Output location: Expression plots in
results/plots/
For detailed information about each script including parameters, usage examples, and troubleshooting, see SCRIPT_DOCUMENTATION.md.
If you use this pipeline, please cite:
- HISAT2: Kim et al. (2019) Nature Biotechnology
- featureCounts: Liao et al. (2014) Bioinformatics
- DESeq2: Love et al. (2014) Genome Biology
- fastp: Chen et al. (2018) Bioinformatics
- Zebra finch genome: NCBI Assembly. Taeniopygia guttata isolate bTaeGut7 (zebra finch) genome assembly, bTaeGut7.mat. GCF_048771995.1. BioSample SAMN47142318. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2024 [cited 2025 Jun 04]. Available from: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_048771995.1/
Kate T Snyder Developed with assistance from Claude Sonnet 4.0
[KTS update when this is made public]