# Project Setup - Beta
This page covers everything needed to configure and run the pipeline from scratch: obtaining the data, installing software dependencies, and configuring the environment.
## Prerequisites
The pipeline is designed to run on a SLURM HPC cluster with access to Conda and Singularity. The following must be available on your system before proceeding:
| Requirement | Version | Notes |
|---|---|---|
| Snakemake | ≥ 8.0 | Workflow manager |
| Conda / Mamba | Any | For Python environments |
| Singularity | ≥ 3.x | For R and specialist containers |
| SLURM | Any | Job scheduler |
| Python | 3.12 | Within the Conda environment |
The pipeline has been developed and tested on the Hawk HPC system (Cardiff University, ARCCA), using the `c_compute_neuro1` partition. Resource parameters in `config/profile/config.yaml` may need adjustment for other systems.
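Before submitting jobs, it is worth confirming that the installed tool versions meet the minimums in the table above. The helper below is an illustrative sketch (not part of the pipeline) for comparing dotted version strings, e.g. against the output of `snakemake --version`:

```python
# Illustrative helper: check that an installed version string meets a
# required minimum by comparing dotted components numerically.

def meets_minimum(installed: str, required: str) -> bool:
    """Return True if `installed` >= `required` (e.g. '8.16.0' >= '8.0')."""
    to_tuple = lambda v: tuple(int(p) for p in v.split(".") if p.isdigit())
    return to_tuple(installed) >= to_tuple(required)

# Example: compare against the Snakemake >= 8.0 requirement
print(meets_minimum("8.16.0", "8.0"))  # True
print(meets_minimum("7.32.4", "8.0"))  # False
```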
## Obtaining the Data

### Raw Sequencing Data (snRNA-seq)
Raw FASTQ files are stored on restricted institutional storage and are not publicly available due to donor consent constraints. Access to the raw data can be requested by contacting Nick Bray.
The snRNA-seq data were generated using the Parse Biosciences split-pool ligation-based combinatorial indexing (SPLiT-seq) protocol across three sequencing plates:
| Plate | Sublibraries | Approx. Size |
|---|---|---|
| Plate 1 | 15 | ~1 TB |
| Plate 2 | 16 | ~1 TB |
| Plate 3 | 12 | ~1 TB |
Raw FASTQs consist of 4 lanes × multiple sequencing runs × 2 reads (R1, R2) per sublibrary. Before running the pipeline, the FASTQ manifest JSONs must be generated (see Configuring FASTQ Manifests below).
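Given that layout, the expected number of FASTQ files scales multiplicatively, which makes a quick completeness check straightforward. The function below is an illustrative sketch (the run counts used in the example are assumptions, not documented values):

```python
# Rough completeness check on raw FASTQs: each sublibrary contributes
# n_lanes x n_runs x n_reads files (4 lanes and 2 reads per the layout above).

def expected_fastq_count(n_sublibraries: int, n_runs: int,
                         n_lanes: int = 4, n_reads: int = 2) -> int:
    """Expected number of FASTQ files for a plate."""
    return n_sublibraries * n_runs * n_lanes * n_reads

# Plate 1 has 15 sublibraries; assuming, say, 2 sequencing runs:
print(expected_fastq_count(15, 2))  # 240
```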
### Genotype Data
Raw genotype data (PLINK format, hg19) were generated by genome-wide SNP array genotyping and are subject to the same access restrictions as the sequencing data. Post-imputation VCFs (TOPMED imputation server, hg38) are also available on request.
### Public Datasets

All public datasets used in the pipeline are downloaded automatically by Snakemake rules. URLs are specified in `config/config.yaml`:
GWAS summary statistics (downloaded by pipeline 07-prep-GWAS):
| Trait | Source | PMID |
|---|---|---|
| Schizophrenia | Figshare | 35396580 |
| Bipolar Disorder | Figshare | 39843750 |
| Major Depressive Disorder | Figshare | 39814019 |
| ADHD | Figshare | 36702997 |
| OCD | Figshare | 40360802 |
eQTL replication datasets (downloaded by pipeline 06-qtl-replication):
- Bryois et al. 2022 — adult single-cell eQTLs
- Ziffra et al. 2021 — fetal snATAC-seq peaks
- Wen et al. 2024 — developmental bulk brain eQTLs
- O’Brien et al. 2018 — adult bulk brain eQTLs
Reference files (downloaded by pipeline 01-parse):
- Ensembl GRCh38 release 113 FASTA and GTF (for Parse alignment reference)
- dbSNP build 156 VCF (for rsID annotation of imputed genotypes)
- S-LDSR hg38 baseline v1.2 reference files (1000 Genomes, HapMap3)
## Cloning the Repository

```sh
git clone https://github.com/Dazcam/eQTL_study_2025.git
cd eQTL_study_2025
```

## Configuring the Pipeline
All configuration is centralised in `config/config.yaml`. Before running, update the following sections:
### Root Directory

Set `root_dir` to the absolute path of your working directory on scratch storage:

```yaml
root_dir: /path/to/your/scratch/eQTL_study_2025/
```

### Container Paths
Update the `containers` block to point to your local Singularity image files (`.sif`):

```yaml
containers:
  tensorqtl: /path/to/containers/tensorqtl.sif
  genotype-qc2hrc: /path/to/containers/genotype-qc2hrc.sif
  r_eqtl: /path/to/containers/r_eqtl.sif
  susie: /path/to/containers/susier.sif
  ubuntu: /path/to/containers/ubuntu.sif
  twas: /path/to/containers/twas.sif
```

Containers can be built from their respective definition files (available on request) or pulled from the project’s container registry.
### Cell Types
The cell_types list controls which cell type-specific eQTL analyses are run. By default, 7 broad and 12 subtype labels are included. To run a subset, comment out entries as needed:
```yaml
cell_types:
  - "Glu-UL"
  - "Glu-DL"
  - "NPC"
  - "GABA"
  - "Endo-Peri"
  - "OPC"
  - "MG"
  # subtypes below
  - "Glu-UL-0"
  # ...
```

### SLURM Profile
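When editing the list it can help to see which entries are broad labels and which are subtypes. The sketch below assumes the naming convention visible in the config, where subtype labels carry a trailing numeric suffix (e.g. `"Glu-UL-0"`); this convention is inferred from the example, not documented:

```python
# Split cell_types into broad labels and subtypes, assuming subtype labels
# end in a numeric suffix (e.g. "Glu-UL-0") while broad labels do not.

def split_cell_types(cell_types: list[str]) -> tuple[list[str], list[str]]:
    broad, subtypes = [], []
    for ct in cell_types:
        (subtypes if ct.rsplit("-", 1)[-1].isdigit() else broad).append(ct)
    return broad, subtypes

broad, subtypes = split_cell_types(["Glu-UL", "GABA", "Glu-UL-0", "Glu-UL-1"])
print(broad)     # ['Glu-UL', 'GABA']
print(subtypes)  # ['Glu-UL-0', 'Glu-UL-1']
```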
The cluster submission profile is in config/profile/config.yaml. Update the --account and --partition fields for your HPC system:
```yaml
cluster-generic-submit-cmd: >
  sbatch
  --account=YOUR_ACCOUNT
  --partition=YOUR_PARTITION
  ...
```

Default resource allocations are set under `default-resources` and can be overridden per rule within each `.smk` file.
## Configuring FASTQ Manifests
The pipeline uses JSON manifests to map sample IDs to their FASTQ file paths. These must be generated before running pipeline 01-parse.
Use the provided script to crawl your FASTQ directories and produce a manifest for each plate:
```sh
python workflow/scripts/create_parse_json.py \
    --fastq_dirs /path/to/plate1_run1 /path/to/plate1_run2 \
    --plate plate1
```

This produces `samples_plate1.json` with the structure:
```json
{
  "10_plate1": {
    "R1": ["/path/to/10_S8_L001_R1_001.fastq.gz", ...],
    "R2": ["/path/to/10_S8_L001_R2_001.fastq.gz", ...]
  },
  ...
}
```

The script automatically:
- Walks all provided directories recursively
- Matches files using the Parse filename convention (`{sample}_S{n}_{lane}_{read}_001.fastq.gz`)
- Sorts R1/R2 files consistently across lanes and runs
- Appends the plate identifier to each sample name to ensure uniqueness across plates
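The steps above can be sketched in miniature. This is a simplified illustration of what `create_parse_json.py` does, not the script itself: the real script walks directories recursively, whereas this sketch takes a flat list of paths, and the regex is an assumed rendering of the Parse filename convention:

```python
# Simplified sketch of the manifest-building logic: group FASTQ paths into
# an {sample: {"R1": [...], "R2": [...]}} structure, appending the plate ID
# to each sample name and sorting reads across lanes/runs.
import re
from collections import defaultdict

# Assumed regex for the Parse convention {sample}_S{n}_{lane}_{read}_001.fastq.gz
PARSE_RE = re.compile(r"(?P<sample>.+)_S\d+_L\d{3}_(?P<read>R[12])_001\.fastq\.gz$")


def build_manifest(paths: list[str], plate: str) -> dict:
    manifest: dict = defaultdict(lambda: {"R1": [], "R2": []})
    for path in paths:
        m = PARSE_RE.search(path.rsplit("/", 1)[-1])
        if m:
            # Plate suffix keeps sample names unique across plates
            key = f"{m['sample']}_{plate}"
            manifest[key][m["read"]].append(path)
    for reads in manifest.values():
        reads["R1"].sort()
        reads["R2"].sort()
    return dict(manifest)
```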
Repeat for each plate and update `MERGE_FQ_JSON` in `config/config.yaml` to point to the relevant manifest before running.
A second JSON (`config/bam_files.json`) maps sample IDs to their processed BAM file paths for genotype-aware steps. It is generated after pipeline 01-parse completes, using `workflow/scripts/create_bam_json.py`.
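The shape of that second manifest can be sketched as follows. The per-sample directory layout (`{sample}/aligned.bam` under a results directory) is an assumption for illustration, not the pipeline's documented layout:

```python
# Minimal sketch of building a sample -> BAM path JSON, assuming one BAM
# per sample in a layout like results_dir/{sample}/aligned.bam (illustrative).
import json
from pathlib import Path


def build_bam_json(results_dir: str, out_path: str) -> dict:
    bam_map = {
        p.parent.name: str(p)                 # parent directory = sample ID
        for p in sorted(Path(results_dir).glob("*/*.bam"))
    }
    Path(out_path).write_text(json.dumps(bam_map, indent=2))
    return bam_map
```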
## Installing Conda Environments
The Python-based Scanpy pipeline uses a pinned Conda environment. Install it with:
```sh
conda env create -f workflow/envs/eqtl_study.yml
```

This environment includes Scanpy 1.10, Jupyter, Papermill (for notebook execution), doublet-detection tools (Scrublet, DoubletDetection), and all dependencies pinned to exact build hashes for full reproducibility.
Two additional lightweight environments are required for genotype-aware steps:
```sh
conda env create -f workflow/envs/cellsnp_lite.yml
conda env create -f workflow/envs/vireo.yml
```

Snakemake activates the correct environment automatically for each rule when run with `--use-conda`. Manual activation is only needed for interactive development.
## Running the Pipeline

From the `workflow/` directory:

```sh
bash snakemake.sh
```

This invokes Snakemake with the SLURM cluster profile, captures the full run log to a date-stamped file, and sends an email on completion.
To run a dry-run first (recommended for new configurations):
```sh
snakemake --profile ../config/profile/ -n --quiet
```

To run a specific pipeline module only, use the rule name or a target output file:
```sh
snakemake --profile ../config/profile/ results/02SCANPY/scanpy_clustering.html
```

Ensure `root_dir` in `config/config.yaml` points to a scratch filesystem with sufficient space (~20 TB for the full run); home directories on most HPC systems will not have sufficient quota.
## Directory Structure

```text
eQTL_study_2025/
├── config/
│   ├── config.yaml              # Master configuration
│   ├── profile/config.yaml      # SLURM cluster profile
│   ├── samples_plate{1,3}.json  # FASTQ manifests (per plate)
│   └── bam_files.json           # BAM path manifest
├── workflow/
│   ├── Snakefile                # Top-level workflow entry point
│   ├── rules/                   # 13 modular rule files (one per pipeline)
│   ├── scripts/                 # Python and R analysis scripts
│   ├── envs/                    # Pinned Conda environment YAMLs
│   └── snakemake.sh             # Launch script
├── pipelines/                   # Quarto documentation pages
├── _quarto.yml                  # Quarto site configuration
└── .github/workflows/           # GitHub Actions CI/CD (auto-publish docs)
```