# Project Setup - Beta
This page covers everything needed to configure and run the pipeline from scratch: obtaining the data, installing software dependencies, and configuring the environment.
## Prerequisites
The pipeline is designed to run on a SLURM HPC cluster with access to Conda and Singularity. The following must be available on your system before proceeding:
| Requirement | Version | Notes |
|---|---|---|
| Snakemake | ≥ 8.0 | Workflow manager |
| Conda / Mamba | Any | For Python environments |
| Singularity | ≥ 3.x | For R and specialist containers |
| SLURM | Any | Job scheduler |
| Python | 3.12 | Within the Conda environment |
The pipeline has been developed and tested on the Hawk HPC system (Cardiff University, ARCCA), using the `c_compute_neuro1` partition. Resource parameters in `config/profile/config.yaml` may need adjustment for other systems.
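Before submitting jobs, it is worth confirming that the installed tool versions meet the minimums in the table above. The helper below is an illustrative sketch (not part of the pipeline) for comparing dotted version strings, e.g. against the output of `snakemake --version`:

```python
# Illustrative helper: check that an installed version string meets a
# required minimum by comparing dotted components numerically.

def meets_minimum(installed: str, required: str) -> bool:
    """Return True if `installed` >= `required` (e.g. '8.16.0' >= '8.0')."""
    to_tuple = lambda v: tuple(int(p) for p in v.split(".") if p.isdigit())
    return to_tuple(installed) >= to_tuple(required)

# Example: compare against the Snakemake >= 8.0 requirement
print(meets_minimum("8.16.0", "8.0"))  # True
print(meets_minimum("7.32.4", "8.0"))  # False
```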
## Obtaining the Data

### Raw Sequencing Data (snRNA-seq)
Raw FASTQ files are stored on restricted institutional storage and are not publicly available due to donor consent constraints. Access to the raw data can be requested by contacting Nick Bray.
The snRNA-seq data were generated using the Parse Biosciences split-pool ligation-based combinatorial indexing (SPLiT-seq) protocol across three sequencing plates:
| Plate | Sublibraries | Approx. Size |
|---|---|---|
| Plate 1 | 15 | ~1 TB |
| Plate 2 | 16 | ~1 TB |
| Plate 3 | 12 | ~1 TB |
Raw FASTQs consist of 4 lanes × multiple sequencing runs × 2 reads (R1, R2) per sublibrary. Before running the pipeline, the FASTQ manifest JSONs must be generated (see Configuring FASTQ Manifests below).
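Given that layout, the expected number of FASTQ files scales multiplicatively, which makes a quick completeness check straightforward. The function below is an illustrative sketch (the run counts used in the example are assumptions, not documented values):

```python
# Rough completeness check on raw FASTQs: each sublibrary contributes
# n_lanes x n_runs x n_reads files (4 lanes and 2 reads per the layout above).

def expected_fastq_count(n_sublibraries: int, n_runs: int,
                         n_lanes: int = 4, n_reads: int = 2) -> int:
    """Expected number of FASTQ files for a plate."""
    return n_sublibraries * n_runs * n_lanes * n_reads

# Plate 1 has 15 sublibraries; assuming, say, 2 sequencing runs:
print(expected_fastq_count(15, 2))  # 240
```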
### Genotype Data
Raw genotype data (PLINK format, hg19) were generated by genome-wide SNP array genotyping and are subject to the same access restrictions as the sequencing data. Post-imputation VCFs (TOPMED imputation server, hg38) are also available on request.
### Public Datasets

All public datasets used in the pipeline are downloaded automatically by Snakemake rules. URLs are specified in `config/config.yaml`:
GWAS summary statistics (downloaded by pipeline 07-prep-GWAS):
| Trait | Source | PMID |
|---|---|---|
| Schizophrenia | Figshare | 35396580 |
| Bipolar Disorder | Figshare | 39843750 |
| Major Depressive Disorder | Figshare | 39814019 |
| ADHD | Figshare | 36702997 |
| OCD | Figshare | 40360802 |
eQTL replication datasets (downloaded by pipeline 06-qtl-replication):
- Bryois et al. 2022 — adult single-cell eQTLs
- Ziffra et al. 2021 — fetal snATAC-seq peaks
- Wen et al. 2024 — developmental bulk brain eQTLs
- O’Brien et al. 2018 — adult bulk brain eQTLs
Reference files (downloaded by pipeline 01-parse):
- Ensembl GRCh38 release 113 FASTA and GTF (for Parse alignment reference)
- dbSNP build 156 VCF (for rsID annotation of imputed genotypes)
- S-LDSR hg38 baseline v1.2 reference files (1000 Genomes, HapMap3)
## Cloning the Repository

```sh
git clone https://github.com/Dazcam/eQTL_study_2025.git
cd eQTL_study_2025
```

## Configuring the Pipeline
All configuration is centralised in `config/config.yaml`. Before running, update the following sections:
### Root Directory

Set `root_dir` to the absolute path of your working directory on scratch storage:

```yaml
root_dir: /path/to/your/scratch/eQTL_study_2025/
```

### Container Paths
Update the `containers` block to point to your local Singularity image files (`.sif`):

```yaml
containers:
  tensorqtl: /path/to/containers/tensorqtl.sif
  genotype-qc2hrc: /path/to/containers/genotype-qc2hrc.sif
  r_eqtl: /path/to/containers/r_eqtl.sif
  susie: /path/to/containers/susier.sif
  ubuntu: /path/to/containers/ubuntu.sif
  twas: /path/to/containers/twas.sif
```

Containers can be built from their respective definition files (available on request) or pulled from the project’s container registry.
### Cell Types
The cell_types list controls which cell type-specific eQTL analyses are run. By default, 7 broad and 12 subtype labels are included. To run a subset, comment out entries as needed:
```yaml
cell_types:
  - "Glu-UL"
  - "Glu-DL"
  - "NPC"
  - "GABA"
  - "Endo-Peri"
  - "OPC"
  - "MG"
  # subtypes below
  - "Glu-UL-0"
  # ...
```

### SLURM Profile
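When editing the list it can help to see which entries are broad labels and which are subtypes. The sketch below assumes the naming convention visible in the config, where subtype labels carry a trailing numeric suffix (e.g. `"Glu-UL-0"`); this convention is inferred from the example, not documented:

```python
# Split cell_types into broad labels and subtypes, assuming subtype labels
# end in a numeric suffix (e.g. "Glu-UL-0") while broad labels do not.

def split_cell_types(cell_types: list[str]) -> tuple[list[str], list[str]]:
    broad, subtypes = [], []
    for ct in cell_types:
        (subtypes if ct.rsplit("-", 1)[-1].isdigit() else broad).append(ct)
    return broad, subtypes

broad, subtypes = split_cell_types(["Glu-UL", "GABA", "Glu-UL-0", "Glu-UL-1"])
print(broad)     # ['Glu-UL', 'GABA']
print(subtypes)  # ['Glu-UL-0', 'Glu-UL-1']
```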
The cluster submission profile is in config/profile/config.yaml. Update the --account and --partition fields for your HPC system:
```yaml
cluster-generic-submit-cmd: >
  sbatch
  --account=YOUR_ACCOUNT
  --partition=YOUR_PARTITION
  ...
```

Default resource allocations are set under `default-resources` and can be overridden per rule within each `.smk` file.
## Configuring FASTQ Manifests
The pipeline uses JSON manifests to map sample IDs to their FASTQ file paths. These must be generated before running pipeline 01-parse.
Use the provided script to crawl your FASTQ directories and produce a manifest for each plate:
```sh
python workflow/scripts/create_parse_json.py \
    --fastq_dirs /path/to/plate1_run1 /path/to/plate1_run2 \
    --plate plate1
```

This produces `samples_plate1.json` with the structure:
```json
{
  "10_plate1": {
    "R1": ["/path/to/10_S8_L001_R1_001.fastq.gz", ...],
    "R2": ["/path/to/10_S8_L001_R2_001.fastq.gz", ...]
  },
  ...
}
```

The script automatically:
- Walks all provided directories recursively
- Matches files using the Parse filename convention (`{sample}_S{n}_{lane}_{read}_001.fastq.gz`)
- Sorts R1/R2 files consistently across lanes and runs
- Appends the plate identifier to each sample name to ensure uniqueness across plates
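The steps above can be sketched in miniature. This is a simplified illustration of what `create_parse_json.py` does, not the script itself: the real script walks directories recursively, whereas this sketch takes a flat list of paths, and the regex is an assumed rendering of the Parse filename convention:

```python
# Simplified sketch of the manifest-building logic: group FASTQ paths into
# an {sample: {"R1": [...], "R2": [...]}} structure, appending the plate ID
# to each sample name and sorting reads across lanes/runs.
import re
from collections import defaultdict

# Assumed regex for the Parse convention {sample}_S{n}_{lane}_{read}_001.fastq.gz
PARSE_RE = re.compile(r"(?P<sample>.+)_S\d+_L\d{3}_(?P<read>R[12])_001\.fastq\.gz$")


def build_manifest(paths: list[str], plate: str) -> dict:
    manifest: dict = defaultdict(lambda: {"R1": [], "R2": []})
    for path in paths:
        m = PARSE_RE.search(path.rsplit("/", 1)[-1])
        if m:
            # Plate suffix keeps sample names unique across plates
            key = f"{m['sample']}_{plate}"
            manifest[key][m["read"]].append(path)
    for reads in manifest.values():
        reads["R1"].sort()
        reads["R2"].sort()
    return dict(manifest)
```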
Repeat for each plate and update `MERGE_FQ_JSON` in `config/config.yaml` to point to the relevant manifest before running.
A second JSON (`config/bam_files.json`) maps sample IDs to their processed BAM file paths for genotype-aware steps. It is generated after pipeline 01-parse completes, using `workflow/scripts/create_bam_json.py`.
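The shape of that second manifest can be sketched as follows. The per-sample directory layout (`{sample}/aligned.bam` under a results directory) is an assumption for illustration, not the pipeline's documented layout:

```python
# Minimal sketch of building a sample -> BAM path JSON, assuming one BAM
# per sample in a layout like results_dir/{sample}/aligned.bam (illustrative).
import json
from pathlib import Path


def build_bam_json(results_dir: str, out_path: str) -> dict:
    bam_map = {
        p.parent.name: str(p)                 # parent directory = sample ID
        for p in sorted(Path(results_dir).glob("*/*.bam"))
    }
    Path(out_path).write_text(json.dumps(bam_map, indent=2))
    return bam_map
```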
## Installing Conda Environments
The Python-based Scanpy pipeline uses a pinned Conda environment. Install it with:
```sh
conda env create -f workflow/envs/eqtl_study.yml
```

This environment includes Scanpy 1.10, Jupyter, Papermill (for notebook execution), doublet-detection tools (Scrublet, DoubletDetection), and all dependencies pinned to exact build hashes for full reproducibility.
Two additional lightweight environments are required for genotype-aware steps:
```sh
conda env create -f workflow/envs/cellsnp_lite.yml
conda env create -f workflow/envs/vireo.yml
```

Snakemake activates the correct environment automatically for each rule when run with `--use-conda`. Manual activation is only needed for interactive development.
## Running the Pipeline

From the `workflow/` directory:

```sh
bash snakemake.sh
```

This invokes Snakemake with the SLURM cluster profile, captures the full run log to a date-stamped file, and sends an email on completion.
To run a dry-run first (recommended for new configurations):
```sh
snakemake --profile ../config/profile/ -n --quiet
```

To run a specific pipeline module only, use the rule name or a target output file:
```sh
snakemake --profile ../config/profile/ results/02SCANPY/scanpy_clustering.html
```

Ensure `root_dir` in `config/config.yaml` points to a scratch filesystem with sufficient space (~20 TB for the full run); home directories on most HPC systems will not have sufficient quota.
## Directory Structure

```text
eQTL_study_2025/
├── config/
│   ├── config.yaml              # Master configuration
│   ├── profile/config.yaml      # SLURM cluster profile
│   ├── samples_plate{1,3}.json  # FASTQ manifests (per plate)
│   └── bam_files.json           # BAM path manifest
├── workflow/
│   ├── Snakefile                # Top-level workflow entry point
│   ├── rules/                   # 13 modular rule files (one per pipeline)
│   ├── scripts/                 # Python and R analysis scripts
│   ├── envs/                    # Pinned Conda environment YAMLs
│   └── snakemake.sh             # Launch script
├── pipelines/                   # Quarto documentation pages
├── _quarto.yml                  # Quarto site configuration
└── .github/workflows/           # GitHub Actions CI/CD (auto-publish docs)
```