Project Overview - Beta

A single-cell eQTL atlas of the developing human brain

We performed single-nucleus RNA sequencing and genome-wide genotyping on cerebral cortex from 134 unrelated samples (second trimester) to generate the first cell-type-resolved eQTL atlas of the prenatal human brain.

This site is the documentation for an end-to-end computational genomics platform that processes ~3 TB of raw single-nucleus RNA sequencing and genome-wide genotyping data. The pipeline identifies genetic variants that influence gene expression in specific brain cell types during development, and links those variants to neuropsychiatric disease risk.

NOTE: the documentation and eQTL browser app are currently in beta.

[Figure: eQTL pipeline overview]


Data Engineering Architecture

This project processes ~3 TB of raw genomic data through a suite of 13 interoperable Snakemake pipelines, orchestrated on a SLURM HPC cluster. The platform was designed from the ground up for scalability, reproducibility, and collaborative reuse.

Workflow Orchestration

All pipelines are managed by Snakemake (v8.x) with a dedicated SLURM cluster profile, allowing up to 500 concurrent jobs on the c_compute_neuro1 compute partition (account scw1641). The profile handles automatic job submission, logging, and per-rule resource allocation:

# config/profile/config.yaml (excerpt)
executor: cluster-generic
jobs: 500
use-conda: true
use-singularity: true
cluster-generic-submit-cmd: >
  sbatch
    --ntasks={resources.ntasks}
    --mem={resources.mem_mb}
    --time={resources.time}
    --cpus-per-task={resources.threads}
    --account=scw1641
    --partition=c_compute_neuro1

The pipeline is launched via a single shell script that captures the full Snakemake log and emails on completion:

# workflow/snakemake.sh
snakemake --profile ../config/profile/ "$@" 2> smk-"$(date +"%d-%m-%Y")".log
mail -s "Snakemake has finished" camerond@cardiff.ac.uk < smk-"$(date +"%d-%m-%Y")".log

Centralised Configuration

All parameters, file paths, tool settings, container paths, GWAS URLs, and cell-type lists are managed in a single config/config.yaml. This means the entire platform can be reconfigured for a new dataset by editing one file — no hardcoded paths in any script or rule.

Key config-driven components include:

  • Cell types (19 entries: 7 broad + 12 subtypes) — propagated automatically to TensorQTL, SuSiE, S-LDSR, SMR, and TWAS rules via Snakemake wildcards
  • GWAS URLs — six neuropsychiatric GWAS (including SCZ, BPD, MDD, ADHD, and OCD) downloaded directly from Figshare/PGC by the pipeline
  • Container paths — eight Singularity containers mapped centrally, ensuring every rule uses the correct software environment
  • Analysis parameters — SuSiE window (1 Mb), batch count (25), FDR threshold (0.05), TensorQTL permutation bounds, SMR windows, all in one place
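To illustrate how a single config-driven cell-type list fans out into per-cell-type targets, here is a minimal Python sketch of Snakemake-style wildcard expansion. The `expand` helper below is a simplified stand-in for `snakemake.io.expand`, and the cell-type names and output paths are illustrative, not the project's actual values:

```python
from itertools import product

# Hypothetical excerpt of what config/config.yaml might hold once parsed
config = {
    "cell_types": ["ExN", "InN", "RG", "MG"],  # illustrative subset
    "fdr": 0.05,
    "susie_window": 1_000_000,
}

def expand(pattern, **wildcards):
    """Minimal stand-in for snakemake.io.expand: fill each wildcard
    with every combination of the supplied values."""
    keys = list(wildcards)
    value_lists = [wildcards[k] for k in keys]
    return [pattern.format(**dict(zip(keys, combo)))
            for combo in product(*value_lists)]

targets = expand("results/tensorqtl/{cell_type}.cis_qtl.parquet",
                 cell_type=config["cell_types"])
print(targets[0])  # results/tensorqtl/ExN.cis_qtl.parquet
```

Because every downstream rule expands the same list, adding or removing a cell type in the config automatically reshapes the whole DAG.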

Data Ingestion: JSON-driven FASTQ Management

Raw sequencing data from three plates (150 samples, 43 sublibraries) are spread across multiple sequencing runs and lanes on network-attached storage. Rather than hardcoding paths, we use a custom Python script (workflow/scripts/create_parse_json.py) to crawl FASTQ directories, match files by sample and read orientation using regex, sort across lanes and runs, and serialise the result to a JSON manifest:

# Matches filename pattern: 10_S8_L001_R1_001.fastq.gz
m = re.search(r'(\d+)_S\d+_(L\d{3})_(R\d)', file)
if m:
    sample, lane, reads = m.group(1), m.group(2), m.group(3)
    FILES[sample][reads].append(full_path)
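A self-contained sketch of the manifest-building logic around that regex (the paths and sample ID below are invented for illustration; the real script also walks directories across multiple runs and plates):

```python
import json
import re
from collections import defaultdict

# Hypothetical FASTQ paths spread across lanes and runs
paths = [
    "/nas/run2/10_S8_L002_R1_001.fastq.gz",
    "/nas/run1/10_S8_L001_R1_001.fastq.gz",
    "/nas/run1/10_S8_L001_R2_001.fastq.gz",
]

FILES = defaultdict(lambda: defaultdict(list))  # sample -> read -> [paths]
for full_path in paths:
    m = re.search(r'(\d+)_S\d+_(L\d{3})_(R\d)', full_path)
    if m:
        sample, lane, reads = m.group(1), m.group(2), m.group(3)
        FILES[sample][reads].append(full_path)

# Sort so lane/run order is deterministic, then serialise to JSON
for sample in FILES:
    for read in FILES[sample]:
        FILES[sample][read].sort()

manifest = json.dumps(FILES, indent=2)
```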

The resulting JSON (e.g. config/samples_plate3.json) maps each sample ID to its full list of R1 and R2 FASTQ paths across all lanes and runs. Snakemake ingests this at runtime via lambda input functions keyed on the sample wildcard, concatenating the correct files for each sample dynamically:

rule cat_fqs:
    input:
        r1 = lambda wildcards: MERGE_FQ[wildcards.sample]['R1'],
        r2 = lambda wildcards: MERGE_FQ[wildcards.sample]['R2']

A parallel JSON (config/bam_files.json) maps sample IDs to their processed BAM file paths for downstream genotype-aware steps (cellSNP-lite, Vireo donor deconvolution).
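In plain Python, the lambda input function above reduces to a dictionary lookup. A minimal sketch of the ingestion side, with an invented in-memory manifest standing in for config/samples_plate3.json:

```python
import json

# Stand-in for reading config/samples_plate3.json from disk
# (sample ID and paths are hypothetical)
manifest_text = '''
{
  "10": {
    "R1": ["run1/10_S8_L001_R1_001.fastq.gz", "run2/10_S8_L002_R1_001.fastq.gz"],
    "R2": ["run1/10_S8_L001_R2_001.fastq.gz", "run2/10_S8_L002_R2_001.fastq.gz"]
  }
}
'''
MERGE_FQ = json.loads(manifest_text)

# Equivalent of: lambda wildcards: MERGE_FQ[wildcards.sample]['R1']
def r1_inputs(sample):
    return MERGE_FQ[sample]["R1"]
```

Keeping the path logic in the manifest rather than the Snakefile means a new sequencing run only requires regenerating the JSON, not editing any rules.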

Reproducible Environments

Software reproducibility is enforced at two layers:

Conda environments with fully pinned dependencies are used for Python-based pipelines (Scanpy, TensorQTL). The eqtl_study.yml environment pins every package to an exact build hash, so the identical software stack is recreated across HPC nodes and in future reruns.

Singularity containers are used for R-based and specialist tools where conda environments are insufficient. Eight containers are defined in config.yaml:

Container                     Purpose
tensorqtl.sif                 GPU-accelerated eQTL mapping (PyTorch)
r_eqtl.sif                    Core R analysis environment
susier_v24.01.1.sif           SuSiE fine-mapping
seurat5f.sif                  Seurat 5 / general R
twas.sif                      FUSION TWAS weight computation
gtex_eqtl.sif                 FastQTL / GTEx tools
genotype-qc2hrc_latest.sif    Genotype QC and TOPMED imputation prep
ubuntu_22.04.sif              Lightweight shell utility container

GPU-accelerated eQTL Mapping

TensorQTL (PyTorch backend) is used for cis eQTL mapping, enabling GPU-parallelised permutation testing across all 19 cell types. Four mapping modes are run per cell type: nominal, permutation, independent, and trans. Output is stored in Parquet format for efficient downstream parsing.
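As a worked illustration of the FDR step applied to per-gene permutation p-values, here is a pure-Python Benjamini-Hochberg sketch. The pipeline itself may use a different implementation (e.g. TensorQTL's built-in q-value calculation), and the p-values below are invented:

```python
def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    qvals = [0.0] * n
    prev = 1.0
    for rank in range(n, 0, -1):   # walk from largest p to smallest
        i = order[rank - 1]
        prev = min(prev, pvals[i] * n / rank)  # enforce monotonicity
        qvals[i] = prev
    return qvals

# Invented per-gene permutation p-values
pvals = [0.001, 0.009, 0.04, 0.2, 0.7]
qvals = bh_fdr(pvals)
egenes = [i for i, q in enumerate(qvals) if q < 0.05]
print(egenes)  # → [0, 1]: genes passing the 5% FDR threshold
```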

Automated Documentation

Pipeline documentation (this site) is built with Quarto and published automatically to GitHub Pages via a GitHub Actions workflow on every push to main:

# .github/workflows/publish.yml
- name: Render and Publish
  uses: quarto-dev/quarto-actions/publish@v2
  with:
    target: gh-pages

Pipeline Overview

The 13 pipelines run in a defined order, passing outputs directly between stages:

#   Pipeline                     Input                           Output
01  Parse Alignment              Raw FASTQs                      Per-plate DGE matrices (H5AD)
02  Scanpy                       H5AD matrices                   Annotated clusters, pseudobulk counts
03  Genotypes (pre-imputation)   Raw PLINK files (hg19)          TOPMED-ready VCFs (hg38)
04  Genotypes (post-imputation)  Imputed VCFs                    Filtered, annotated, indexed VCF
05  TensorQTL                    Pseudobulk counts + VCF         cis eQTL results (nominal, perm, indep, trans)
06  eQTL Replication             eQTL results                    π₁ enrichment vs. 4 public datasets
07  Prep GWAS                    PGC summary statistics          Munged, lifted, harmonised GWAS files
08  SuSiE Fine-mapping           eQTL + VCF                      Credible sets, MaxCPP / CS95 annotations
09  S-LDSR                       Fine-mapped eQTLs + GWAS        Partitioned heritability results
10  SMR                          eQTL summary stats + GWAS       Colocalisation results
11  TWAS Weights                 Pseudobulk counts + genotypes   FUSION weight files per gene/cell type
12  cTWAS                        TWAS weights + GWAS             Causal TWAS results
13  Visualisation                All upstream results            Manuscript figures and tables

Repository

Full source code, configuration, and documentation are available at: github.com/Dazcam/eQTL_study_2025
