graph TD
classDef snazzy fill:#f1f1f1,stroke:#333,stroke-width:2px,color:#000;
classDef out fill:#e1f5fe,stroke:#01579b,color:#000;
A[PGC / Figshare URLs] --> B(get_gwas_sumstats)
B --> C(standardise_sumstats)
C --> D(add_z_score)
D --> E(add_N)
E --> F(make_gwas_bed_hg19)
F --> G(liftover_gwas_to_hg38)
E --> H(add_hg38_coords_to_gwas)
G --> H
H --> I(munge_sumstats)
I --> J[(LDSR-ready .sumstats.gz)]
H --> K[(hg38 TSV for SMR/cTWAS)]
class A,B,C,D,E,F,G,H,I snazzy
class J,K out
07 · Prepare GWAS Summary Statistics
Overview
This pipeline downloads, standardises, and processes GWAS summary statistics for six neuropsychiatric traits into formats required by S-LDSR, SMR, and cTWAS. Steps include format normalisation, Z-score generation and sample size addition, coordinate liftover from hg19 to hg38, and final munging for LDSR compatibility.
Workflow Logic
1. get_gwas_sumstats: Downloads raw summary statistics from Figshare/PGC. Each GWAS has bespoke download logic (direct gunzip, or nested unzip for the ADHD, PTSD, and OCD archives).
2. standardise_sumstats: Normalises column names to a consistent schema (SNP, CHR, BP, PVAL, A1, A2, Z) using python_convert/sumstats.py, with GWAS-specific column renames handled by run: blocks.
3. add_z_score: Adds a Z-score column to summary statistics that lack it.
4. add_N: Adds a total sample size column (N = NCAS + NCON for case/control studies).
5. make_gwas_bed_hg19: Creates a BED file (hg19 coordinates) for LiftOver input.
6. liftover_gwas_to_hg38: Lifts SNP coordinates from hg19 to hg38 using the UCSC LiftOver binary and chain file.
7. add_hg38_coords_to_gwas: Merges hg38 coordinates back into the full summary statistics table using add_hg38_coords_to_gwas.py.
8. munge_sumstats: Runs LDSR munge_sumstats.py to produce final .sumstats.gz files compatible with stratified LD score regression.
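The arithmetic behind add_z_score, add_N, and make_gwas_bed_hg19 can be sketched in plain Python. This is a minimal illustration, not the pipeline's own code: the helper bodies, the raw column names BETA, OR, SE, NCAS, and NCON, and the chr prefix on chromosome names are assumptions. The Z-score is taken as effect size divided by standard error, one common convention, with odds ratios moved to the log scale first.

```python
import math

def add_z_score(row):
    """Return a copy of row with Z = effect / SE."""
    out = dict(row)
    # Assumption: the effect is stored as BETA, or as OR (converted to log scale).
    beta = float(row["BETA"]) if "BETA" in row else math.log(float(row["OR"]))
    out["Z"] = beta / float(row["SE"])
    return out

def add_n(row):
    """Return a copy of row with total sample size N = NCAS + NCON."""
    out = dict(row)
    out["N"] = int(row["NCAS"]) + int(row["NCON"])
    return out

def to_bed_fields(row):
    """BED intervals are 0-based and half-open, so a SNP at 1-based BP spans [BP-1, BP)."""
    bp = int(row["BP"])
    return ["chr" + str(row["CHR"]), str(bp - 1), str(bp), row["SNP"]]

# Hypothetical SNP record, for illustration only.
snp = {"SNP": "rs123", "CHR": "1", "BP": "1000",
       "BETA": "0.5", "SE": "0.25", "NCAS": "100", "NCON": "200"}
snp = add_n(add_z_score(snp))
print(snp["Z"], snp["N"], "\t".join(to_bed_fields(snp)))
```

The off-by-one in to_bed_fields matters: LiftOver reads the BED start as 0-based, so the lifted end column is the one that carries the 1-based hg38 position.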
GWAS Traits
| Trait | Source | PMID | Notes |
|---|---|---|---|
| Schizophrenia (SCZ) | PGC Figshare | 35396580 | EUR only, PGC3 wave 3 |
| Bipolar Disorder (BPD) | PGC Figshare | 39843750 | 2024 release |
| Major Depressive Disorder (MDD) | PGC Figshare | 39814019 | 2025 release, no 23andMe/UKBB |
| ADHD | Figshare | 36702997 | iPSYCH + deCODE + PGC |
| OCD | Figshare | 40360802 | no 23andMe |
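Download URLs are specified in config/config.yaml under the gwas: block, and each GWAS is referenced in the rules by its Snakemake wildcard value. Adding a new GWAS therefore starts with a new URL entry; a dataset with a different archive layout or different column names also needs its own branch in the download and standardisation rules.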
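As a rough sketch of how a config-driven GWAS list can determine the pipeline's final targets: the dictionary below stands in for the parsed gwas: block, with placeholder keys and URLs (the real entries are not reproduced here). The two file patterns are the outputs listed under Technical Requirements; the results directory and the expected_outputs helper are hypothetical.

```python
# Stand-in for the parsed `gwas:` block of config/config.yaml.
# Keys and URLs are placeholders, not the pipeline's real entries.
config = {
    "gwas": {
        "SCZ": "https://example.org/scz.tsv.gz",
        "BPD": "https://example.org/bpd.tsv.gz",
    }
}

# Final outputs per GWAS, as listed under Technical Requirements.
OUTPUT_PATTERNS = ["{gwas}_hg38.tsv", "{gwas}_hg38_ldsr_ready.sumstats.gz"]

def expected_outputs(cfg, out_dir="results"):
    """Expand every GWAS name in the config into its final output paths."""
    return [
        f"{out_dir}/{pattern.format(gwas=name)}"
        for name in sorted(cfg["gwas"])
        for pattern in OUTPUT_PATTERNS
    ]

for path in expected_outputs(config):
    print(path)
```

In a Snakefile, a list like this would typically be the input of the top-level target rule, so that a new config key automatically requests both output files for that GWAS.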
LDSR environment activation
The munge_sumstats rule requires the ldsr conda environment, which contains the Python 2.7 installation needed by the original LDSC software. The environment is activated inside the rule's shell block by evaluating conda's shell hook, which avoids conflicts with the Snakemake environment:
eval "$(/apps/languages/miniforge3/24.3.0-0/bin/conda shell.bash hook)"
conda activate ldsr

GWAS-specific standardisation
Because each PGC consortium uses different column naming conventions, the standardise_sumstats and add_N rules use Python run: blocks with if/elif branching per GWAS wildcard value to apply the correct column renames before passing to python_convert.
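The per-GWAS renaming can be illustrated with a lookup table, which is an alternative way of writing the same dispatch as an if/elif chain. This is a sketch under assumptions: the raw header names and the lower-case GWAS keys below are hypothetical, and only the target names (SNP, CHR, BP, PVAL) come from the schema described under Workflow Logic.

```python
# Hypothetical raw-to-standard header mappings, one entry per GWAS wildcard value.
RENAMES = {
    "scz": {"ID": "SNP", "CHROM": "CHR", "POS": "BP"},
    "adhd": {"P": "PVAL"},
}

def rename_header(gwas, header):
    """Map a raw header onto the standard schema for the given GWAS."""
    try:
        mapping = RENAMES[gwas]
    except KeyError:
        # Fail loudly rather than pass an unstandardised file downstream.
        raise ValueError(f"no rename rules defined for GWAS {gwas!r}")
    # Columns without a rule keep their original name.
    return [mapping.get(col, col) for col in header]

print(rename_header("scz", ["ID", "CHROM", "POS", "A1", "A2"]))
```

Raising on an unknown wildcard value mirrors what a final else branch would do in the if/elif form: a GWAS without rename rules stops the rule instead of producing a file with unexpected columns.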
Singularity container
# From repo root directory
singularity pull ubuntu_22.04.sif docker://ubuntu:22.04
mv ubuntu_22.04.sif resources/containers/

Technical Requirements
| Category | Detail |
|---|---|
| Key tools | UCSC LiftOver binary, LDSC munge_sumstats.py, python_convert |
| Container | ubuntu_22.04.sif (LiftOver) |
| Environment | ldsr conda (munge step only) |
| Reference | hg19→hg38 UCSC chain file; HapMap3 SNP list (hg38) |
| Input | Raw GWAS summary stats (downloaded by pipeline) |
| Output | {gwas}_hg38.tsv (for SMR/cTWAS); {gwas}_hg38_ldsr_ready.sumstats.gz (for LDSR) |
Resource Profile
Most rules are lightweight. The exception is add_hg38_coords_to_gwas, which joins large TSVs in Python and needs more threads, memory, and walltime:
| Rule | Threads | RAM | Walltime |
|---|---|---|---|
| Most rules | 1 | 5 GB | <30 min |
| add_hg38_coords_to_gwas | 5 | 20 GB | 5 h |
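The join performed by add_hg38_coords_to_gwas could look roughly like the following. It is a sketch rather than the contents of add_hg38_coords_to_gwas.py: it assumes the lifted BED has four columns (chromosome, start, end, SNP ID), that the summary statistics are tab-separated with a SNP column, and that SNPs which failed LiftOver are dropped. The output column names CHR_hg38 and BP_hg38 are hypothetical.

```python
import csv
import io

def load_hg38_positions(bed_lines):
    """Build a SNP -> (chromosome, 1-based position) lookup from lifted BED lines."""
    positions = {}
    for line in bed_lines:
        chrom, _start, end, snp = line.rstrip("\n").split("\t")[:4]
        if chrom.startswith("chr"):
            chrom = chrom[3:]
        # The BED end column equals the 1-based SNP position.
        positions[snp] = (chrom, end)
    return positions

def add_hg38_coords(sumstats_handle, positions, out_handle):
    """Stream the sumstats, append hg38 coordinates, and return the rows kept."""
    reader = csv.DictReader(sumstats_handle, delimiter="\t")
    fields = list(reader.fieldnames) + ["CHR_hg38", "BP_hg38"]
    writer = csv.DictWriter(out_handle, fieldnames=fields,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:
        hit = positions.get(row["SNP"])
        if hit is None:
            continue  # assumption: unmapped SNPs are dropped
        row["CHR_hg38"], row["BP_hg38"] = hit
        writer.writerow(row)
        kept += 1
    return kept

# Tiny in-memory example with made-up SNPs.
bed = ["chr1\t1099\t1100\trs1\n"]
sumstats = io.StringIO("SNP\tCHR\tBP\nrs1\t1\t1000\nrs2\t1\t2000\n")
out = io.StringIO()
print(add_hg38_coords(sumstats, load_hg38_positions(bed), out))
print(out.getvalue(), end="")
```

Only the position lookup is held in memory here while the summary statistics are streamed row by row. A join that loads both tables into data frames would need considerably more RAM.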