07 · Prepare GWAS Summary Statistics

Overview

This pipeline downloads, standardises, and processes GWAS summary statistics for six neuropsychiatric traits into the formats required by S-LDSR, SMR, and cTWAS. The steps are column-name normalisation, Z-score generation, sample-size addition, coordinate liftover from hg19 to hg38, and final munging for LDSR compatibility.

Workflow Logic

graph TD
    classDef snazzy fill:#f1f1f1,stroke:#333,stroke-width:2px,color:#000;
    classDef out fill:#e1f5fe,stroke:#01579b,color:#000;

    A[PGC / Figshare URLs] --> B(get_gwas_sumstats)
    B --> C(standardise_sumstats)
    C --> D(add_z_score)
    D --> E(add_N)
    E --> F(make_gwas_bed_hg19)
    F --> G(liftover_gwas_to_hg38)
    E --> H(add_hg38_coords_to_gwas)
    G --> H
    H --> I(munge_sumstats)
    I --> J[(LDSR-ready .sumstats.gz)]
    H --> K[(hg38 TSV for SMR/cTWAS)]

    class A,B,C,D,E,F,G,H,I snazzy
    class J,K out

get_gwas_sumstats: Downloads raw summary statistics from Figshare/PGC. Each GWAS has bespoke download logic (direct gunzip, or nested unzip for ADHD/PTSD/OCD archives).
standardise_sumstats: Normalises column names to a consistent schema (SNP, CHR, BP, PVAL, A1, A2, Z) using python_convert/sumstats.py, with GWAS-specific column renames handled by run: blocks.
add_z_score: Adds a Z-score column to summary statistics that lack it.
add_N: Adds a total sample size column (N = NCAS + NCON for case/control studies).
make_gwas_bed_hg19: Creates a BED file (hg19 coordinates) for LiftOver input.
liftover_gwas_to_hg38: Lifts SNP coordinates from hg19 to hg38 using the UCSC LiftOver binary and chain file.
add_hg38_coords_to_gwas: Merges hg38 coordinates back into the full summary statistics table using add_hg38_coords_to_gwas.py.
munge_sumstats: Runs LDSR munge_sumstats.py to produce final .sumstats.gz files compatible with stratified LD score regression.
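
The add_z_score step can be illustrated with a short sketch. The source does not state which formula the rule uses, so this is only one common approach, under the assumption that a file provides either an effect size with its standard error or an odds ratio with a two-sided P-value. The function names and the BETA, SE, and OR inputs are hypothetical.

```python
import math
from statistics import NormalDist


def z_from_beta_se(beta: float, se: float) -> float:
    """Z-score when an effect size and its standard error are available."""
    return beta / se


def z_from_or_p(odds_ratio: float, pval: float) -> float:
    """Z-score recovered from a two-sided P-value and an odds ratio.

    The magnitude comes from the inverse normal CDF of pval / 2 and the
    sign from log(OR), i.e. the direction of effect of the tested allele.
    pval must lie in (0, 1]; a P-value of exactly 0 raises an error.
    """
    magnitude = -NormalDist().inv_cdf(pval / 2.0)
    return math.copysign(magnitude, math.log(odds_ratio))


if __name__ == "__main__":
    print(z_from_beta_se(0.05, 0.01))
    print(z_from_or_p(0.9, 0.05))
```

Working from pval / 2 rather than 1 - pval / 2 keeps the calculation stable for very small P-values, where the latter rounds to 1 in floating point.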

GWAS Traits

Trait	Source	PMID	Notes
Schizophrenia (SCZ)	PGC Figshare	35396580	EUR only, PGC3 wave 3
Bipolar Disorder (BPD)	PGC Figshare	39843750	2024 release
Major Depressive Disorder (MDD)	PGC Figshare	39814019	2025 release, no 23andMe/UKBB
ADHD	Figshare	36702997	iPSYCH + deCODE + PGC
OCD	Figshare	40360802	no 23andMe

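Download URLs are specified in config/config.yaml under the gwas: block, and the GWAS names defined there are used as Snakemake wildcard values. Adding a new GWAS therefore starts with a new URL entry; a GWAS with a nested archive or non-standard column names also needs its own branch in the download and standardisation rules.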
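
As a rough illustration of how a gwas: block could drive the per-trait outputs, the sketch below expands a parsed config into the two final file names listed under Technical Requirements. The dictionary keys, the example.org URLs, and the results/gwas directory are placeholders, not values taken from the real config/config.yaml.

```python
# Hypothetical stand-in for the parsed `gwas:` block of config/config.yaml.
# Keys are GWAS names (the wildcard values); URLs are placeholders.
config = {
    "gwas": {
        "SCZ": "https://example.org/scz_sumstats.tsv.gz",
        "BPD": "https://example.org/bpd_sumstats.tsv.gz",
    }
}


def final_targets(cfg: dict, out_dir: str = "results/gwas") -> list:
    """List the two final outputs per GWAS name found in the config."""
    targets = []
    for gwas in sorted(cfg["gwas"]):
        targets.append(f"{out_dir}/{gwas}_hg38.tsv")
        targets.append(f"{out_dir}/{gwas}_hg38_ldsr_ready.sumstats.gz")
    return targets


if __name__ == "__main__":
    for path in final_targets(config):
        print(path)
```

In a Snakefile the same expansion would typically be written with expand() over the config keys in the target rule.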

Note

LDSR environment activation

The munge_sumstats rule requires the ldsr conda environment, which contains the Python 2.7 installation needed by the original LDSC software. The environment is activated inside the rule's shell block by evaluating conda's shell.bash hook, which avoids conflicts with the Snakemake environment:

eval "$(/apps/languages/miniforge3/24.3.0-0/bin/conda shell.bash hook)"
conda activate ldsr

Note

GWAS-specific standardisation

Because each PGC consortium uses different column naming conventions, the standardise_sumstats and add_N rules use Python run: blocks with if/elif branching on the GWAS wildcard value to apply the correct column renames before passing the file to python_convert.
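
The branching described above might look roughly like the following sketch. The per-GWAS column maps are invented for illustration and do not reflect the actual headers of any PGC release; only the target names (SNP, CHR, BP) and the N = NCAS + NCON rule come from this page.

```python
def rename_map(gwas: str) -> dict:
    """Return a source-to-standard column map for one GWAS (hypothetical maps)."""
    if gwas == "SCZ":
        return {"ID": "SNP", "CHROM": "CHR", "POS": "BP"}
    elif gwas == "ADHD":
        return {"rsid": "SNP", "chr": "CHR", "pos": "BP"}
    else:
        raise ValueError(f"No column map defined for GWAS '{gwas}'")


def standardise_row(row: dict, gwas: str) -> dict:
    """Rename the keys of one record; unmapped columns pass through unchanged."""
    mapping = rename_map(gwas)
    return {mapping.get(key, key): value for key, value in row.items()}


def add_n(row: dict) -> dict:
    """Add total sample size for a case/control study: N = NCAS + NCON."""
    out = dict(row)
    out["N"] = int(row["NCAS"]) + int(row["NCON"])
    return out


if __name__ == "__main__":
    # Toy record, not real data.
    raw = {"ID": "rs1", "CHROM": "1", "POS": "1000", "NCAS": "10", "NCON": "20"}
    print(add_n(standardise_row(raw, "SCZ")))
```

Raising on an unknown GWAS name, rather than falling through silently, makes a missing branch visible as soon as a new trait is added to the config.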

Note

Singularity container

# From repo root directory
singularity pull ubuntu_22.04.sif docker://ubuntu:22.04
mv ubuntu_22.04.sif resources/containers/

Technical Requirements

Category	Detail
Key tools	UCSC LiftOver binary, LDSC `munge_sumstats.py`, `python_convert`
Container	`ubuntu_22.04.sif` (LiftOver)
Environment	`ldsr` conda (munge step only)
Reference	hg19→hg38 UCSC chain file; HapMap3 SNP list (hg38)
Input	Raw GWAS summary stats (downloaded by pipeline)
Output	`{gwas}_hg38.tsv` (for SMR/cTWAS); `{gwas}_hg38_ldsr_ready.sumstats.gz` (for LDSR)

Resource Profile

Most rules are lightweight. The exception is add_hg38_coords_to_gwas, which joins large TSVs in Python and needs more threads, memory, and walltime:

Rule	Threads	RAM	Walltime
Most rules	1	5 GB	<30 min
`add_hg38_coords_to_gwas`	5	20 GB	5h
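
The coordinate round trip behind make_gwas_bed_hg19 and add_hg38_coords_to_gwas can be sketched in plain Python. This is not the contents of add_hg38_coords_to_gwas.py: the CHR_hg38 and BP_hg38 column names are hypothetical, and dropping SNPs that LiftOver could not map is an assumption. The BED convention itself (0-based start, exclusive end) is standard.

```python
def to_bed_row(snp: str, chrom: str, bp: int) -> tuple:
    """Convert a 1-based SNP position into a 0-based, half-open BED interval."""
    return (f"chr{chrom}", bp - 1, bp, snp)


def merge_hg38(sumstats_rows: list, lifted_bed_rows: list) -> list:
    """Attach lifted coordinates to summary-statistics records by SNP ID.

    SNPs missing from the lifted BED are dropped (an assumption here).
    """
    # Index lifted intervals by SNP name; the BED end is the 1-based position.
    hg38 = {name: (chrom, end) for chrom, _start, end, name in lifted_bed_rows}
    merged = []
    for row in sumstats_rows:
        hit = hg38.get(row["SNP"])
        if hit is None:
            continue
        chrom, pos = hit
        chrom = chrom[3:] if chrom.startswith("chr") else chrom
        merged.append({**row, "CHR_hg38": chrom, "BP_hg38": pos})
    return merged


if __name__ == "__main__":
    # Toy records, not real coordinates.
    rows = [
        {"SNP": "rs1", "CHR": "1", "BP": 1000},
        {"SNP": "rs2", "CHR": "2", "BP": 2000},
    ]
    bed_hg19 = [to_bed_row(r["SNP"], r["CHR"], r["BP"]) for r in rows]
    lifted = [("chr1", 1499, 1500, "rs1")]  # stand-in for LiftOver output
    print(bed_hg19)
    print(merge_hg38(rows, lifted))
```

A dictionary lookup keeps only the lifted coordinates in memory while the summary statistics are streamed, which is one way to limit RAM when the tables are large.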