07 · Prepare GWAS Summary Statistics

Overview

This pipeline downloads, standardises, and processes GWAS summary statistics for six neuropsychiatric traits into formats required by S-LDSR, SMR, and cTWAS. Steps include format normalisation, Z-score generation and sample size addition, coordinate liftover from hg19 to hg38, and final munging for LDSR compatibility.


Workflow Logic

graph TD
    classDef snazzy fill:#f1f1f1,stroke:#333,stroke-width:2px,color:#000;
    classDef out fill:#e1f5fe,stroke:#01579b,color:#000;

    A[PGC / Figshare URLs] --> B(get_gwas_sumstats)
    B --> C(standardise_sumstats)
    C --> D(add_z_score)
    D --> E(add_N)
    E --> F(make_gwas_bed_hg19)
    F --> G(liftover_gwas_to_hg38)
    E --> H(add_hg38_coords_to_gwas)
    G --> H
    H --> I(munge_sumstats)
    I --> J[(LDSR-ready .sumstats.gz)]
    H --> K[(hg38 TSV for SMR/cTWAS)]

    class A,B,C,D,E,F,G,H,I snazzy
    class J,K out

  1. get_gwas_sumstats: Downloads raw summary statistics from Figshare/PGC. Each GWAS has bespoke download logic (direct gunzip, or nested unzip for ADHD/PTSD/OCD archives).
  2. standardise_sumstats: Normalises column names to a consistent schema (SNP, CHR, BP, PVAL, A1, A2, Z) using python_convert/sumstats.py, with GWAS-specific column renames handled by run: blocks.
  3. add_z_score: Adds a Z-score column to summary statistics that lack it.
  4. add_N: Adds a total sample size column (N = NCAS + NCON for case/control studies).
  5. make_gwas_bed_hg19: Creates a BED file (hg19 coordinates) for LiftOver input.
  6. liftover_gwas_to_hg38: Lifts SNP coordinates from hg19 to hg38 using the UCSC LiftOver binary and chain file.
  7. add_hg38_coords_to_gwas: Merges hg38 coordinates back into the full summary statistics table using add_hg38_coords_to_gwas.py.
  8. munge_sumstats: Runs LDSR munge_sumstats.py to produce final .sumstats.gz files compatible with stratified LD score regression.
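The arithmetic behind steps 3 to 5 can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the source does not state how Z is derived, so the sketch assumes the common conventions Z = BETA / SE, or Z = ln(OR) / SE when a study reports odds ratios. The function names are hypothetical.

```python
import math


def z_from_beta(beta: float, se: float) -> float:
    """Z-score as the effect size divided by its standard error (assumed convention)."""
    return beta / se


def z_from_or(odds_ratio: float, se: float) -> float:
    """Z-score from an odds ratio; SE is assumed to be on the log-odds scale."""
    return math.log(odds_ratio) / se


def total_n(ncas: int, ncon: int) -> int:
    """Total sample size for a case/control study: N = NCAS + NCON."""
    return ncas + ncon


def bed_line(chrom: str, bp: int, snp: str) -> str:
    """Format one SNP as a BED record for LiftOver.

    BED intervals are 0-based and half-open, so a 1-based position bp
    becomes the interval [bp - 1, bp). UCSC chain files use 'chr' prefixes.
    """
    chrom = chrom if chrom.startswith("chr") else f"chr{chrom}"
    return f"{chrom}\t{bp - 1}\t{bp}\t{snp}"


if __name__ == "__main__":
    print(z_from_beta(0.05, 0.01))
    print(total_n(100, 200))
    print(bed_line("1", 1000, "rs123"))
```

Carrying the SNP identifier in the fourth BED column is what would allow the lifted coordinates to be matched back to the summary statistics in step 7.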

GWAS Traits

| Trait | Source | PMID | Notes |
|---|---|---|---|
| Schizophrenia (SCZ) | PGC Figshare | 35396580 | EUR only, PGC3 wave 3 |
| Bipolar Disorder (BPD) | PGC Figshare | 39843750 | 2024 release |
| Major Depressive Disorder (MDD) | PGC Figshare | 39814019 | 2025 release, no 23andMe/UKBB |
| ADHD | Figshare | 36702997 | iPSYCH + deCODE + PGC |
| OCD | Figshare | 40360802 | no 23andMe |

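Download URLs are specified in config/config.yaml under the gwas: block and are passed as Snakemake wildcards, so adding a new GWAS starts with a new URL entry. A dataset with an unusual archive layout or non-standard column names also needs a matching branch in the GWAS-specific rules.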
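As a rough sketch of this config-driven design, the snippet below derives the final output paths from the entries of a gwas: block. The dictionary stands in for the parsed config/config.yaml; its keys, the example.org URLs, the results directory, and the final_targets helper are all hypothetical. Only the two filename patterns come from this page.

```python
# Hypothetical stand-in for the parsed gwas: block of config/config.yaml.
config = {
    "gwas": {
        "scz": "https://example.org/scz_sumstats.tsv.gz",  # placeholder URL
        "bpd": "https://example.org/bpd_sumstats.tsv.gz",  # placeholder URL
    }
}


def final_targets(config: dict, out_dir: str = "results") -> list:
    """List the two final outputs per GWAS key, one key per wildcard value."""
    targets = []
    for gwas in config["gwas"]:
        targets.append(f"{out_dir}/{gwas}_hg38.tsv")
        targets.append(f"{out_dir}/{gwas}_hg38_ldsr_ready.sumstats.gz")
    return targets


if __name__ == "__main__":
    for path in final_targets(config):
        print(path)
```

In a Snakefile, a list like this would typically feed the input of the top-level target rule, so a new config entry automatically adds two new targets.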


Note

LDSR environment activation

The munge_sumstats rule requires the ldsr conda environment, which contains a Python 2.7 installation needed by the original LDSC software. This is activated inside the shell block using an eval hook to avoid conflicts with the Snakemake environment:

eval "$(/apps/languages/miniforge3/24.3.0-0/bin/conda shell.bash hook)"
conda activate ldsr

Note

GWAS-specific standardisation

Because each PGC consortium uses different column naming conventions, the standardise_sumstats and add_N rules use Python run: blocks with if/elif branching per GWAS wildcard value to apply the correct column renames before passing to python_convert.
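The branching pattern might look roughly like the sketch below. The wildcard values and every source column name here are invented for illustration; the real mappings live in the rules' run: blocks and are not reproduced on this page.

```python
def column_renames(gwas: str) -> dict:
    """Return a source-to-schema column mapping for one GWAS wildcard value.

    All wildcard values and source column names are hypothetical examples.
    """
    if gwas == "scz":
        return {"ID": "SNP", "CHROM": "CHR", "POS": "BP"}
    elif gwas == "adhd":
        return {"SNPID": "SNP", "P": "PVAL"}
    else:
        # Failing loudly makes a missing branch obvious when a new GWAS is added.
        raise ValueError(f"No column mapping defined for GWAS '{gwas}'")


def rename_header(header: list, renames: dict) -> list:
    """Apply the mapping to a header row, leaving unmapped columns unchanged."""
    return [renames.get(col, col) for col in header]


if __name__ == "__main__":
    print(rename_header(["ID", "CHROM", "POS", "PVAL"], column_renames("scz")))
```

Raising on an unknown wildcard value, rather than falling through silently, is one way to keep a newly added GWAS from reaching python_convert with unnormalised columns.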


Note

Singularity container

# From repo root directory
singularity pull ubuntu_22.04.sif docker://ubuntu:22.04
mv ubuntu_22.04.sif resources/containers/

Technical Requirements

| Category | Detail |
|---|---|
| Key tools | UCSC LiftOver binary, LDSC munge_sumstats.py, python_convert |
| Container | ubuntu_22.04.sif (LiftOver) |
| Environment | ldsr conda (munge step only) |
| Reference | hg19→hg38 UCSC chain file; HapMap3 SNP list (hg38) |
| Input | Raw GWAS summary stats (downloaded by pipeline) |
| Output | {gwas}_hg38.tsv (for SMR/cTWAS); {gwas}_hg38_ldsr_ready.sumstats.gz (for LDSR) |

Resource Profile

All rules are lightweight. The most resource-intensive step is add_hg38_coords_to_gwas, which joins large TSVs in Python:

| Rule | Threads | RAM | Walltime |
|---|---|---|---|
| Most rules | 1 | 5 GB | < 30 min |
| add_hg38_coords_to_gwas | 5 | 20 GB | 5 h |
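For orientation, the kind of join performed by add_hg38_coords_to_gwas can be sketched with the standard library alone. This is not the contents of add_hg38_coords_to_gwas.py: the CHR_hg38 and BP_hg38 column names, the four-column BED layout, and the choice to drop SNPs that failed LiftOver are all assumptions made for the example.

```python
import csv
import io


def load_hg38_positions(bed_handle) -> dict:
    """Map SNP ID -> (chrom, 1-based hg38 position) from lifted BED lines."""
    positions = {}
    for line in bed_handle:
        chrom, _start, end, snp = line.rstrip("\n").split("\t")[:4]
        positions[snp] = (chrom, int(end))  # BED end equals the 1-based position
    return positions


def add_hg38_coords(sumstats_handle, positions: dict, out_handle) -> int:
    """Stream the summary statistics, appending hg38 columns; return rows kept."""
    reader = csv.DictReader(sumstats_handle, delimiter="\t")
    fields = list(reader.fieldnames) + ["CHR_hg38", "BP_hg38"]
    writer = csv.DictWriter(out_handle, fieldnames=fields,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:
        hit = positions.get(row["SNP"])
        if hit is None:
            continue  # assumed behaviour: drop SNPs that did not lift over
        row["CHR_hg38"], row["BP_hg38"] = hit
        writer.writerow(row)
        kept += 1
    return kept


if __name__ == "__main__":
    bed = io.StringIO("chr1\t1099\t1100\trs1\n")
    sumstats = io.StringIO("SNP\tCHR\tBP\nrs1\t1\t1000\nrs2\t1\t2000\n")
    out = io.StringIO()
    print(add_hg38_coords(sumstats, load_hg38_positions(bed), out))
    print(out.getvalue(), end="")
```

Holding only the SNP-to-position map in memory and streaming the larger table row by row is one way to keep a join like this within a fixed RAM budget.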