03 · Genotype QC pre-imputation

Overview

This pipeline prepares raw SNP array genotyping data (PLINK PED/MAP format, hg19) for submission to the TOPMed Imputation Server. It covers format conversion, sample-level QC (sex assignment, heterozygosity check, duplicate removal), SNP-level QC (strand ambiguity, MAF filtering), ancestry inference using 1000 Genomes Project reference data, coordinate liftover to hg38, and chromosome-split VCF output ready for imputation upload.

Workflow Architecture

The workflow is built around the GenotypeQCtoHRC R suite; the diagram below shows how the rules depend on one another.

graph TD
    classDef snazzy fill:#f1f1f1,stroke:#333,stroke-width:2px,color:#000;
    classDef out fill:#e1f5fe,stroke:#01579b,color:#000;

    A[Raw PED/MAP hg19] --> B(convert_raw)
    B --> C(add_sex_to_fam)
    C --> D(rm_dup_sample)
    D --> E(check_het)
    D --> F(make_kgp3_pgen)
    F --> G(genotype_qc2hrc)
    D --> G
    G --> H(cat_genotypes)
    H --> I(rm_maf_ambig)
    I --> J(rm_rare_snps)
    J --> K(split_chrs)
    K --> L(gather_stats)
    L --> M(geno_pre_report)
    K --> N[(Per-chr VCFs for TOPMed)]

    class A,B,C,D,E,F,G,H,I,J,K,L,M snazzy
    class N out
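
To make the dependencies in the diagram above easier to query, here is a minimal Python sketch that transcribes the edges and walks them backwards from a target. The node labels `raw_ped_map` and `per_chr_vcfs` (for nodes A and N) and the helper `upstream_of` are hypothetical names introduced for illustration; the pipeline itself does not define them.

```python
from collections import defaultdict

# Edges transcribed from the diagram (upstream -> downstream).
# "raw_ped_map" and "per_chr_vcfs" are hypothetical labels for nodes A and N.
EDGES = [
    ("raw_ped_map", "convert_raw"),
    ("convert_raw", "add_sex_to_fam"),
    ("add_sex_to_fam", "rm_dup_sample"),
    ("rm_dup_sample", "check_het"),
    ("rm_dup_sample", "make_kgp3_pgen"),
    ("make_kgp3_pgen", "genotype_qc2hrc"),
    ("rm_dup_sample", "genotype_qc2hrc"),
    ("genotype_qc2hrc", "cat_genotypes"),
    ("cat_genotypes", "rm_maf_ambig"),
    ("rm_maf_ambig", "rm_rare_snps"),
    ("rm_rare_snps", "split_chrs"),
    ("split_chrs", "gather_stats"),
    ("gather_stats", "geno_pre_report"),
    ("split_chrs", "per_chr_vcfs"),
]


def upstream_of(target, edges=EDGES):
    """Return every node that must run before `target`, following edges backwards."""
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)

    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    for name in sorted(upstream_of("per_chr_vcfs")):
        print(name)
```

As drawn, `check_het`, `gather_stats` and `geno_pre_report` are side branches: they do not appear upstream of the per-chromosome VCFs.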

convert_raw: Converts PED/MAP to PLINK binary format (BED/BIM/FAM) and generates allele frequency report.
add_sex_to_fam: Standardises FAM file and updates sex annotations from a phenotype sheet.
rm_dup_sample: Removes a known duplicate sample (ID 14493) identified during QC.
check_het: Computes autosomal heterozygosity on LD-pruned SNPs for outlier detection.
make_kgp3_pgen: Prepares the 1000 Genomes Project Phase 3 reference panel in PLINK2 PGEN format for ancestry inference.
genotype_qc2hrc: Runs the GenotypeQCtoHRC R pipeline (Singularity container). Performs comprehensive sample and SNP QC, ancestry inference, strand alignment to TOPMed reference, and liftover from hg19 → hg38. Outputs per-chromosome VCFs.
cat_genotypes: Concatenates the 22 per-chromosome VCFs from GenotypeQCtoHRC into a single genome-wide VCF.
rm_maf_ambig: Removes strand-ambiguous SNPs (A/T or C/G) with MAF > 0.4, where strand cannot be reliably resolved.
rm_rare_snps: Removes SNPs with MAF < 0.01.
split_chrs: Converts to VCF and splits by chromosome (chr1–22) for TOPMed upload.
gather_stats: Counts SNPs per chromosome at the GenotypeQCtoHRC and final stages for QC reporting.
geno_pre_report: Renders an RMarkdown HTML QC report summarising all filtering steps.
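
The two SNP-level filters described for rm_maf_ambig and rm_rare_snps can be sketched in plain Python as follows. This only illustrates the filtering logic using the thresholds stated above (0.4 and 0.01); the function names are hypothetical, and the pipeline's actual rules presumably rely on PLINK or BCFtools rather than code like this.

```python
# Allele pairs that read the same on both strands, so a strand flip cannot be detected.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_strand_ambiguous(a1, a2):
    """True for A/T or C/G SNPs (in either order, case-insensitive)."""
    return frozenset((a1.upper(), a2.upper())) in AMBIGUOUS_PAIRS


def minor_allele_freq(freq):
    """Fold an allele frequency onto the minor allele (range 0 to 0.5)."""
    return min(freq, 1.0 - freq)


def keep_variant(a1, a2, freq, ambig_maf=0.4, min_maf=0.01):
    """Apply both filters: drop rare SNPs and drop ambiguous SNPs with MAF above ambig_maf."""
    maf = minor_allele_freq(freq)
    if maf < min_maf:
        return False  # rm_rare_snps: MAF < 0.01
    if is_strand_ambiguous(a1, a2) and maf > ambig_maf:
        return False  # rm_maf_ambig: A/T or C/G with MAF > 0.4
    return True


if __name__ == "__main__":
    examples = [("A", "T", 0.45), ("A", "T", 0.20), ("A", "G", 0.45), ("C", "G", 0.005)]
    for a1, a2, freq in examples:
        print(a1, a2, freq, keep_variant(a1, a2, freq))
```

The reason for the 0.4 cut-off is that allele frequency is the usual tie-breaker for A/T and C/G SNPs: once both alleles sit near 50%, comparing frequencies against the reference no longer indicates which strand was genotyped.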

Note

GenotypeQCtoHRC working directory

The GenotypeQCtoHRC package uses the R `targets` package internally and writes all outputs relative to its own repository root. The genotype_qc2hrc rule therefore changes directory into the local clone before calling GenotypeQCtoHRC.R:

(cd {params.workdir} && Rscript GenotypeQCtoHRC.R \
  --file {params.prefix_in} \
  --gh TRUE --gh-ref TopMed \
  --lo TRUE --lo-in 37 --lo-out 38 \
  --clean TRUE)

Clone the GenotypeQCtoHRC repository into results/03GENOTYPES-PRE/GenotypeQCtoHRC/ before running this pipeline. See the GenotypeQCtoHRC documentation for required reference files.
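
For readers who want to test the invocation outside the workflow, the sketch below assembles the same command in Python and runs it with the clone as the working directory. The flags are copied from the command above; the function names are hypothetical, and the sketch ignores the Singularity container that the rule runs in.

```python
import subprocess
from pathlib import Path


def build_qc2hrc_command(prefix_in):
    """Return the argument list for GenotypeQCtoHRC.R, mirroring the rule's shell command."""
    return [
        "Rscript", "GenotypeQCtoHRC.R",
        "--file", str(prefix_in),
        "--gh", "TRUE", "--gh-ref", "TopMed",
        "--lo", "TRUE", "--lo-in", "37", "--lo-out", "38",
        "--clean", "TRUE",
    ]


def run_qc2hrc(workdir, prefix_in):
    """Run the script from inside the clone; fail early if the clone is missing."""
    workdir = Path(workdir)
    if not (workdir / "GenotypeQCtoHRC.R").is_file():
        raise FileNotFoundError(
            f"GenotypeQCtoHRC.R not found in {workdir}; clone the repository first"
        )
    # cwd= plays the role of `cd {params.workdir}` in the shell version.
    return subprocess.run(build_qc2hrc_command(prefix_in), cwd=workdir, check=True)
```

As with the `cd` form, a relative input prefix is resolved against the clone directory, not the directory the workflow was launched from.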

Technical Requirements

Category	Detail
Primary tools	PLINK 1.9, PLINK 2.0, BCFtools 1.16
Container	`genotype-qc2hrc_latest.sif`, `seurat5f.sif` (R; for report)
Env modules	`plink/1.9`, `plink/2.0`, `bcftools/1.16.0`
Input	Raw PED/MAP genotype files (hg19)
Output	Per-chromosome hg38 VCFs ready for TOPMed imputation

Resource Profile

Rule	Threads	RAM	Walltime	Notes
Most rules	1	50 GB	1h	Minimal requirements
`genotype_qc2hrc`	10	100 GB	5h	Full ancestry QC + liftover
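
One way to express this profile in code is a default entry plus per-rule overrides, sketched below. The key names (`threads`, `mem_gb`, `walltime_h`) and the helper `resources_for` are assumptions for illustration, not the workflow's actual configuration; since the table says "most rules", other rules may also deviate from the default.

```python
# Default profile, taken from the "Most rules" row of the table.
DEFAULT_RESOURCES = {"threads": 1, "mem_gb": 50, "walltime_h": 1}

# Per-rule overrides; only genotype_qc2hrc is listed in the table.
RULE_OVERRIDES = {
    "genotype_qc2hrc": {"threads": 10, "mem_gb": 100, "walltime_h": 5},
}


def resources_for(rule):
    """Merge the default profile with any override defined for `rule`."""
    return {**DEFAULT_RESOURCES, **RULE_OVERRIDES.get(rule, {})}


if __name__ == "__main__":
    for rule in ("convert_raw", "genotype_qc2hrc"):
        print(rule, resources_for(rule))
```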