```mermaid
graph TD
classDef snazzy fill:#f1f1f1,stroke:#333,stroke-width:2px,color:#000;
classDef out fill:#e1f5fe,stroke:#01579b,color:#000;
A[Raw PED/MAP hg19] --> B(convert_raw)
B --> C(add_sex_to_fam)
C --> D(rm_dup_sample)
D --> E(check_het)
D --> F(make_kgp3_pgen)
F --> G(genotype_qc2hrc)
D --> G
G --> H(cat_genotypes)
H --> I(rm_maf_ambig)
I --> J(rm_rare_snps)
J --> K(split_chrs)
K --> L(gather_stats)
L --> M(geno_pre_report)
K --> N[(Per-chr VCFs for TOPMED)]
class A,B,C,D,E,F,G,H,I,J,K,L,M snazzy
class N out
```
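The rule dependencies in the diagram can also be written down as data, which makes the execution order easy to check. The following is a minimal sketch, not part of the pipeline: the `DEPENDS_ON` mapping is transcribed by hand from the diagram edges, and `execution_order` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# Each rule maps to the rules it depends on (transcribed from the diagram).
DEPENDS_ON = {
    "convert_raw": [],
    "add_sex_to_fam": ["convert_raw"],
    "rm_dup_sample": ["add_sex_to_fam"],
    "check_het": ["rm_dup_sample"],
    "make_kgp3_pgen": ["rm_dup_sample"],
    "genotype_qc2hrc": ["rm_dup_sample", "make_kgp3_pgen"],
    "cat_genotypes": ["genotype_qc2hrc"],
    "rm_maf_ambig": ["cat_genotypes"],
    "rm_rare_snps": ["rm_maf_ambig"],
    "split_chrs": ["rm_rare_snps"],
    "gather_stats": ["split_chrs"],
    "geno_pre_report": ["gather_stats"],
}


def execution_order(depends_on):
    """Return one valid order in which every rule runs after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, rule in enumerate(execution_order(DEPENDS_ON), start=1):
        print(step, rule)
```

Several orders are valid, because `check_het` is a side branch that only needs `rm_dup_sample` to have finished.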
03 · Genotype QC pre-imputation
Overview
This pipeline prepares raw SNP array genotyping data (PLINK PED/MAP format, hg19) for submission to the TOPMed Imputation Server. It covers format conversion, sample-level QC (sex assignment, heterozygosity check, duplicate removal), SNP-level QC (strand ambiguity, MAF filtering), ancestry inference using 1000 Genomes Project reference data, coordinate liftover to hg38, and chromosome-split VCF output ready for imputation upload.
Workflow Architecture
The workflow is built around the GenotypeQCtoHRC R suite. Its rules are:
- `convert_raw`: Converts PED/MAP to PLINK binary format (BED/BIM/FAM) and generates an allele frequency report.
- `add_sex_to_fam`: Standardises the FAM file and updates sex annotations from a phenotype sheet.
- `rm_dup_sample`: Removes a known duplicate sample (ID 14493) identified during QC.
- `check_het`: Computes autosomal heterozygosity on LD-pruned SNPs for outlier detection.
- `make_kgp3_pgen`: Prepares the 1000 Genomes Project Phase 3 reference panel in PLINK2 PGEN format for ancestry inference.
- `genotype_qc2hrc`: Runs the GenotypeQCtoHRC R pipeline (Singularity container). Performs comprehensive sample and SNP QC, ancestry inference, strand alignment to the TOPMed reference, and liftover from hg19 → hg38. Outputs per-chromosome VCFs.
- `cat_genotypes`: Concatenates the 22 per-chromosome VCFs from GenotypeQCtoHRC into a single genome-wide VCF.
- `rm_maf_ambig`: Removes strand-ambiguous SNPs (A/T or C/G) with MAF > 0.4, where strand cannot be reliably resolved.
- `rm_rare_snps`: Removes SNPs with MAF < 0.01.
- `split_chrs`: Converts to VCF and splits by chromosome (chr1–22) for TOPMed upload.
- `gather_stats`: Counts SNPs per chromosome at the GenotypeQCtoHRC and final stages for QC reporting.
- `geno_pre_report`: Renders an RMarkdown HTML QC report summarising all filtering steps.
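To make the two MAF-based filters concrete, here is a minimal Python sketch of the logic behind `rm_maf_ambig` and `rm_rare_snps`. It is an illustration only: the pipeline itself runs these steps on PLINK/VCF files, and the `Snp` record and function names are hypothetical. Only the thresholds (0.4 and 0.01) come from the rule descriptions.

```python
from dataclasses import dataclass

# A/T and C/G SNPs look identical on both strands, so strand cannot be
# inferred from the alleles alone.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


@dataclass(frozen=True)
class Snp:
    """Hypothetical in-memory SNP record used only for this sketch."""
    snp_id: str
    a1: str
    a2: str
    maf: float


def is_strand_ambiguous(snp):
    """True for A/T or C/G SNPs, regardless of allele order or case."""
    return frozenset((snp.a1.upper(), snp.a2.upper())) in AMBIGUOUS_PAIRS


def keep_snp(snp, ambig_maf=0.4, min_maf=0.01):
    """Apply both filters: ambiguous SNPs with MAF > 0.4, then MAF < 0.01."""
    if is_strand_ambiguous(snp) and snp.maf > ambig_maf:
        return False  # strand cannot be resolved from allele frequency
    if snp.maf < min_maf:
        return False  # rare SNP
    return True


if __name__ == "__main__":
    snps = [
        Snp("rs1", "A", "T", 0.45),   # ambiguous, MAF near 0.5
        Snp("rs2", "A", "T", 0.20),   # ambiguous, but MAF is informative
        Snp("rs3", "A", "G", 0.45),   # not ambiguous
        Snp("rs4", "C", "T", 0.005),  # rare
    ]
    print([s.snp_id for s in snps if keep_snp(s)])
```

Ambiguous SNPs with a lower MAF are kept because their strand can still be inferred by comparing allele frequencies against the reference panel; near MAF 0.5 that comparison stops being informative.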
GenotypeQCtoHRC working directory
The GenotypeQCtoHRC package uses the R `targets` package internally and writes all outputs relative to its own repository root. The `genotype_qc2hrc` rule therefore changes directory (`cd`) into the local clone before calling `GenotypeQCtoHRC.R`:
```bash
(cd {params.workdir} && Rscript GenotypeQCtoHRC.R \
    --file {params.prefix_in} \
    --gh TRUE --gh-ref TopMed \
    --lo TRUE --lo-in 37 --lo-out 38 \
    --clean TRUE)
```

Clone the GenotypeQCtoHRC repository into results/03GENOTYPES-PRE/GenotypeQCtoHRC/ before running this pipeline. See the GenotypeQCtoHRC documentation for required reference files.
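For readers calling the script outside Snakemake, the same working-directory behaviour can be sketched in Python with `subprocess.run(..., cwd=...)`. This is an assumption-laden illustration, not the rule's actual implementation: `build_qc_command` and `run_qc` are hypothetical helpers, and the flags are copied verbatim from the command above.

```python
import subprocess
from pathlib import Path


def build_qc_command(prefix_in):
    """Return the argument list mirroring the flags used by genotype_qc2hrc."""
    return [
        "Rscript", "GenotypeQCtoHRC.R",
        "--file", str(prefix_in),
        "--gh", "TRUE", "--gh-ref", "TopMed",
        "--lo", "TRUE", "--lo-in", "37", "--lo-out", "38",
        "--clean", "TRUE",
    ]


def run_qc(workdir, prefix_in, dry_run=False):
    """Run GenotypeQCtoHRC.R from inside its own clone (equivalent to the cd)."""
    workdir = Path(workdir)
    cmd = build_qc_command(prefix_in)
    if dry_run:
        return cmd  # only show what would be executed
    if not (workdir / "GenotypeQCtoHRC.R").is_file():
        raise FileNotFoundError(f"GenotypeQCtoHRC.R not found in {workdir}")
    # cwd= makes every relative path in the R pipeline resolve against the clone.
    subprocess.run(cmd, cwd=workdir, check=True)
    return cmd


if __name__ == "__main__":
    # "path/to/prefix" is a placeholder, not a real pipeline path.
    print(" ".join(run_qc("results/03GENOTYPES-PRE/GenotypeQCtoHRC",
                          "path/to/prefix", dry_run=True)))
```

Because the process runs inside the clone, a relative input prefix is resolved against the clone as well; passing an absolute path avoids surprises.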
Technical Requirements
| Category | Detail |
|---|---|
| Primary tools | PLINK 1.9, PLINK 2.0, BCFtools 1.16 |
| Container | genotype-qc2hrc_latest.sif, seurat5f.sif (R; for report) |
| Env modules | plink/1.9, plink/2.0, bcftools/1.16.0 |
| Input | Raw PED/MAP genotype files (hg19) |
| Output | Per-chromosome hg38 VCFs ready for TOPMed imputation |
Resource Profile
| Rule | Threads | RAM | Walltime | Notes |
|---|---|---|---|---|
| Most rules | 1 | 50 GB | 1h | Minimal requirements |
| genotype_qc2hrc | 10 | 100 GB | 5h | Full ancestry QC + liftover |
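The profile above amounts to one default with a single per-rule override. A minimal sketch of that lookup in Python follows; the key names (`threads`, `mem_gb`, `walltime_h`) and the `resources_for` helper are hypothetical and do not reflect how the workflow actually declares its resources.

```python
# Values taken from the Resource Profile table; key names are illustrative.
DEFAULT_RESOURCES = {"threads": 1, "mem_gb": 50, "walltime_h": 1}

RULE_OVERRIDES = {
    "genotype_qc2hrc": {"threads": 10, "mem_gb": 100, "walltime_h": 5},
}


def resources_for(rule):
    """Return the defaults, updated with any override for the given rule."""
    return {**DEFAULT_RESOURCES, **RULE_OVERRIDES.get(rule, {})}


if __name__ == "__main__":
    for rule in ("convert_raw", "genotype_qc2hrc"):
        print(rule, resources_for(rule))
```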