04 · Genotype QC (Post-imputation)

Overview

This pipeline processes the imputed VCF files returned by the TOPMed Imputation Server into a final analysis-ready genotype dataset. It handles password-protected zip extraction, imputation quality assessment, rsID annotation from dbSNP, multi-chromosome concatenation, quality filtering, reference validation, and PCA-based genotype covariate generation for TensorQTL.

Workflow Logic

graph TD
    classDef snazzy fill:#f1f1f1,stroke:#333,stroke-width:2px,color:#000;
    classDef out fill:#e1f5fe,stroke:#01579b,color:#000;

    A[TOPMed imputed ZIPs chr1–22+X] --> B(impute_check)
    B --> C(impute_check_cat)
    D[dbSNP build 156 VCF] --> E(dwnld_dbsnp_ref)
    B --> F(add_rsID)
    E --> F
    F --> G(vcf_cat)
    G --> H(filter_tags)
    H --> I(check_VCF)
    I --> J(exclude_SNPs)
    J --> K(idx_vcf)
    K --> L(get_sample_list)
    J --> M(vcf_to_plink)
    M --> N(get_ld_pruned_snps)
    N --> O(prune_genotypes)
    O --> P(calc_genotype_pcs)
    C --> Q(create_combined_log)
    G --> Q
    H --> Q
    I --> Q
    J --> Q
    P --> R[(Final VCF + PLINK + PCA covariates)]

    class A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q snazzy
    class R out

impute_check: Extracts per-chromosome imputed VCFs from password-protected TOPMed zip files, parses the .info files, and summarises SNP counts by MAF (< / ≥ 0.05) and imputation R² (< / ≥ 0.8).
impute_check_cat: Concatenates per-chromosome imputation QC summaries into a single TSV.
dwnld_dbsnp_ref: Downloads the dbSNP build 156 VCF (hg38) for rsID annotation.
add_rsID: Annotates each imputed chromosome VCF with dbSNP rsIDs using bcftools annotate.
vcf_cat: Concatenates all rsID-annotated per-chromosome VCFs into a genome-wide VCF.
filter_tags: Computes HWE p-values with bcftools +fill-tags, then retains SNPs with MAF ≥ 0.05, R² ≥ 0.8, and HWE p-value ≥ 0.0001.
check_VCF: Validates the filtered VCF against the hg38 reference using checkVCF.py, generating an exclusion list of problematic variants.
exclude_SNPs: Removes variants flagged by check_VCF.
idx_vcf: Creates a tabix index on the final VCF.
get_sample_list: Extracts the final list of sample IDs from the VCF.
vcf_to_plink: Converts the final VCF to PLINK binary format.
get_ld_pruned_snps: Performs LD pruning (window 250 SNPs, step 5, r² threshold 0.2) for PCA.
prune_genotypes: Extracts the LD-pruned SNP set into a new PLINK binary file.
calc_genotype_pcs: Computes the top 10 genotype principal components. The first four are used as covariates in TensorQTL.
create_combined_log: Aggregates per-step SNP counts and QC summaries into a single combined processing log.
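
The vcf_to_plink through calc_genotype_pcs rules map naturally onto a short chain of PLINK calls. The sketch below only builds the argument lists and does not run anything. It assumes PLINK 1.9 syntax, and the function name and file prefixes (`chrALL_final`, `ld_pruned`, `pca`) are hypothetical placeholders rather than the pipeline's actual paths; the real rules may use different flags, or PLINK 2.0 for some steps.

```python
"""Sketch: PLINK 1.9 argument lists for the LD-pruning and PCA steps."""

# Parameters taken from the rule descriptions above.
LD_WINDOW_SNPS = 250
LD_STEP = 5
LD_R2 = 0.2
N_PCS = 10


def plink_pca_commands(vcf, prefix="chrALL_final", prune_out="ld_pruned", pca_out="pca"):
    """Return one argv list per rule (all file names are hypothetical)."""
    pruned_bfile = f"{prefix}.pruned"
    return [
        # vcf_to_plink: VCF -> .bed/.bim/.fam
        ["plink", "--vcf", vcf, "--make-bed", "--out", prefix],
        # get_ld_pruned_snps: writes <prune_out>.prune.in and <prune_out>.prune.out
        ["plink", "--bfile", prefix,
         "--indep-pairwise", str(LD_WINDOW_SNPS), str(LD_STEP), str(LD_R2),
         "--out", prune_out],
        # prune_genotypes: keep only the SNPs listed in .prune.in
        ["plink", "--bfile", prefix,
         "--extract", f"{prune_out}.prune.in",
         "--make-bed", "--out", pruned_bfile],
        # calc_genotype_pcs: writes <pca_out>.eigenvec and <pca_out>.eigenval
        ["plink", "--bfile", pruned_bfile, "--pca", str(N_PCS), "--out", pca_out],
    ]


if __name__ == "__main__":
    for cmd in plink_pca_commands("chrALL_final.filt.vcf.gz"):
        print(" ".join(cmd))
```

Keeping the commands as lists rather than shell strings makes them easy to pass to `subprocess.run` without quoting problems, should one want to execute them outside the workflow manager.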

Filtering Thresholds

Filter	Threshold	Tool
MAF	≥ 0.05	BCFtools
Imputation R²	≥ 0.8	BCFtools
HWE p-value	≥ 0.0001	BCFtools +fill-tags
Reference validation	checkVCF exclusion list	checkVCF.py
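
Read as a predicate, the three numeric rows of this table say that a variant is kept only if it meets every threshold. A minimal Python sketch of that logic, together with an equivalent bcftools filter expression, might look like the following. The INFO tag names (`MAF`, `R2`, `HWE`) are assumptions: `+fill-tags` can write `MAF` and `HWE`, and `R2` is the tag commonly written by Minimac-based imputation, but the pipeline's actual expression is not shown here. The function names are illustrative.

```python
"""Sketch: the post-imputation filter thresholds as a predicate."""

# Thresholds from the table above.
MIN_MAF = 0.05
MIN_R2 = 0.8
MIN_HWE_P = 0.0001


def passes_filters(maf, r2, hwe_p):
    """True if a variant meets all three thresholds (each bound is inclusive)."""
    return maf >= MIN_MAF and r2 >= MIN_R2 and hwe_p >= MIN_HWE_P


def bcftools_include_expr():
    """Build an inclusion expression; the INFO tag names are assumed."""
    return (
        f"INFO/MAF>={MIN_MAF} && "
        f"INFO/R2>={MIN_R2} && "
        f"INFO/HWE>={MIN_HWE_P}"
    )


if __name__ == "__main__":
    print(passes_filters(maf=0.12, r2=0.95, hwe_p=0.3))
    print(bcftools_include_expr())
```

An expression of this form would be passed to `bcftools view -i` after the tags have been filled.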

Technical Requirements

Category	Detail
Primary tools	BCFtools, PLINK 1.9, PLINK 2.0, tabix, samtools
Env modules	`bcftools`, `plink/1.9`, `samtools`
Reference	dbSNP build 156 (hg38); GRCh38 primary assembly FASTA
Input	TOPMed imputed zip archives (chr1–22, X)
Output	`chrALL_final.filt.vcf.gz` (indexed); PLINK binary set; `pca.eigenvec`
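
Because the first four PCs in `pca.eigenvec` are used as TensorQTL covariates, the sketch below shows one way to reshape them. It assumes the PLINK 1.9 `.eigenvec` layout (whitespace-delimited, no header, columns FID, IID, PC1, PC2, ...) and a tab-delimited covariates file with covariates as rows and samples as columns. The function name, the `id` header cell, the `PC1`... row labels, and the use of IID as the sample ID are illustrative assumptions, not the pipeline's confirmed behaviour.

```python
"""Sketch: reshape a PLINK 1.9 .eigenvec into a covariates-by-samples table."""


def eigenvec_to_covariates(eigenvec_text, n_pcs=4):
    """Return a tab-delimited table: one row per PC, one column per sample."""
    samples, pcs = [], []
    for line in eigenvec_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) < 2 + n_pcs:
            raise ValueError(f"expected at least {n_pcs} PCs, got {len(fields) - 2}")
        samples.append(fields[1])          # IID column
        pcs.append(fields[2:2 + n_pcs])    # first n_pcs components
    rows = ["\t".join(["id"] + samples)]
    for k in range(n_pcs):
        rows.append("\t".join([f"PC{k + 1}"] + [pc[k] for pc in pcs]))
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    example = (
        "F1 S1 0.1 0.2 0.3 0.4 0.5\n"
        "F2 S2 -0.1 -0.2 -0.3 -0.4 -0.5\n"
    )
    print(eigenvec_to_covariates(example), end="")
```

The values are passed through as strings, so no precision is lost in the reshaping; only the orientation changes.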

Resource Profile

Most rules are I/O-bound and run with the default resources (1 thread, 5 GB RAM). The filter_tags and vcf_cat rules operate on the genome-wide VCF and may need up to 20 GB RAM, depending on chromosome X size.