02 · scRNAseq analysis


Overview

This pipeline turns raw per-plate digital gene expression (DGE) matrices in H5AD format into biologically annotated cell clusters and pseudobulk expression matrices ready for eQTL mapping. The analysis is implemented as a series of Jupyter notebooks executed non-interactively by [Papermill](https://papermill.readthedocs.io); each rule produces a rendered HTML report alongside the executed notebook.


Workflow Logic

graph TD
    classDef snazzy fill:#f1f1f1,stroke:#333,stroke-width:2px,color:#000;
    classDef out fill:#e1f5fe,stroke:#01579b,color:#000;

    A[(Plate 1–3 H5AD)] --> B(scanpy_qc)
    B --> C(scanpy_clustering)
    C --> D(scanpy_subclustering)
    C --> E(scanpy_annotation)
    C --> F(scanpy_cell_label_and_pseudobulk)

    class A,B,C,D,E,F snazzy

  1. scanpy_qc: Per-plate quality control. Filters cells on three thresholds: mitochondrial read fraction, minimum gene count, and UMI count. Runs in parallel across the three plates.
  2. scanpy_clustering: Iterative dimensionality reduction and clustering over three sequential rounds. Each round performs PCA, UMAP, and Leiden graph clustering at increasing resolution, with manual inspection between rounds to remove low-quality cells or clusters.
  3. scanpy_subclustering: Generates fine-grained subclusters for the four most abundant level-1 populations (Glu-UL, Glu-DL, GABA, NPC), producing the 12 subtype labels used for eQTL mapping.
  4. scanpy_annotation: Cell-type labelling using canonical marker gene expression.
  5. scanpy_cell_label_and_pseudobulk: Transfers final cell labels, then sum-aggregates single-cell counts per donor per cell type to produce pseudobulk matrices for TensorQTL.
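
The three QC filters named in step 1 can be sketched as a boolean mask over cells. This is a minimal NumPy illustration, not the notebook's code: the function name `qc_mask` and every threshold value are assumptions chosen for the toy data, and the real pipeline runs on AnnData objects through Scanpy.

```python
import numpy as np

def qc_mask(counts, is_mito, max_mito_frac, min_genes, min_umis):
    """Return a boolean mask of cells that pass all three QC thresholds.

    counts  : (n_cells, n_genes) array of raw UMI counts
    is_mito : (n_genes,) boolean array marking mitochondrial genes
    Threshold values are supplied by the caller; none are the pipeline's own.
    """
    counts = np.asarray(counts)
    is_mito = np.asarray(is_mito, dtype=bool)

    total_umis = counts.sum(axis=1)            # UMIs per cell
    n_genes = (counts > 0).sum(axis=1)         # detected genes per cell
    mito_umis = counts[:, is_mito].sum(axis=1)

    # Empty cells get a fraction of 0 here; they fail the UMI filter anyway.
    mito_frac = np.divide(
        mito_umis, total_umis,
        out=np.zeros(len(total_umis), dtype=float),
        where=total_umis > 0,
    )
    return (mito_frac <= max_mito_frac) & (n_genes >= min_genes) & (total_umis >= min_umis)

# Toy data: 3 cells x 4 genes, last gene is mitochondrial (hypothetical values).
toy_counts = np.array([[5, 3, 0, 2],
                       [1, 0, 0, 9],
                       [0, 0, 0, 1]])
toy_is_mito = np.array([False, False, False, True])
keep = qc_mask(toy_counts, toy_is_mito, max_mito_frac=0.25, min_genes=2, min_umis=5)
print(keep)
```

Only the first toy cell survives: the second is dominated by mitochondrial reads and the third has too few UMIs and genes.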

Note

Notebook execution pattern

Every rule uses Papermill to execute its notebook non-interactively, passing the plate as a notebook parameter. The executed notebook is then converted to HTML with jupyter nbconvert for QC review:

papermill {input.nb} {params.nb_out} -p plate {wildcards.plate}
jupyter nbconvert --to html {params.nb_out} --output {params.html_out}
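
To make the placeholder expansion concrete, the sketch below builds the same two command lines for each plate in Python. The notebook filename, plate names, and output naming scheme are hypothetical and stand in for whatever the workflow's `input` and `params` resolve to; only the command-line flags are taken from the commands above.

```python
import shlex

def notebook_commands(notebook, plate, out_dir="results"):
    """Build the Papermill and nbconvert command lines for one plate."""
    nb_out = f"{out_dir}/{plate}.executed.ipynb"  # hypothetical naming scheme
    html_out = f"{plate}.html"                    # hypothetical naming scheme
    papermill_cmd = ["papermill", notebook, nb_out, "-p", "plate", plate]
    nbconvert_cmd = ["jupyter", "nbconvert", "--to", "html", nb_out,
                     "--output", html_out]
    return papermill_cmd, nbconvert_cmd

# Print the commands for three hypothetical plate identifiers.
for plate in ["plate1", "plate2", "plate3"]:
    for cmd in notebook_commands("scanpy_qc.ipynb", plate):
        print(shlex.join(cmd))
```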

Technical Requirements

| Category | Detail |
| --- | --- |
| Environment | eqtl_study conda environment (Scanpy 1.10, Papermill, Jupyter) |
| Container | None (conda only for this pipeline) |
| Key packages | scanpy, anndata, leidenalg, umap-learn, scrublet |
| Input | Per-plate H5AD (01PARSE/combine_{plate}/all-sample/DGE_filtered/anndata.h5ad) |
| Output | Annotated AnnData object; per-cell-type pseudobulk BED files |
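
The sum aggregation behind the pseudobulk output can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the notebook's implementation: the function name `pseudobulk_sum`, the `donor|cell_type` key format, and the toy donors are all hypothetical, and the conversion to BED is not shown.

```python
import numpy as np

def pseudobulk_sum(counts, donors, cell_types):
    """Sum raw counts over all cells sharing a (donor, cell type) pair.

    counts     : (n_cells, n_genes) array of raw counts
    donors     : length n_cells sequence of donor IDs
    cell_types : length n_cells sequence of cell-type labels
    Returns the sorted group keys and a (n_groups, n_genes) matrix of sums.
    """
    counts = np.asarray(counts)
    keys = np.array([f"{d}|{c}" for d, c in zip(donors, cell_types)])
    groups, inverse = np.unique(keys, return_inverse=True)
    out = np.zeros((len(groups), counts.shape[1]), dtype=counts.dtype)
    np.add.at(out, inverse, counts)  # unbuffered add: one row per cell into its group
    return groups, out

# Toy data: 4 cells x 2 genes from two hypothetical donors.
toy_counts = np.array([[1, 0], [2, 1], [0, 3], [4, 4]])
toy_donors = ["d1", "d1", "d2", "d1"]
toy_types = ["GABA", "GABA", "GABA", "NPC"]
groups, bulk = pseudobulk_sum(toy_counts, toy_donors, toy_types)
for key, row in zip(groups, bulk):
    print(key, row)
```

Summing raw counts, rather than averaging normalised values, keeps the per-donor library size information available to whatever normalisation is applied before QTL mapping.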

Cell Types Produced

| Level | Labels |
| --- | --- |
| Level 1 (broad) | Glu-UL, Glu-DL, NPC, GABA, Endo-Peri, OPC, MG |
| Level 2 (subtypes) | Glu-UL-0/1/2, Glu-DL-0/1/2, GABA-0/1/2, NPC-0/1/2 |

All 19 labels are propagated automatically through the downstream pipelines via the cell_types list in config/config.yaml.
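
The count of 19 follows from the two levels listed above: 7 broad labels plus 3 subclusters for each of 4 populations. The sketch below rebuilds the list to make that arithmetic explicit; the variable names are illustrative, and how `cell_types` is actually laid out in config/config.yaml is not shown here.

```python
# Level-1 labels as listed in the document.
LEVEL1 = ["Glu-UL", "Glu-DL", "NPC", "GABA", "Endo-Peri", "OPC", "MG"]

# The four populations that are subclustered, each into subclusters 0, 1, 2.
SUBCLUSTERED = ["Glu-UL", "Glu-DL", "GABA", "NPC"]
N_SUBCLUSTERS = 3

level2 = [f"{parent}-{i}" for parent in SUBCLUSTERED for i in range(N_SUBCLUSTERS)]
cell_types = LEVEL1 + level2

print(len(LEVEL1), len(level2), len(cell_types))  # 7 12 19
```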


Resource Profile

| Rule | Threads | RAM | Walltime | Notes |
| --- | --- | --- | --- | --- |
| scanpy_qc | 16 | 160 GB | 3h | Per-plate parallel execution |
| scanpy_clustering | 16 | 380 GB | 3d | Full dataset; iterative |
| scanpy_subclustering | 16 | 380 GB | 3d | Subsets of clustering output |
| scanpy_annotation | 10 | 200 GB | 3d | Interactive-style notebook |
| scanpy_cell_label_and_pseudobulk | 10 | 100 GB | 3d | Sum aggregation per donor/cell type |
