02 · scRNAseq analysis

Overview

This pipeline takes the raw per-plate DGE matrices (H5AD format) and produces biologically annotated cell clusters and pseudobulk expression matrices ready for eQTL mapping. The analysis is implemented as a series of Jupyter notebooks executed non-interactively by [Papermill](https://papermill.readthedocs.io); each rule produces a rendered HTML report alongside the executed notebook.

Workflow Logic

graph TD
    classDef snazzy fill:#f1f1f1,stroke:#333,stroke-width:2px,color:#000;
    classDef out fill:#e1f5fe,stroke:#01579b,color:#000;

    A[(Plate 1–3 H5AD)] --> B(scanpy_qc)
    B --> C(scanpy_clustering)
    C --> D(scanpy_subclustering)
    C --> E(scanpy_annotation)
    C --> F(scanpy_cell_label_and_pseudobulk)

    class A,B,C,D,E,F snazzy

scanpy_qc: Per-plate quality control. Filters cells on thresholds including mitochondrial read fraction, minimum gene count, and UMI count. Runs in parallel across the three plates.
scanpy_clustering: Iterative dimensionality reduction and clustering over three sequential rounds. Each round performs PCA, UMAP, and Leiden graph clustering at increasing resolution, with manual inspection between rounds to remove low-quality cells and clusters.
scanpy_subclustering: Generates fine-grained subclusters for the four most abundant level-1 populations (Glu-UL, Glu-DL, GABA, NPC), producing the 12 subtype labels used for eQTL mapping.
scanpy_annotation: Cell-type labelling using canonical marker gene expression.
scanpy_cell_label_and_pseudobulk: Transfers final cell labels, then sum-aggregates single-cell counts per donor per cell type to produce pseudobulk matrices for TensorQTL.
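
The sum aggregation in scanpy_cell_label_and_pseudobulk can be illustrated with a minimal, dependency-free sketch. The donor IDs, gene names, and counts below are invented, and the `pseudobulk_sum` helper is hypothetical; the actual notebook works on AnnData objects and may organise this step differently.

```python
from collections import defaultdict


def pseudobulk_sum(cells, genes):
    """Sum raw counts over all cells sharing a (donor, cell_type) pair.

    `cells` is an iterable of (donor, cell_type, counts) tuples, where
    `counts` is a list of integer counts in the same order as `genes`.
    Hypothetical helper for illustration only.
    """
    totals = defaultdict(lambda: [0] * len(genes))
    for donor, cell_type, counts in cells:
        if len(counts) != len(genes):
            raise ValueError("counts must have one entry per gene")
        group = totals[(donor, cell_type)]
        for i, value in enumerate(counts):
            group[i] += value
    # Convert to plain dicts keyed by gene name for readability.
    return {key: dict(zip(genes, vals)) for key, vals in totals.items()}


# Toy input: four cells, two donors, two cell types (all values invented).
genes = ["GENE_A", "GENE_B"]
cells = [
    ("donor1", "GABA", [1, 0]),
    ("donor1", "GABA", [2, 3]),
    ("donor1", "NPC", [0, 5]),
    ("donor2", "GABA", [4, 1]),
]

pseudobulk = pseudobulk_sum(cells, genes)
for (donor, cell_type), summed in sorted(pseudobulk.items()):
    print(donor, cell_type, summed)
```

Because raw counts are summed rather than averaged, each pseudobulk profile is still a count vector, so normalisation can be applied afterwards at the donor and cell-type level.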

Note

Notebook execution pattern

Every rule runs its notebook non-interactively with Papermill, passing the plate as a notebook parameter. The executed notebook is then converted to HTML with jupyter nbconvert for QC review:

papermill {input.nb} {params.nb_out} -p plate {wildcards.plate}
jupyter nbconvert --to html {params.nb_out} --output {params.html_out}
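
To reproduce a single run outside Snakemake, the two commands above can be assembled in Python. This is only a sketch: the `build_commands` helper and the example file names are hypothetical, and it uses only the flags already shown above.

```python
import shlex


def build_commands(nb, nb_out, html_out, plate):
    """Return the papermill and nbconvert argv lists for one plate."""
    papermill_cmd = ["papermill", nb, nb_out, "-p", "plate", str(plate)]
    nbconvert_cmd = [
        "jupyter", "nbconvert", "--to", "html", nb_out, "--output", html_out,
    ]
    return papermill_cmd, nbconvert_cmd


# Hypothetical paths; the real ones come from the Snakemake rule.
run_cmd, html_cmd = build_commands(
    nb="scanpy_qc.ipynb",
    nb_out="scanpy_qc_plate1.ipynb",
    html_out="scanpy_qc_plate1.html",
    plate="plate1",
)
print(shlex.join(run_cmd))
print(shlex.join(html_cmd))
# To execute them, pass each list to subprocess.run(..., check=True) in order.
```

Keeping the commands as argument lists avoids shell-quoting problems if a path contains spaces.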

Technical Requirements

Category	Detail
Environment	`eqtl_study` conda environment (Scanpy 1.10, Papermill, Jupyter)
Container	— (conda only for this pipeline)
Key packages	`scanpy`, `anndata`, `leidenalg`, `umap-learn`, `scrublet`
Input	Per-plate H5AD from `01PARSE/combine_{plate}/all-sample/DGE_filtered/anndata.h5ad`
Output	Annotated AnnData object; per-cell-type pseudobulk BED files

Cell Types Produced

Level	Labels
Level 1 (broad)	Glu-UL, Glu-DL, NPC, GABA, Endo-Peri, OPC, MG
Level 2 (subtypes)	Glu-UL-0/1/2, Glu-DL-0/1/2, GABA-0/1/2, NPC-0/1/2

All 19 labels (7 broad plus 12 subtypes) are propagated automatically to the downstream pipelines via the `cell_types` list in `config/config.yaml`.
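
The total of 19 follows from the table above: seven broad labels plus three subclusters for each of the four subclustered populations. The sketch below shows how such a list could be generated. The variable names are illustrative, and the real `cell_types` entry in `config/config.yaml` may be written out by hand or ordered differently.

```python
# Level-1 labels, as listed in the table above.
level1 = ["Glu-UL", "Glu-DL", "NPC", "GABA", "Endo-Peri", "OPC", "MG"]

# Populations that are subclustered, each into subclusters 0, 1 and 2.
subclustered = ["Glu-UL", "Glu-DL", "GABA", "NPC"]
level2 = [f"{parent}-{i}" for parent in subclustered for i in range(3)]

cell_types = level1 + level2
print(len(level1), len(level2), len(cell_types))  # 7 12 19
```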

Resource Profile

Rule	Threads	RAM	Walltime	Notes
`scanpy_qc`	16	160 GB	3h	Per-plate parallel execution
`scanpy_clustering`	16	380 GB	3d	Full dataset; iterative
`scanpy_subclustering`	16	380 GB	3d	Subsets of clustering output
`scanpy_annotation`	10	200 GB	3d	Interactive-style notebook
`scanpy_cell_label_and_pseudobulk`	10	100 GB	3d	Sum aggregation per donor/cell type