graph TD
classDef snazzy fill:#f1f1f1,stroke:#333,stroke-width:2px,color:#000;
classDef out fill:#e1f5fe,stroke:#01579b,color:#000;
A[(Plate 1–3 H5AD)] --> B(scanpy_qc)
B --> C(scanpy_clustering)
C --> D(scanpy_subclustering)
C --> E(scanpy_annotation)
C --> F(scanpy_cell_label_and_pseudobulk)
class A,B,C,D,E,F snazzy
02 · scRNAseq analysis
Overview
This pipeline converts raw per-plate DGE matrices (H5AD format) into biologically annotated cell clusters and pseudobulk expression matrices ready for eQTL mapping. The analysis is implemented as a series of Jupyter notebooks executed non-interactively by [Papermill](https://papermill.readthedocs.io); each rule produces a rendered HTML report alongside the executed notebook.
Workflow Logic
- scanpy_qc: Per-plate quality control. Filters cells by mitochondrial read fraction, minimum gene count, and UMI count thresholds. Runs in parallel across the three plates.
- scanpy_clustering: Iterative dimensionality reduction and clustering across three sequential rounds. Each round performs PCA, UMAP, and Leiden graph clustering at increasing resolution, with manual inspection between rounds to remove low-quality cells and clusters.
- scanpy_subclustering: Generates fine-grained subclusters for the four most abundant level-1 populations (Glu-UL, Glu-DL, GABA, NPC), producing the 12 subtype labels used for eQTL mapping.
- scanpy_annotation: Cell-type labelling using canonical marker gene expression.
- scanpy_cell_label_and_pseudobulk: Transfers final cell labels, then sum-aggregates single-cell counts per donor per cell type to produce pseudobulk matrices for TensorQTL.
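As a rough illustration of the filtering done in scanpy_qc, the sketch below computes the three per-cell metrics named above from a raw count matrix using NumPy and keeps the cells that pass all of them. The function name `qc_mask`, the threshold defaults, and the toy matrix are assumptions made for illustration; the actual cut-offs are set in the notebooks and are not stated here.

```python
import numpy as np

def qc_mask(counts, is_mito, max_mito_frac=0.1, min_genes=2, min_umis=5):
    """Return a boolean mask of cells passing all three QC filters.

    counts  : (n_cells, n_genes) array of raw UMI counts
    is_mito : (n_genes,) boolean array marking mitochondrial genes
    The threshold defaults are placeholders, not the pipeline's values.
    """
    counts = np.asarray(counts)
    is_mito = np.asarray(is_mito, dtype=bool)
    umis = counts.sum(axis=1)               # total UMIs per cell
    genes = (counts > 0).sum(axis=1)        # genes detected per cell
    mito = counts[:, is_mito].sum(axis=1)   # mitochondrial UMIs per cell
    # Cells with zero UMIs get a mitochondrial fraction of 0 instead of NaN.
    mito_frac = np.divide(mito, umis, out=np.zeros(len(umis)), where=umis > 0)
    return (mito_frac <= max_mito_frac) & (genes >= min_genes) & (umis >= min_umis)

# Toy data: 4 cells x 4 genes; the last gene is mitochondrial.
toy = np.array([
    [5, 3, 2, 0],   # passes all filters
    [1, 0, 0, 9],   # fails: high mitochondrial fraction
    [0, 0, 1, 0],   # fails: too few genes and UMIs
    [4, 4, 4, 1],   # passes (mitochondrial fraction 1/13)
])
mito_genes = np.array([False, False, False, True])
keep = qc_mask(toy, mito_genes)
print(keep)
```

In a Scanpy notebook the same metrics would normally come from the AnnData object rather than a bare array; the array version is used here only to keep the example self-contained.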
Note
Notebook execution pattern
Every rule uses Papermill to execute its notebook non-interactively, passing the plate as a notebook parameter. The executed notebook is then converted to HTML by jupyter nbconvert for QC review:
papermill {input.nb} {params.nb_out} -p plate {wildcards.plate}
jupyter nbconvert --to html {params.nb_out} --output {params.html_out}
Technical Requirements
| Category | Detail |
|---|---|
| Environment | eqtl_study conda (Scanpy 1.10, Papermill, Jupyter) |
| Container | — (conda only for this pipeline) |
| Key packages | scanpy, anndata, leidenalg, umap-learn, scrublet |
| Input | Per-plate H5AD from 01PARSE/combine_{plate}/all-sample/DGE_filtered/anndata.h5ad |
| Output | Annotated AnnData object; per-cell-type pseudobulk BED files |
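To make the notebook execution pattern from the note above more concrete, the following sketch fills in the two command templates for each plate. The helper `notebook_commands`, the plate names, and the notebook and HTML file names are hypothetical; in the pipeline these values are supplied by the workflow's inputs, params, and wildcards.

```python
import shlex

def notebook_commands(nb_in, nb_out, html_out, plate):
    """Build the Papermill and nbconvert commands for one plate."""
    run = ["papermill", nb_in, nb_out, "-p", "plate", plate]
    render = ["jupyter", "nbconvert", "--to", "html", nb_out, "--output", html_out]
    # shlex.join quotes arguments only where the shell would need it.
    return [shlex.join(cmd) for cmd in (run, render)]

# Hypothetical plate names and file names, for illustration only.
for plate in ["plate1", "plate2", "plate3"]:
    commands = notebook_commands(
        "scanpy_qc.ipynb",
        f"scanpy_qc_{plate}.ipynb",
        f"scanpy_qc_{plate}.html",
        plate,
    )
    for cmd in commands:
        print(cmd)
```

Building the commands as argument lists and joining them at the end keeps file names with spaces or special characters safe to pass to a shell.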
Cell Types Produced
| Level | Labels |
|---|---|
| Level 1 (broad) | Glu-UL, Glu-DL, NPC, GABA, Endo-Peri, OPC, MG |
| Level 2 (subtypes) | Glu-UL-0/1/2, Glu-DL-0/1/2, GABA-0/1/2, NPC-0/1/2 |
All 19 labels are propagated automatically through the downstream pipelines via the cell_types list in config/config.yaml.
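As a quick sanity check on the label count, the sketch below enumerates the labels from the table above: seven level-1 labels plus three subclusters for each of the four subclustered populations. The variable names are assumptions, and the exact layout of the cell_types entry in config/config.yaml is not shown in this document, so this only reproduces the label set, not the config format.

```python
# Level-1 labels, as listed in the table above.
LEVEL1 = ["Glu-UL", "Glu-DL", "NPC", "GABA", "Endo-Peri", "OPC", "MG"]

# Populations that are subclustered, each into subclusters 0, 1 and 2.
SUBCLUSTERED = ["Glu-UL", "Glu-DL", "GABA", "NPC"]
N_SUBCLUSTERS = 3

level2 = [f"{parent}-{i}" for parent in SUBCLUSTERED for i in range(N_SUBCLUSTERS)]
cell_types = LEVEL1 + level2

print(len(level2), len(cell_types))
```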
Resource Profile
| Rule | Threads | RAM | Walltime | Notes |
|---|---|---|---|---|
| scanpy_qc | 16 | 160 GB | 3h | Per-plate parallel execution |
| scanpy_clustering | 16 | 380 GB | 3d | Full dataset; iterative |
| scanpy_subclustering | 16 | 380 GB | 3d | Subsets of clustering output |
| scanpy_annotation | 10 | 200 GB | 3d | Interactive-style notebook |
| scanpy_cell_label_and_pseudobulk | 10 | 100 GB | 3d | Sum aggregation per donor/cell type |
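The sum aggregation performed by scanpy_cell_label_and_pseudobulk can be sketched with a pandas group-by, shown here as a minimal example rather than the notebook's actual code. The function name `pseudobulk_sum`, the gene names, donor IDs, and toy counts are all hypothetical; writing the result out as BED files for TensorQTL is not shown.

```python
import pandas as pd

def pseudobulk_sum(counts, donor, cell_type):
    """Sum raw counts over all cells sharing a (cell type, donor) pair.

    counts    : DataFrame of raw counts, cells x genes
    donor     : per-cell donor IDs, in the same order as the rows of counts
    cell_type : per-cell labels, in the same order as the rows of counts
    Returns a DataFrame indexed by (cell_type, donor), one column per gene.
    """
    keys = [
        pd.Series(list(cell_type), index=counts.index, name="cell_type"),
        pd.Series(list(donor), index=counts.index, name="donor"),
    ]
    return counts.groupby(keys).sum()

# Toy data: 4 cells, 2 genes, 2 donors, 2 cell types.
toy = pd.DataFrame(
    {"GENE_A": [1, 2, 3, 4], "GENE_B": [0, 1, 0, 5]},
    index=["cell1", "cell2", "cell3", "cell4"],
)
donors = ["d1", "d1", "d2", "d1"]
labels = ["GABA", "GABA", "GABA", "NPC"]

pb = pseudobulk_sum(toy, donors, labels)
print(pb)
```

Summing raw counts, rather than averaging normalised values, keeps each pseudobulk sample on a count scale, so normalisation can be applied once per cell type downstream.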