graph TD
classDef snazzy fill:#f1f1f1,stroke:#333,stroke-width:2px,color:#000;
classDef out fill:#e1f5fe,stroke:#01579b,color:#000;
A[ENSEMBL hg38 FA + GTF] --> B(get_refs)
B --> C(mk_ref)
D[samples_plateN.json] --> E(cat_fqs)
C --> F(run_parse)
E --> F
F --> G(run_parse_combine)
G --> H[(Plate DGE Matrix)]
class A,B,C,D,E,F,G snazzy
class H out
01 · Parse Alignment
Overview
This pipeline processes raw multiplexed FASTQ libraries from the Parse Biosciences SPLiT-seq protocol into per-plate Digital Gene Expression (DGE) matrices. It handles reference building, multi-lane FASTQ concatenation, per-sublibrary alignment and barcode demultiplexing, and final plate-level combination.
- get_refs: Downloads the Ensembl GRCh38 release 113 FASTA and GTF from the Ensembl FTP.
- mk_ref: Builds a split-pipe genome index from the reference files.
- cat_fqs: Concatenates per-lane FASTQ files for each sublibrary into single R1 and R2 files.
- run_parse: Runs split-pipe --mode all on each sublibrary: barcode error correction, read trimming, genome alignment, and DGE generation.
- run_parse_combine: Aggregates all sublibraries from a plate into a unified plate-level DGE matrix.
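The dependencies between these rules, as drawn in the workflow diagram, can be sketched as a small graph in plain Python. This is only an illustration of the ordering constraints, not how Snakemake resolves the DAG internally; the `DEPENDENCIES` mapping and `execution_order` helper are names invented for this example.

```python
from graphlib import TopologicalSorter

# Each rule maps to the set of rules whose outputs it consumes,
# transcribed from the workflow diagram.
DEPENDENCIES = {
    "get_refs": set(),
    "mk_ref": {"get_refs"},
    "cat_fqs": set(),
    "run_parse": {"mk_ref", "cat_fqs"},
    "run_parse_combine": {"run_parse"},
}


def execution_order(deps):
    """Return one valid order in which the rules could run."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(DEPENDENCIES)
print(order)
```

Any order this returns places mk_ref and cat_fqs before run_parse, and run_parse before run_parse_combine; the reference branch and the FASTQ branch are otherwise independent of each other.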
Environment clash workaround
There is an irresolvable conflict between the Snakemake YAML parser and the spipe-1.3.1 conda environment (both use yaml and dir internally). The workaround is to activate the spipe environment manually with source activate spipe-1.3.1 inside the shell block of each rule, rather than using the Snakemake conda: directive.
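Because a Snakemake shell block is just a string, the manual activation can be pictured as a prefix on each command. The helper below is a hypothetical sketch (the name `in_spipe_env` is not part of the pipeline), and the split-pipe call is shown without the other arguments a real run would need.

```python
def in_spipe_env(command, env="spipe-1.3.1"):
    """Prefix a shell command so it runs inside the given conda environment."""
    return f"source activate {env} && {command}"


# Remaining split-pipe arguments are omitted in this sketch.
cmd = in_spipe_env("split-pipe --mode all")
print(cmd)
```

Chaining with `&&` means the command is skipped if activation fails, rather than running in whatever environment happens to be active.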
FASTQ Manifest
FASTQ paths are not hardcoded. Instead, create_parse_json.py crawls the raw sequencing directories and produces a per-plate JSON manifest (e.g. config/samples_plate3.json) mapping each sample ID to its R1/R2 file lists across all lanes and runs. Snakemake reads this at runtime:

    import json

    # Load the per-plate manifest named by MERGE_FQ_JSON in config/config.yaml.
    with open(config['MERGE_FQ_JSON']) as fh:
        MERGE_FQ = json.load(fh)
    ALL_SAMPLES = sorted(MERGE_FQ.keys())

    rule cat_fqs:
        input:
            # All per-lane R1 and R2 files listed for this sample
            r1 = lambda wildcards: MERGE_FQ[wildcards.sample]['R1'],
            r2 = lambda wildcards: MERGE_FQ[wildcards.sample]['R2']

Update MERGE_FQ_JSON in config/config.yaml to point to the correct plate manifest before running.
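For orientation, a manifest with this shape could be produced roughly as follows. This is a minimal sketch and not the actual create_parse_json.py: it assumes Illumina-style file names such as `sub1_S1_L001_R1_001.fastq.gz`, and the names `FASTQ_NAME`, `build_manifest`, and `write_manifest` are invented for illustration.

```python
import json
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme, e.g. sub1_S1_L001_R1_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S\d+_L\d{3}_(?P<read>R[12])_001\.fastq\.gz$"
)


def build_manifest(run_dirs):
    """Map each sample ID to R1/R2 path lists across all run directories."""
    manifest = defaultdict(lambda: {"R1": [], "R2": []})
    for run_dir in run_dirs:
        # Sorting keeps the R1 and R2 lists in the same lane order.
        for path in sorted(Path(run_dir).rglob("*.fastq.gz")):
            match = FASTQ_NAME.match(path.name)
            if match:
                manifest[match["sample"]][match["read"]].append(str(path))
    for sample, reads in manifest.items():
        if len(reads["R1"]) != len(reads["R2"]):
            raise ValueError(f"{sample}: unequal R1/R2 file counts")
    return dict(manifest)


def write_manifest(run_dirs, out_json):
    """Write the manifest as JSON and return it."""
    manifest = build_manifest(run_dirs)
    Path(out_json).write_text(json.dumps(manifest, indent=2))
    return manifest
```

Checking that every sample has as many R1 files as R2 files catches incomplete transfers before any concatenation starts.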
Technical Requirements
| Category | Requirement |
|---|---|
| Software | Parse split-pipe v1.3.1 |
| Environment | spipe-1.3.1 |
| Reference | ENSEMBL Release 113 (hg38) |
| Inputs | Per-lane demultiplexed FASTQs; JSON manifest; Sample List |
| Disk Space | ~3-5 TB per plate (Scratch) |
Library Composition
We processed 150 samples across three sequencing plates:
| Category | Detail |
|---|---|
| Plates | 3 (Plate 1: 15 sublibraries, Plate 2: 16, Plate 3: 12) |
| FASTQs per sublibrary | 24 files (4 lanes × 3 runs × 2 reads) |
| Total raw data | ~3.0 TB (~1 TB per plate) |
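The file counts in the table follow from simple arithmetic, reproduced below as a sanity check. The total assumes every sublibrary has the full set of lanes and runs, which the table implies but does not state for each sublibrary individually.

```python
SUBLIBRARIES = {"Plate 1": 15, "Plate 2": 16, "Plate 3": 12}
LANES, RUNS, READS = 4, 3, 2

fastqs_per_sublibrary = LANES * RUNS * READS
total_sublibraries = sum(SUBLIBRARIES.values())
total_fastqs = total_sublibraries * fastqs_per_sublibrary

print(fastqs_per_sublibrary)  # 24
print(total_sublibraries)     # 43
print(total_fastqs)           # 1032
```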
Resource Profile
| Rule | Threads | RAM | Walltime | Notes |
|---|---|---|---|---|
| get_refs | 1 | 5 GB | 1h | Local rule; wget |
| mk_ref | 6 | 64 GB | 1d | One-time index build |
| cat_fqs | 1 | 5 GB | n/a | Local rule; shell cat |
| run_parse | 32 | 360 GB | 10d | Per sublibrary |
| run_parse_combine | 32 | 360 GB | 3d | Per plate |
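As a rough budgeting aid, the table can be turned into a worst-case core-hour figure. The sketch below assumes the Walltime column is a requested limit rather than a measured runtime, so the result is a ceiling only; `RESOURCES`, `walltime_hours`, and `core_hour_ceiling` are names made up for this example.

```python
# Values transcribed from the resource table; None marks a rule with no walltime.
RESOURCES = {
    "get_refs": {"threads": 1, "walltime": "1h"},
    "mk_ref": {"threads": 6, "walltime": "1d"},
    "cat_fqs": {"threads": 1, "walltime": None},
    "run_parse": {"threads": 32, "walltime": "10d"},
    "run_parse_combine": {"threads": 32, "walltime": "3d"},
}
HOURS_PER_UNIT = {"h": 1, "d": 24}


def walltime_hours(walltime):
    """Convert strings such as '1h' or '10d' to hours."""
    return int(walltime[:-1]) * HOURS_PER_UNIT[walltime[-1]]


def core_hour_ceiling(rule, n_jobs=1):
    """Upper bound on core-hours if every job ran to its walltime limit."""
    entry = RESOURCES[rule]
    if entry["walltime"] is None:
        raise ValueError(f"{rule} has no walltime limit")
    return entry["threads"] * walltime_hours(entry["walltime"]) * n_jobs
```

For example, `core_hour_ceiling("run_parse", n_jobs=16)` gives the ceiling for the 16 sublibraries of Plate 2; actual usage would be lower whenever jobs finish before their limit.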
Storage Management
To stay within the 5 TB scratch quota, intermediate merged FASTQs are declared temp() in Snakemake, so each one is deleted as soon as the run_parse job that consumes it completes successfully. The cat_fqs and run_parse rules are therefore chained tightly, and concurrency is restricted so that merged FASTQ sets for many sublibraries never accumulate on scratch at the same time.
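The merge-then-delete lifecycle can be mimicked outside Snakemake with a small context manager. This is a sketch of the idea only, not Snakemake's temp() implementation; `concatenate` and `temporary_merge` are hypothetical names, and keeping the file when the body fails is a choice made for this example. Byte-wise concatenation is safe for gzipped FASTQs because concatenated gzip members form a valid gzip stream.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path


def concatenate(parts, merged):
    """Append files byte-for-byte, the same effect as shell `cat`."""
    with open(merged, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


@contextmanager
def temporary_merge(parts, merged):
    """Yield a merged file, then delete it only if the body succeeded."""
    concatenate(parts, merged)
    yield Path(merged)
    # Reached only when the with-block finished without raising.
    Path(merged).unlink()
```

A consumer would wrap its alignment step in `with temporary_merge(lane_files, merged_path) as merged:`; if that step raises, the merged file stays on disk for inspection, otherwise it is removed as soon as the block exits.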