01 · Parse Alignment

Overview

This pipeline processes raw multiplexed FASTQ libraries from the Parse Biosciences SPLiT-seq protocol into per-plate Digital Gene Expression (DGE) matrices. It handles reference building, multi-lane FASTQ concatenation, per-sublibrary alignment and barcode demultiplexing, and final plate-level combination.


graph TD
    classDef snazzy fill:#f1f1f1,stroke:#333,stroke-width:2px,color:#000;
    classDef out fill:#e1f5fe,stroke:#01579b,color:#000;

    A[ENSEMBL hg38 FA + GTF] --> B(get_refs)
    B --> C(mk_ref)
    D[samples_plateN.json] --> E(cat_fqs)
    C --> F(run_parse)
    E --> F
    F --> G(run_parse_combine)
    G --> H[(Plate DGE Matrix)]

    class A,B,C,D,E,F,G snazzy
    class H out

  1. get_refs: Downloads the Ensembl GRCh38 release 113 FASTA and GTF from the Ensembl FTP.
  2. mk_ref: Builds a split-pipe genome index from the reference files.
  3. cat_fqs: Concatenates per-lane FASTQ files for each sublibrary into single R1 and R2 files.
  4. run_parse: Runs split-pipe --mode all on each sublibrary: barcode error correction, read trimming, genome alignment, and DGE generation.
  5. run_parse_combine: Aggregates all sublibraries from a plate into a unified plate-level DGE matrix using
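Step 3 can be sketched in Python as a byte-level concatenation. This is only an illustration: the resource table lists the real rule as a shell cat, and the name concat_fastqs and its arguments are hypothetical. It relies on the fact that gzip members can be joined byte-for-byte into a valid gzip stream, so the per-lane files do not need to be decompressed first.

```python
import shutil
from pathlib import Path


def concat_fastqs(parts, dest):
    """Append each per-lane file to dest, in the order given (hypothetical helper)."""
    dest = Path(dest)
    with dest.open("wb") as out:
        for part in parts:
            # Binary copy: works for plain and gzip-compressed FASTQ alike.
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return dest
```

The R1 and R2 lists should be passed in the same lane and run order, otherwise read mates would no longer line up between the two merged files.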

Warning

Environment clash workaround

The Snakemake YAML parser and the spipe-1.3.1 conda environment conflict in a way that could not be resolved (both use yaml and dir internally). As a workaround, each rule activates the spipe environment manually with source activate spipe-1.3.1 inside its shell block, rather than using the Snakemake conda: directive.
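A minimal sketch of the workaround, written as a plain Python helper that builds the command string a rule's shell block would run. The helper name with_env is hypothetical, and the split-pipe arguments are deliberately left incomplete; it assumes source activate is available in the job's shell, as the workaround implies.

```python
import shlex


def with_env(command, env_name="spipe-1.3.1"):
    """Return a shell string that activates env_name, then runs command."""
    # Assumption: `source activate` works in the shell that executes the rule.
    return f"source activate {shlex.quote(env_name)} && {command}"


# Remaining split-pipe arguments omitted on purpose.
print(with_env("split-pipe --mode all"))
# source activate spipe-1.3.1 && split-pipe --mode all
```

Chaining with && means the tool is never started if the activation itself fails.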


FASTQ Manifest

FASTQ paths are not hardcoded. Instead, create_parse_json.py crawls the raw sequencing directories and produces a per-plate JSON manifest (e.g. config/samples_plate3.json) mapping each sample ID to its R1/R2 file lists across all lanes and runs. Snakemake reads this at runtime:

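import json

# Load the per-plate manifest; its path comes from config/config.yaml.
with open(config['MERGE_FQ_JSON']) as fh:
    MERGE_FQ = json.load(fh)
ALL_SAMPLES = sorted(MERGE_FQ.keys())

rule cat_fqs:
    input:
        # Per-lane FASTQ lists for this sample, looked up in the manifest.
        r1 = lambda wildcards: MERGE_FQ[wildcards.sample]['R1'],
        r2 = lambda wildcards: MERGE_FQ[wildcards.sample]['R2']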

Update MERGE_FQ_JSON in config/config.yaml to point to the correct plate manifest before running.
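To make the manifest shape concrete, here is a hedged sketch of how such a crawl might look. It is not the code of create_parse_json.py, which is not shown here. The filename pattern (Illumina-style <sample>_S<n>_L<lane>_R<1|2>_001.fastq.gz) and the names build_manifest and write_manifest are assumptions for illustration.

```python
import json
import re
from collections import defaultdict
from pathlib import Path

# Assumed filename pattern (Illumina bcl2fastq style); adjust to the real data.
FASTQ_RE = re.compile(r"^(?P<sample>.+)_S\d+_L\d{3}_(?P<read>R[12])_001\.fastq\.gz$")


def build_manifest(run_dirs):
    """Map each sample ID to R1/R2 path lists, ordered by run directory, then path."""
    manifest = defaultdict(lambda: {"R1": [], "R2": []})
    for run_dir in run_dirs:
        for path in sorted(Path(run_dir).rglob("*.fastq.gz")):
            match = FASTQ_RE.match(path.name)
            if match:  # files that do not fit the assumed pattern are skipped
                manifest[match["sample"]][match["read"]].append(str(path))
    return dict(manifest)


def write_manifest(manifest, dest):
    """Write the manifest as JSON, e.g. to config/samples_plate3.json."""
    with open(dest, "w") as fh:
        json.dump(manifest, fh, indent=2, sort_keys=True)
```

Because paths are sorted within each run directory, the R1 and R2 lists of a sample come out in the same lane order, which the later concatenation depends on.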


Technical Requirements

| Category | Requirement |
|---|---|
| Software | Parse split-pipe v1.3.1 |
| Environment | spipe-1.3.1 |
| Reference | Ensembl Release 113 (hg38) |
| Inputs | Per-lane demultiplexed FASTQs; JSON manifest; Sample List |
| Disk Space | ~3-5 TB per plate (Scratch) |

Library Composition

We processed 150 samples across three sequencing plates:

| Category | Detail |
|---|---|
| Plates | 3 (Plate 1: 15 sublibraries, Plate 2: 16, Plate 3: 12) |
| FASTQs per sublibrary | 24 files (4 lanes × 3 runs × 2 reads) |
| Total raw data | ~3.0 TB (~1 TB per plate) |
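The counts above imply the totals below. This is a quick arithmetic check using only the stated numbers. The helper has_expected_files is hypothetical and assumes each manifest entry corresponds to one sublibrary.

```python
SUBLIBRARIES = {"Plate 1": 15, "Plate 2": 16, "Plate 3": 12}
LANES, RUNS, READS = 4, 3, 2

files_per_sublibrary = LANES * RUNS * READS        # 24
total_sublibraries = sum(SUBLIBRARIES.values())    # 43
total_fastqs = total_sublibraries * files_per_sublibrary  # 1032


def has_expected_files(entry):
    """True if a manifest entry lists one R1 and one R2 file per lane and run."""
    expected = LANES * RUNS  # 12 files per read direction
    return len(entry["R1"]) == expected and len(entry["R2"]) == expected
```

A check like this could be run over a plate manifest before starting the workflow, so that a missing lane is caught before any alignment time is spent.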

Resource Profile

| Rule | Threads | RAM | Walltime | Notes |
|---|---|---|---|---|
| get_refs | 1 | 5 GB | 1h | Local rule; wget |
| mk_ref | 6 | 64 GB | 1d | One-time index build |
| cat_fqs | 1 | 5 GB | | Local rule; shell cat |
| run_parse | 32 | 360 GB | 10d | Per sublibrary |
| run_parse_combine | 32 | 360 GB | 3d | Per plate |
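The profile can also be kept as a lookup table and converted to the units Snakemake's standard mem_mb and runtime (minutes) resources use. How the workflow actually declares its resources is not shown here, so treat this as a sketch. The names RESOURCES, walltime_minutes and mem_mb are illustrative, and the GB-to-MB factor is an assumption.

```python
# Values copied from the resource table; cat_fqs has no walltime listed.
RESOURCES = {
    "get_refs":          {"threads": 1,  "mem_gb": 5,   "walltime": "1h"},
    "mk_ref":            {"threads": 6,  "mem_gb": 64,  "walltime": "1d"},
    "cat_fqs":           {"threads": 1,  "mem_gb": 5,   "walltime": None},
    "run_parse":         {"threads": 32, "mem_gb": 360, "walltime": "10d"},
    "run_parse_combine": {"threads": 32, "mem_gb": 360, "walltime": "3d"},
}

_MINUTES = {"h": 60, "d": 24 * 60}


def walltime_minutes(value):
    """Convert strings such as '10d' or '1h' to minutes; None stays None."""
    if value is None:
        return None
    return int(value[:-1]) * _MINUTES[value[-1]]


def mem_mb(rule, gb_to_mb=1000):
    # Assumption: 1 GB = 1000 MB; pass gb_to_mb=1024 if the scheduler counts in MiB.
    return RESOURCES[rule]["mem_gb"] * gb_to_mb
```

For example, walltime_minutes("10d") gives 14400 and mem_mb("run_parse") gives 360000.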

Storage Management

To stay within the 5 TB scratch quota, the intermediate merged FASTQs are declared temp() in Snakemake, so they are deleted as soon as alignment completes successfully. For the same reason, cat_fqs and run_parse are chained tightly and concurrency is restricted, so that several large merged FASTQ sets never accumulate on scratch at once.
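The concurrency limit follows from a simple budget: the number of jobs in flight times the scratch footprint of each must fit in the quota. The sketch below shows that arithmetic only. The function name and the example footprint and reserve are hypothetical, not measured values from this pipeline.

```python
import math


def max_concurrent_jobs(quota_tb, per_job_tb, reserved_tb=0.0):
    """Largest number of jobs whose scratch footprints fit in the quota."""
    if per_job_tb <= 0:
        raise ValueError("per_job_tb must be positive")
    available = quota_tb - reserved_tb
    return max(0, math.floor(available / per_job_tb))


# Hypothetical numbers: 5 TB quota, 1 TB kept free for finished outputs,
# 0.5 TB per in-flight sublibrary (merged FASTQs plus intermediates).
print(max_concurrent_jobs(5.0, 0.5, reserved_tb=1.0))  # 8
```

A figure obtained this way could inform the job cap passed to Snakemake, for instance through its --jobs option.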

