4 basic files

To run Snakemake you need 4 basic file types:

  • Snakefile
  • Shell script
  • Config file
  • Cluster config file using a Slurm profile

We will discuss these below.


Create dummy fastq files

First navigate to the /scratch/$(whoami)/smk-tutorial/workflow/ directory and create some dummy fastq files:

mkdir -p ../resources/fq
for i in {1..4}; do echo "cortex_ExN_$i" > ../resources/fq/cortex_ExN_$i.fastq ; done

Which should create the following:

tree ../resources/fq/
../resources/fq/
├── cortex_ExN_1.fastq
├── cortex_ExN_2.fastq
├── cortex_ExN_3.fastq
└── cortex_ExN_4.fastq

Snakefile

The first file that we require is the Snakefile. This file is the central file used by Snakemake to run your pipeline. Copy the following code into the Snakefile in the current directory.

configfile: '../config/config.yaml'

rule all:
    input:
	expand("../results/01_process_fq/{sample}.fastq", sample = config['samples'])

rule process_fastqs:
    input:  "../resources/fq/{sample}.fastq"
    output: "../results/01_process_fq/{sample}.fastq"
    shell:
            """

            cat {input} > {output}
            printf "\nfq processed.\n" >> {output}

            """

Snakemake allows you to modularise the various steps of your pipeline into discrete units called rules. At the moment all the rules in our pipeline are contained within the Snakefile, but as our pipeline grows, we can store rules elsewhere and use our Snakefile as the hub for all our sub-processes. At the moment our pipeline contains two rules: all and process_fastqs. Let’s read the code from the bottom up.

The process_fastqs rule has 3 directives, input, output and shell, which are fairly self-explanatory. The input and output directives tell Snakemake where to look for the input and output files when the process_fastqs jobs are run, and the shell directive contains the code (or script) that will be run for each instance of process_fastqs.

Code in curly brackets in the shell directive acts as a placeholder for the correspondingly named directive. For example, {input} will be replaced with ../resources/fq/{sample}.fastq when the process is run. (We’ll get to what replaces {sample} shortly.)
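
For example, for sample cortex_ExN_1 the shell block is rendered roughly as the following commands before the job runs (shown here purely as an illustration; Snakemake performs this substitution for you):

# Substituted shell block for one sample (illustration only)
cat ../resources/fq/cortex_ExN_1.fastq > ../results/01_process_fq/cortex_ExN_1.fastq
printf "\nfq processed.\n" >> ../results/01_process_fq/cortex_ExN_1.fastq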

By convention every Snakefile has an all rule, which acts as the default target for the workflow. We need to pass all the final output files that we want to track to the input directive of rule all for the Snakemake process to function properly. At the moment we only have one other rule, process_fastqs, so we only need to pass the output files of that rule to rule all, but as the pipeline expands we will have to manage how files are tracked from rule to rule and eventually to rule all. We’ll cover this in more detail later.
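
As an aside, you can also ask Snakemake to build a single target file rather than everything listed in rule all by naming it on the command line. Once the config file and profile described below are in place, a dry run for a single sample would look like this (purely illustrative):

snakemake -np ../results/01_process_fq/cortex_ExN_1.fastq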

Finally, the first line, configfile: '../config/config.yaml', tells Snakemake where your config file is.


Config file

A config(uration) file is used to set parameters and settings for your pipeline. You can use it to abstract out project-specific information, such as details about samples or donors. In theory this makes the code in each rule project agnostic: because a rule only contains general processing instructions, it can be reused from project to project, and all that needs to change are the details in your config file.

Run the following to populate your config file:

printf "samples:\n" > ../config/config.yaml
for i in {1..4}; do printf "    cortex_ExN_$i:\n" >> ../config/config.yaml ; done

Which produces:

cat ../config/config.yaml 
samples:
    cortex_ExN_1:
    cortex_ExN_2:
    cortex_ExN_3:
    cortex_ExN_4:

Here we are creating a key in our config file called samples which stores the sample names associated with our fastq files. Snakemake can reference this to populate {sample} in the process_fastqs and all rules. Any parameter in curly brackets in the input and output sections of a rule is called a wildcard. We use wildcards as a means to generalise code: they act as placeholders that we can populate with information stored in our config file, in .csv files, or generated by a function. The {sample} wildcard will be populated with the sample names stored in our config file.

Note that if we wish to add or remove samples from the analysis we only need to alter the config file; there is no need to touch the rules in the Snakefile. Designing pipelines in this way means we can reuse rule code across projects, as sketched below.
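
For example (illustration only, do not run this as part of the tutorial), adding a hypothetical fifth sample would just mean providing its raw file and registering it in the config; the Snakefile stays untouched:

# Hypothetical extra sample: add its fastq and one config line, nothing else
echo "cortex_ExN_5" > ../resources/fq/cortex_ExN_5.fastq
printf "    cortex_ExN_5:\n" >> ../config/config.yaml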

We tell Snakemake to reference the config file inside rule all with sample = config['samples']. This assigns whatever is stored under the samples key in the config file to the variable sample.

The expand function in rule all uses this information to create a list containing every possible unique expansion of the pattern in its first parameter. So expanding {sample} in "../results/01_process_fq/{sample}.fastq" generates the following list:

["../resources/fq/cortex_ExN_1.fastq", "../resources/fq/cortex_ExN_2.fastq", "../resources/fq/cortex_ExN_3.fastq", "../resources/fq/cortex_ExN_4.fastq"]

As this list is passed to the input of rule all, Snakemake will check that these 4 files exist at the end of the run.

By contrast, we don’t use the expand function in the input and output sections of the process_fastqs rule, so how does Snakemake fill in the {sample} wildcard here? When a wildcard appears without expand, Snakemake infers its value from each output file requested by rule all and creates one job per value. In our case, Snakemake will run 4 separate instances of the process_fastqs rule, one for each sample listed in the config file. This allows the 4 process_fastqs jobs to run in parallel.


So a couple of important concepts here:

  • wildcard - acts as a placeholder that generalises code and is populated with specific information
  • expand - outputs every possible wildcard expansion; without expand, wildcards are populated one job at a time

Snakemake profile

Snakemake allows you to set up a project-specific profile to configure your Snakemake and cluster parameters. We can set default and/or rule-specific parameters; if no rule-specific parameters are set, Snakemake automatically falls back to the defaults. This gives us fine control over the resources allocated to each rule.

Let’s populate our cluster config file. Copy and paste the following into ../config/profile/config.yaml:

snakefile: Snakefile
cores: 1
use-conda: True
keep-going: True
jobs: 10
rerun-incomplete: True
restart-times: 3
cluster:
  mkdir -p ../results/00LOG/smk-logfiles &&
  sbatch
    --ntasks={threads}
    --mem={resources.mem_mb}
    --job-name=smk-{rule}
    --output=../results/00LOG/smk-logfiles/{rule}.%j.out
    --error=../results/00LOG/smk-logfiles/{rule}.%j.err
    --account=<ADD YOUR ACCOUNT ID HERE>

default-resources:
  - ntasks=1
  - mem_mb=5000
  - time="0-05:00:00"

Remember to replace <ADD YOUR ACCOUNT ID HERE> with your own Slurm account ID. If you want to set rule-specific parameters, these are defined with per-rule directives (such as threads and resources), but we will get to that later.


snakemake.sh

Finally we need to pull it all together with a shell script. Snakemake automatically manages the job scheduling for the pipeline and tells the cluster what resources to allocate (no need for Slurm directive headers). It references both the config.yaml file in the ../config/profile/ directory and the rule-specific parameter specifications in each rule (we haven’t set any of the latter yet). All you need to change is the email address to your own, so that a report is mailed to you when the pipeline completes or an error is thrown.

snakemake --profile ../config/profile/ $@ 2> smk-"`date +"%d-%m-%Y"`".log; mail -s "Snakemake has finished" <YOUR EMAIL HERE> < smk-"`date +"%d-%m-%Y"`".log
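
Depending on how you created snakemake.sh it may not be executable yet; if not, the usual shell fix applies (this is just a standard permissions step, not something Snakemake requires):

chmod +x snakemake.sh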

Dry run

Before we run our first pipeline we want to do a dry run. This has to be run from the directory that contains the Snakefile. During the dry run Snakemake locates the Snakefile, makes sure that all the input and output files in all rules are accounted for and checks the code for errors, without actually executing any jobs. To run the dry run type the following (-n performs a dry run, -p prints the shell commands that would be executed):

snakemake -np

A report is produced listing all the jobs that will be run with the wildcards filled in for each separate job. Notice that the last job in the list is the rule all job.

# Only job stats shown here for brevity. You should see a list of individual jobs too.
Job stats:
job               count    min threads    max threads
--------------  -------  -------------  -------------
all                   1              1              1
process_fastqs        4              1              1
total                 5              1              1
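
If you would like to see the job graph itself rather than just the counts, Snakemake can print it in dot format. Rendering it to an image assumes Graphviz (the dot command) is available on your system, which it may not be; this step is entirely optional:

# Write the job graph to an SVG file (requires Graphviz)
snakemake --dag | dot -Tsvg > dag.svg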

Run the job

All we need to do now is run the snakemake.sh script. However, the script has to keep running for the duration of the pipeline, so we set up a terminal multiplexer and run it in the background. These are described in more detail here. On Hawk, screen is pre-installed so we can use that.

Set up a new screen session by typing:

screen -S smk

This will open a new window which allows you to run your snakemake pipeline in the background. Note that this is not the same environment that you were in previously so you need to reactivate the smk-tutorial environment:

/scratch/$(whoami)/smk-tutorial

Now you are set. Run the pipeline by typing:

./snakemake.sh

This will submit the 4 process_fastqs jobs to the appropriate Hawk partition:

squeue -u $USER
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
56885128       htc smk_tuto c.c14779 PD       0:00      1 (Priority)
56885127       htc smk_tuto c.c14779 PD       0:00      1 (Priority)
56885126       htc smk_tuto c.c14779 PD       0:00      1 (Priority)
56885125       htc smk_tuto c.c14779 PD       0:00      1 (Priority)

To detach from the screen session press Ctrl+A, then D. The snakemake.sh script will continue running in the background. To reattach, type:

screen -x smk

For a rundown of the most common screen commands type screen --help or look here.
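
For reference, a few screen commands you are likely to need during a run (beyond -S and -x used above):

screen -ls              # list your current screen sessions
screen -x smk           # reattach to the smk session
screen -X -S smk quit   # kill the smk session once the pipeline has finished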


Output

Snakemake generates several log files. After all the jobs have run, your folder should look like this:

ls -1
envs
reports
rules
scripts
smk-24-05-2023.log
Snakefile
snakemake.sh

The smk-24-05-2023.log file is the general log for the Snakemake process. If you are running a pipeline with thousands of jobs, this log keeps you up to date with the submission and completion statuses of those jobs:

Submitted job 4 with external jobid 'Submitted batch job 56885324'.

[Wed May 24 15:38:07 2023]
rule process_fastqs:
    input: ../resources/fq/cortex_ExN_1.fastq
    output: ../results/01_process_fq/cortex_ExN_1.fastq
    jobid: 1
    wildcards: sample=cortex_ExN_1
    resources: tmpdir=/tmp

Submitted job 1 with external jobid 'Submitted batch job 56885325'.
[Wed May 24 15:39:06 2023]
Finished job 2.
1 of 5 steps (20%) done
[Wed May 24 15:39:06 2023]
Finished job 3.
2 of 5 steps (40%) done
[Wed May 24 15:39:06 2023]
Finished job 4.
3 of 5 steps (60%) done
[Wed May 24 15:39:06 2023]
Finished job 1.
4 of 5 steps (80%) done
Select jobs to execute...

[Wed May 24 15:39:06 2023]
localrule all:
    input: ../results/01_process_fq/cortex_ExN_1.fastq, ../results/01_process_fq/cortex_ExN_2.fastq, ../results/01_process_fq/cortex_ExN_3.fastq, ../results/01_process_fq/cortex_ExN_4.fastq
    jobid: 0
    resources: tmpdir=/tmp

[Wed May 24 15:39:06 2023]
Finished job 0.
5 of 5 steps (100%) done

If any job fails for any reason this is the first thing to check to try to ascertain at which point the pipeline failed. A copy of this log file is sent to your email address.
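
If you just want a quick look without opening the whole file, standard shell tools are enough (the log file name will of course contain the date you ran the pipeline):

tail -n 20 smk-24-05-2023.log     # how did the run finish?
grep -i error smk-24-05-2023.log  # any errors reported?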

The individual stdout and stderr log files generated by Slurm for each process_fastqs job are located in the ../results/00LOG/smk-logfiles directory. Note that each file name contains both the Slurm ID for the job and the rule the job came from, so that you can cross-reference these when necessary.
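
For example, to list them and peek at the stdout of the process_fastqs jobs (the Slurm job IDs in the file names will differ from the ones shown in this tutorial):

ls -1 ../results/00LOG/smk-logfiles/
tail ../results/00LOG/smk-logfiles/process_fastqs.*.out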

Lastly, let's quickly check that the output files were generated and that their content is correct.

ls -1 ../results/01_process_fq/ 
cortex_ExN_1.fastq
cortex_ExN_2.fastq
cortex_ExN_3.fastq
cortex_ExN_4.fastq

cat ../results/01_process_fq/cortex_ExN_*
cortex_ExN_1
fq processed.
cortex_ExN_2
fq processed.
cortex_ExN_3
fq processed.
cortex_ExN_4
fq processed.

Great. All we have done here is copy the fastq files from the resources directory to the results directory and append the text "fq processed." to each file. At this stage, it doesn’t really matter what each job is actually doing. The important point is that Snakemake is not interested in what each job does; it is only interested in tracking files and setting parameters in order to communicate with Hawk, schedule your jobs and run your pipeline efficiently. You can see this file tracking in action below.
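
Because all the tracked output files now exist and are newer than their inputs, another dry run should report that there is nothing left to do (the exact wording varies between Snakemake versions):

snakemake -np
# Expect a message along the lines of:
# Nothing to be done (all requested files are present and up to date).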


Move on to Run basic process.

