Quality control - fastq files

It is important to check the quality of the fastq files you have generated. It is possible, though rare, for files to be corrupted while they are downloaded from the public repository. Similarly, the group that uploaded the data may have applied less or more stringent quality control than your own group would.
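Before running any QC tool, a quick structural check can catch a truncated or corrupted download. The sketch below is illustrative only: check_fastq is a hypothetical helper name, and it assumes the standard layout of four lines per read. It relies on gzip -t to test a compressed file and on wc -l to count lines.

```shell
#!/bin/sh
# Sketch: basic integrity check for one fastq file (plain or gzipped).
# check_fastq is a hypothetical helper, not part of FastQC or any other tool.

check_fastq() {
    fq="$1"
    case "$fq" in
        *.gz)
            # gzip -t verifies the compressed stream without writing any output.
            gzip -t "$fq" 2>/dev/null || { echo "CORRUPT $fq"; return 1; }
            n_lines=$(gzip -dc "$fq" | wc -l | tr -d ' ')
            ;;
        *)
            n_lines=$(wc -l < "$fq" | tr -d ' ')
            ;;
    esac

    # Assumes each read occupies exactly four lines.
    if [ $((n_lines % 4)) -ne 0 ]; then
        echo "TRUNCATED $fq"
        return 1
    fi
    echo "OK $fq reads=$((n_lines / 4))"
}

# Example call (uncomment to run on your own files):
# for f in ~/fastq_files/*.fastq.gz; do check_fastq "$f"; done
```

A file that passes this check can still contain poor-quality reads, so it complements the QC reports rather than replacing them.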

FastQC

FastQC provides an html document for each fastq file with key quality control metrics.

The specific metrics in the output html are explained extensively on the web (see this video made by the developers of FastQC), so I won't cover them here.

The command to run fastqc is simple.

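# Create the output directory first; fastqc will not create it.
mkdir -p ~/fastq_files/FastQC

# Loop over full paths (assumes names end in .fastq or .fastq.gz).
for file in ~/fastq_files/*.fastq ~/fastq_files/*.fastq.gz
do
    [ -e "${file}" ] || continue    # skip a pattern that matched nothing
    fastqc "${file}" -o ~/fastq_files/FastQC/
done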

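This is a fairly rapid process, but it can be parallelised with the -t parameter, which takes an integer for the number of files to process simultaneously. Note that -t only helps when several fastq files are passed to a single fastqc command; in the loop above each call receives just one file.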
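As a minimal sketch of the parallel form, the function below passes every fastq file to one fastqc call together with -t. The function name run_fastqc_parallel, the default of 4 threads, and the .fastq / .fastq.gz file endings are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: run FastQC once over all fastq files in a directory, using -t.
# run_fastqc_parallel is a hypothetical helper; 4 threads is an arbitrary default.

run_fastqc_parallel() {
    fastq_dir="$1"
    threads="${2:-4}"
    out_dir="${fastq_dir}/FastQC"

    # Collect the files that actually exist into the positional parameters.
    set --
    for f in "${fastq_dir}"/*.fastq "${fastq_dir}"/*.fastq.gz; do
        if [ -e "$f" ]; then
            set -- "$@" "$f"
        fi
    done

    if [ "$#" -eq 0 ]; then
        echo "No fastq files found in ${fastq_dir}" >&2
        return 1
    fi

    # fastqc expects the output directory to exist already.
    mkdir -p "${out_dir}"

    # A single invocation with many files lets -t process them concurrently.
    fastqc -t "${threads}" -o "${out_dir}" "$@"
}

# Example call (uncomment to run):
# run_fastqc_parallel ~/fastq_files 4
```

There is little benefit in setting the thread count higher than the number of fastq files, since each file is handled by one thread.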

MultiQC

Often we have several FastQC html reports to inspect at once. The package MultiQC condenses the information from all the FastQC reports into a single document, making it easier to quickly assess whether all the fastq files pass the quality checks.

First navigate to the fastq_files directory.

cd ~/fastq_files

Then run multiqc. We don't need a loop here, as multiqc searches the current directory and its subdirectories for FastQC output and adds everything it finds to the multiqc report.

multiqc .

The . is linux shorthand for the current directory. We would get the same result using:

multiqc ~/fastq_files
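If you would rather keep the combined report apart from the fastq files, multiqc accepts -o to choose an output directory. The sketch below wraps that in a small function; the name run_multiqc and the MultiQC subdirectory are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: write the MultiQC report into its own directory with -o.
# run_multiqc and the default "MultiQC" subdirectory are hypothetical choices.

run_multiqc() {
    in_dir="${1:-$HOME/fastq_files}"
    report_dir="${2:-${in_dir}/MultiQC}"

    mkdir -p "${report_dir}"

    # multiqc scans in_dir for FastQC output and writes its report to report_dir.
    multiqc "${in_dir}" -o "${report_dir}"
}

# Example call (uncomment to run):
# run_multiqc ~/fastq_files
```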

If you are working on your local machine, you can then inspect the report in your web browser. By default multiqc names it multiqc_report.html. To open it from the terminal on a Mac, use the open command.

open multiqc_report.html

However, if you are running this on a cluster, the easiest way to view the html report is to install a file transfer package such as FileZilla and transfer the html from the cluster to your machine.
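If you prefer to stay in the terminal, scp can copy the report over SSH instead. This is only a sketch: fetch_report is a hypothetical helper, the user and host name in the example are placeholders, and the remote path assumes the report sits in fastq_files under your home directory on the cluster with the default name multiqc_report.html.

```shell
#!/bin/sh
# Sketch: copy the MultiQC report from a cluster to the local machine with scp.
# fetch_report is a hypothetical helper; the host and paths are placeholders.

fetch_report() {
    remote="$1"                                          # e.g. user@cluster.example.org
    remote_path="${2:-fastq_files/multiqc_report.html}"  # relative to the remote home
    local_dir="${3:-.}"

    # Run this on your own machine, not on the cluster.
    scp "${remote}:${remote_path}" "${local_dir}/"
}

# Example call (uncomment and replace the placeholder login):
# fetch_report user@cluster.example.org
```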

Common failures during QC

Often you will find that fastq files fail QC because the reads still have adapters attached to them. Adapters are short synthetic sequences that are attached to the DNA fragments during library preparation so that they can be sequenced; they can also carry barcodes (indexes) that identify which sample a read came from. Most modern sequencers recognise adapter sequences and remove them from the fastq files automatically, but for data generated on older machines they need to be removed programmatically. See the Trim Adapter section for an explanation of how to do this.
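To get a rough idea of how much adapter is left in a file, you can count the reads that contain an adapter sequence. The sketch below is not how FastQC measures adapter content; count_adapter_reads is a hypothetical helper, and the default sequence AGATCGGAAGAGC (the start of the common Illumina adapter) is an assumption that should be replaced with the adapter used in your library preparation.

```shell
#!/bin/sh
# Sketch: count reads whose sequence line contains a given adapter sequence.
# count_adapter_reads is a hypothetical helper; the default adapter is an
# assumption (start of the common Illumina adapter) and may not match your data.

count_adapter_reads() {
    fq="$1"
    adapter="${2:-AGATCGGAAGAGC}"

    # Stream the file, decompressing it first if it is gzipped.
    case "$fq" in
        *.gz) gzip -dc "$fq" ;;
        *)    cat "$fq" ;;
    esac |
    # In four-line fastq records, line 2 of each record holds the sequence.
    awk -v a="$adapter" '
        NR % 4 == 2 { total++; if (index($0, a)) hits++ }
        END { printf "%d of %d reads contain %s\n", hits, total, a }'
}

# Example call (uncomment to run; the file name is a placeholder):
# count_adapter_reads ~/fastq_files/sample.fastq.gz
```

This only finds exact, complete matches of the chosen sequence, so it will undercount reads that end partway through an adapter.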

Move on to Trim adapters, or back to SRA to fastq.