Aligning reads to reference genome

To align the short reads in our fastq files to a reference genome we use the alignement package Bowtie2, however, we must first download a reference genome to align our reads to. We only need to do this once for each genome we intend to use. These genome files also need to be indexed.

Downloading reference genomes

Most relevant human data currently in the literature is aligned to hg19 but as hg38 is the most up to date genome build it will eventually become the standard as groups move across to it. Note the hg and GRCh notations are used interchangebly.

hg19 = GRCh37 from 2009
hg38 = GRCh38 from 2013

Our data were generated from human tissue using the hg19 build. So we need to download the hg19 build of the human genome.

We download the reference from UCSC.

From your home folder navigate to the genomes_and_index_files folder:

cd genomes_and_index_files

Then type the following:

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz

The wget command is a means of downloading files from the web using linux. This is quite a big file so may take a wee while. Once complete, we have downloaded a fasta file containing the entire human genome.

Indexing the reference genome

Before aligning our reads we need to first index our reference genome. This allows Bowtie2 to access the genome in a more efficient manner. Indexing only needs to be done once per genome.

bowtie2-build hg19.fa hg38

This creates a few additional files in your genomes_and_index_files folder.

Aligning reads to hg19 with Bowtie2

To align the reads in our fastq files we write a for loop to run through all the files in our fastq_files folder, one at a time.

for file in `ls ~/fastq_files`
do

    prefix=$(basename ${sample} ".fastq.gz")
    echo ${prefix}

    bowtie2 -p 7 -x ~/genomes_and_index_files/hg19 \
    -U ${file}  -S ~/sam_files/${prefix}_hg19.sam

done

The -p parameter is for parallelisation and assigns 7 processors to the job (only do this on the cluster - unless you know how many processors you have on your machine). This will generate a sam file for each of your fastq files in the sam_files folder (specified using the -S parameter).

This command is for single ended data, and the single input file is passed to the -U flag. If your data is paired ended, the flags are -1 and -2. Regardless of whether your reads are single or paired ended a single sam file is always produced by Bowtie2.

Remember: If you had to trim adapters you need to change the input folder in the first line of the for loop to ~/trimmed_fastq_files

Additonal tips

There are a couple of useful extra tips to introduce here.

Notice the use of the basename function in the for loop. This is a concise way to strip the directory tree and suffixes from filenames.

basename /home/David_Bowie/fastq_files/fastq_1.fastq.gz ".fastq.gz"

The above command returns only fastq1, removing everything before the rightmost forward slash, and whatever suffix you specify in the quotes.

This is extremely useful can assign this abbreviation to a variable which you can call to preserve naming continuity when generating different file formats. Here we assign the abbreviation to the variable prefix.

Signposting - echo statements

Many programs do not produce output as they process files, so when scripting in bash it helps to manually add echo statements to inform us of what processing has been completed and what there is still to do.

This is important when running loops on multiple files that takes considerable to complete, and even more important if you’re writing a script to automate multiple processes. Occasionally, sessions will crash for unexplainable reasons, so without simple signposts to let us know what was completed safely before the crash, it can be difficult to indentify those files that may have been corrupted.

Move on to Converting sam files to bam files, or back to Trim adapters.