Converting SRA files to fastq files
Next we need to convert our SRA files to fastq files.
A fastq file is a text-based file which contains sequence information (A, T, C, G) for all the short reads produced by a high-throughput sequencer. Each read takes up 4 lines in the fastq file, and each individual base in a particular read has an associated quality score. For more information see here.
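As an illustration, a single (made-up) record in a fastq file looks like this: an identifier line, the sequence, a separator line, and a quality string with one character per base:

@SRR0000000.1 1 length=36
ATCGGCTAAGCTAGCTAGGATCCGATCGATCGATTA
+
IIIIHHHGGGFFFFEEEDDDCCCBBBAAA@@@????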
To do this serially we use fastq-dump, which is also part of the SRA tools package.
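If you want to check that fastq-dump is installed and on your PATH before starting, it will report its version:

fastq-dump --version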
Single-end data
Our data is paired-end, but for the sake of completeness we can extract single-end files using a for loop that passes the SRA files we downloaded to fastq-dump one at a time.
for file in $(<sra_list.txt)
do
fastq-dump ~/ncbi/public/sra/${file}.sra --outdir ~/fastq_files --gzip
done
Note that we add the location the files were downloaded to, ~/ncbi/public/sra/, and the .sra suffix to each SRA accession we read in. We specify the output directory using the --outdir flag, and the --gzip flag tells the program to compress the fastq files to save storage space.
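To make that concrete, if sra_list.txt contained the (made-up) accessions below:

SRR1234567
SRR1234568

the loop would expand to one fastq-dump call per file:

fastq-dump ~/ncbi/public/sra/SRR1234567.sra --outdir ~/fastq_files --gzip
fastq-dump ~/ncbi/public/sra/SRR1234568.sra --outdir ~/fastq_files --gzip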
Paired-end data
For paired-end data we add two extra parameters, --split-files and --readids, to the above line of code:
for file in $(<sra_list.txt)
do
fastq-dump ~/ncbi/public/sra/${file}.sra --outdir ~/fastq_files --split-files --readids --gzip
done
This extracts the paired-end data into separate files. Both the --split-files and --readids flags are essential here, otherwise paired-end information is not annotated correctly. The documentation for the fastq-dump package is rubbish so it is not always immediately obvious which flags are required and why; this site provides a much better description of what is needed.
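As a rough illustration (using a made-up accession), each .sra file should now yield two compressed fastq files, one per mate, and thanks to --readids the read names in each file end in .1 or .2 so the mates can be matched up later:

~/fastq_files/SRR1234567_1.fastq.gz
~/fastq_files/SRR1234567_2.fastq.gz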
This part of the pipeline can be a bit of a bottleneck, so there is another package called parallel-fastq-dump that can speed up the process.
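parallel-fastq-dump is a separate wrapper around fastq-dump, so it needs installing first; it is usually available through bioconda or PyPI (assuming you have conda or pip set up):

conda install -c bioconda parallel-fastq-dump
# or
pip install parallel-fastq-dump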
Parallelisation
To speed up this process we effectively use the same parameters but add a --threads parameter. I set this to 8, which allocates 8 processors to the job. Beware when parallelising jobs: you don't want to allocate all the processors you have at once, as nothing will be left to run essential background processes and your computer will crash. Normally you retain one processor for background processes and assign the rest to the job.
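If you are running this on your own machine rather than a cluster, a quick way to follow that rule of thumb (a sketch, assuming a Linux system where nproc is available) is to derive the thread count from the core count:

# use all but one of the available cores
THREADS=$(( $(nproc) - 1 ))
echo "Running with ${THREADS} threads"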
The cluster parameters I set for this are #$ -pe smp 8 and #$ -l h_vmem=5G, so that's 40G of memory allocated in total for the fastq file conversion. We also add the --sra-id parameter before the input file when using parallel-fastq-dump.
for file in $(<sra_list.txt)
do
parallel-fastq-dump --sra-id ~/ncbi/public/sra/${file}.sra --threads 8 \
--outdir /c8000xd3/big-c1477909/PhD/sra_files/ --split-files --gzip --readids
done
Rename your files
Open a text file in nano called fastq_names.txt:
nano fastq_names.txt
Then, in the text file, type a memorable name for each of the fastq files you have generated, one per line:
Mono_1
Mono_2
Tcell_1
Tcell_2
Save the file by pressing ctrl x, then y, then enter.
Then, as described earlier, we use a while loop to rename our fastq files from the old SRA accession numbers to something more meaningful. Because --split-files produced two files per accession, we rename both mates.
while read -r old new; do
  # rename both mates produced by --split-files
  mv "${old}_1.fastq.gz" "${new}_1.fastq.gz"
  mv "${old}_2.fastq.gz" "${new}_2.fastq.gz"
done < <(paste sra_list.txt fastq_names.txt)
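Here paste joins the two lists side by side, and read -r old new splits each joined line into the old accession and the new name. With the hypothetical accessions used above, paste sra_list.txt fastq_names.txt would print tab-separated pairs along the lines of:

SRR1234567	Mono_1
SRR1234568	Mono_2

so each pass through the loop renames the pair of files belonging to one accession.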
We are now ready to do some simple quality control on the fastq files.
Move on to Quality control - FastQC/MultiQC, or back to Download SRA files.