Introduction

This section covers basic Linux information and tips for setting up an environment for bioinformatics: in my opinion, the bare essentials for getting started without prior programming knowledge. The material is elementary, so experienced users should skip it.

It is by no means exhaustive, nor the best resource available, but it contains enough to get started and covers the ‘obvious’ stuff I wish I had known when I was starting out.


Open a terminal

To open a terminal on a Mac, press and hold Cmd, then press Space to open Spotlight. Type ‘terminal’ and press Enter.

This opens a terminal window in your home directory. For instructions on how to do this on a Windows machine, see the relevant documentation (e.g., Windows Terminal, PowerShell, or WSL).

The terminal is just an alternative way to access the file system on your computer without using the graphical user interface (GUI). Instead of seeing icons representing folders and programs, we see text-based lists. When we open a terminal window, we are typically located in our home directory.


To carry out tasks on the command line, we need to navigate between directories. To find out where we are at any given time, type:

pwd

This prints the current working directory (your location in the file hierarchy) to the screen. To list all directories and files in the current directory, type:

ls

Directories are often displayed in a different colour (commonly blue), but this colour scheme may vary or be absent depending on your system configuration.
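If your terminal shows no colours, the -F option of ls is one way to tell directories apart, since it appends a / to directory names. The sketch below assumes a throwaway directory created with mktemp; the names Programs and notes.txt are made up for illustration.

```shell
# Create a temporary practice directory (location chosen by mktemp)
listing_dir=$(mktemp -d)
mkdir "$listing_dir/Programs"      # a directory
touch "$listing_dir/notes.txt"     # an empty file

# -F marks directories with a trailing slash, e.g. Programs/
ls -F "$listing_dir"
```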

To move between directories, use the cd command:

cd Desktop

This changes the current directory to Desktop. Typing ls again will list the contents of the Desktop directory.

To move up one level in the directory hierarchy:

cd ..

At any time, to return to the home directory from anywhere in the file hierarchy, type:

cd

# OR

cd ~ 

The tilde (~) is shorthand for your home directory.
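The commands above can be tried safely in a scratch directory. This is a minimal sketch, assuming a temporary directory made with mktemp; the Desktop and Programs folders inside it are stand-ins, not your real ones.

```shell
# Build a small practice tree in a temporary location (names are illustrative)
practice_dir=$(mktemp -d)
mkdir -p "$practice_dir/Desktop/Programs"

cd "$practice_dir"      # move into the practice directory
pwd                     # print where we are
ls                      # lists: Desktop

cd Desktop              # move down one level
ls                      # lists: Programs

cd ..                   # move back up one level
after_up=$(pwd)         # we are in the practice directory again

cd ~                    # return to the home directory
pwd
```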


Relative and absolute paths

From the Desktop directory, you can navigate using either relative or absolute paths.

A relative path is interpreted relative to your current working directory. For example, if you are currently in:

/Users/Darren/Desktop/Programs/

and there is a subdirectory called Homer inside Programs, you can move into it with:

cd Homer

An absolute path specifies the full path from the root directory (/). Typing the absolute path would do the same thing from your current location:

cd /Users/Darren/Desktop/Programs/Homer

An absolute path will take you to that directory from anywhere in the file system. A relative path only works relative to your current location. Note: cd /Homer would attempt to access a directory named Homer directly under the root (/), which is not the same as a subdirectory of your current location.
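To see the difference in practice, the following sketch rebuilds the example layout under a temporary directory (so base_dir is a hypothetical stand-in for /Users/Darren) and reaches Homer both ways.

```shell
# Hypothetical layout mirroring the example above, built under a temporary directory
base_dir=$(mktemp -d)
mkdir -p "$base_dir/Desktop/Programs/Homer"

# Relative path: only works because we first move into Programs
cd "$base_dir/Desktop/Programs"
cd Homer
via_relative=$(pwd)

# Absolute path: works from anywhere, here from the root directory
cd /
cd "$base_dir/Desktop/Programs/Homer"
via_absolute=$(pwd)

# Both routes end in the same directory
[ "$via_relative" = "$via_absolute" ] && echo "same directory"

# From the root, the relative form fails (assuming there is no /Homer on your system)
cd /
cd Homer 2>/dev/null || echo "no Homer directory relative to /"
```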


Autocompletion

A useful feature of the command line is tab completion.

Instead of typing long directory names in full, you can type the first few letters of a directory or file name and press the Tab key to autocomplete it.

If multiple files or directories begin with the same prefix, press Tab twice to display the available options. You may need to type additional letters until the name is uniquely identifiable.

You can also autocomplete through multiple directory levels consecutively if you know their names.

Tab completion is case-sensitive on most Unix-like systems.


Assigning variables

Variables are used to store information in programming. In Bash, a variable is assigned with the = operator: the variable name goes on the left and the value we want to assign goes on the right, much like in algebra. Note that, in Bash, there must be no spaces on either side of the =.

x=1

To access the value of a variable, prefix it with $:

echo $x

#> 1

Note that variables defined this way exist only in the current shell session unless exported or defined in a startup file.

The echo command prints its arguments to the screen.
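A slightly longer sketch of the same ideas: building a file name from a variable, and the effect of export on programs started from the shell. The variable name sample_name and its value are assumptions chosen for illustration.

```shell
# Hypothetical sample name, used only for illustration
sample_name=pair_1

# Braces mark where the variable name ends; quotes keep the value as one word
read_file="${sample_name}_subset.fastq.gz"
echo "$read_file"
#> pair_1_subset.fastq.gz

# A plain variable is not passed on to programs started from this shell...
child_before=$(sh -c 'echo "${sample_name:-unset}"')
echo "$child_before"
#> unset

# ...until it is exported
export sample_name
child_after=$(sh -c 'echo "${sample_name:-unset}"')
echo "$child_after"
#> pair_1
```

Without the braces, the shell would look for a variable called sample_name_subset, which does not exist.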


Hidden files

When you type ls, not all files in the directory are shown. Many directories contain hidden files that are used for configuration and background processes.

Hidden files typically begin with a dot (.).

To display all files, including hidden ones:

ls -a

You will see files such as:

.bash_profile

Modifying these files can affect shell behaviour, so changes should be made carefully.
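To watch the difference between ls and ls -a without touching your real configuration files, here is a small sketch using a temporary directory. The file names visible.txt and .hidden_config are made up.

```shell
# Temporary directory holding one ordinary file and one hidden file
hidden_demo=$(mktemp -d)
touch "$hidden_demo/visible.txt" "$hidden_demo/.hidden_config"

# Plain ls skips names that start with a dot
ls "$hidden_demo"
#> visible.txt

# ls -a also shows .hidden_config, plus . (this directory) and .. (its parent)
ls -a "$hidden_demo"
```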


Subsetting fastq files for testing

When debugging a script, it often helps to subset the fastq files first. This significantly reduces the processing time for every step in the script, compared with running all the data through at once.

# read the compressed forward fastq file,
# piping the result into sed to extract the
# first 20 million lines (equivalent to 5M sequences,
# as one sequence entry always spans 4 lines),
# and finally compressing it as a new file
# (on macOS, zcat may not accept .gz files; gunzip -c does the same job)

zcat pair_1.fastq.gz | sed -n 1,20000000p | gzip -c > pair-subset_1.fastq.gz

# same for read 2 (the reverse read) if your data is Paired-End
zcat pair_2.fastq.gz | sed -n 1,20000000p | gzip -c > pair-subset_2.fastq.gz
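It is worth checking that a subset holds the number of reads you expect. The sketch below does this on a toy file rather than real data: the read names, sequences and file names are invented, and it swaps in head for sed and gzip -dc for zcat, which behave the same way here.

```shell
# Build a toy FASTQ file with 10 made-up reads (4 lines per read)
toy_dir=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8 9 10; do
  printf '@read%s\nACGTACGT\n+\nIIIIIIII\n' "$i"
done | gzip -c > "$toy_dir/toy_1.fastq.gz"

# Keep the first 3 reads, i.e. the first 3 * 4 = 12 lines
n_reads=3
gzip -dc "$toy_dir/toy_1.fastq.gz" | head -n $((n_reads * 4)) | gzip -c > "$toy_dir/toy-subset_1.fastq.gz"

# Count the lines in the subset and divide by 4 to get the number of reads
subset_lines=$(gzip -dc "$toy_dir/toy-subset_1.fastq.gz" | wc -l | tr -d ' ')
echo "reads in subset: $((subset_lines / 4))"
#> reads in subset: 3
```

The line count of a valid fastq file should always be a multiple of 4, so a remainder here is a sign that the subset was cut mid-record.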



... ...