Cheatsheet of useful commands¶

Here are some useful bash commands for doing command-line bioinformatics. These draw upon some of the basics described in other pages in this section.

Unwrapping FASTA files¶

A lot of software outputs wrapped FASTA files, where sequence lines are truncated at a certain number of characters and continued on the following line, like this:

>sequence1
ATACGACGATAAGCAGACTAGACATATT
CGCCCGATTGACCG
>sequence2
ACGCGCTGAAAAGCAGACTTTTTACACG
CGCGTAAACAACGC

This can be really annoying for quick and easy command line bash bioinformatics, because the number of lines for each sequence is unpredictable. The following perl one-liner, from this biostars answer, easily unwraps a FASTA file:

perl -pe '$. > 1 and /^>/ ? print "\n" : chomp' input.fasta > output.fasta

This command is great because it can easily be used within a pipe.

Replacing bases¶

You may sometimes have sequence data that isn’t quite appropriate for a piece of software, for example you have an alignment but the software won’t accept it, or you have ambiguity codes but the software only accepts A, T, C, G or N. You can use sed to remove or replace all offending bases, but you must be careful only to do so on non-header lines - you don’t want to edit your sequence headers by accident! We do this by adding a line specifier, /^[^>]/, before the substitution command. This specifies that we want to only operate on lines that do not start with >.

Degap an alignment (i.e. de-align):

sed -e "/^[^>]/s/-//g" input.fasta > output.fasta

Replace any ambiguities with N:

sed -e "/^[^>]/s/[YRWSKMDVHBX]/N/g" input.fasta > output.fasta

Getting sequence lengths¶

Here we delete header lines, delete gaps then count how many characters each line has.

sed -e "/^>/d" -e "s/-//g" input.fasta | awk '{ print length }' > lengths.txt

The FASTA file must be unwrapped - if it’s not, you could add the perl one-liner above to the beginning of the pipe.

perl -pe '$. > 1 and /^>/ ? print "\n" : chomp' input.fasta |
sed -e "/^>/d" -e "s/-//g" | awk '{ print length }' > lengths.txt

What to with these lengths? Well, we can get a rough idea of the distribution by counting the number of each unique length:

sed -e "/^>/d" -e "s/-//g" input.fasta |
awk '{ print length }' |
sort | uniq -c | sort -hr

Or we could use an R one-liner to count the mean and standard deviation. Note that we’ve split this over lines to

sed -e "/^>/d" -e "s/-//g" input.fasta |
awk '{ print length }' |
Rscript -e 'd<-scan("stdin",quiet=TRUE);
            print(c(round(mean(d),0),
                    round(100 * mean(c(mean(d)-min(d),
                                        max(d)-mean(d)))/mean(d),
                          1)));'

Generating a random FASTA¶

Sometimes we want some dummy data to test on. This command makes 30 sequences, each 300 bases long:

paste -d '\n'     <(for i in {01..30}; do echo ">seq$i"; done)     <( cat /dev/urandom | tr -dc 'ATCG' | fold -w 300 | head -n 30 )     > output.fasta

Get the headers of a FASTA¶

There are two ways to do this:

grep "^>" input.fasta | sed -e "s/>//" > headers.txt
grep -oP "(?<=^>).*$" input.fasta > headers.txt

The second one uses a fancy Regular Expression term called a “positive lookbehind”.

Extract information from tables¶

Say I have a table, data.csv, that looks like this:

seq1,Coleoptera,425
seq2,Diptera,256
seq3,Hemiptera,786

If I wanted to find the sequence names that are Coleoptera, I could do this:

grep "Coleoptera" data.csv | cut -d, -f1 > headers.txt

Or the sequences where column three is greater than 500:

awk -F',' ' $3 > 500 ' data.csv | cut -d, -f1 > headers.txt

The functions awk or grep do the searching, while cut extracts the first column from the result. If your table is tab-delimited, you can skip the -F argument in awk and the -d argument in cut, tab-delimited is the default for both.

Extract sequence by header¶

Say we have a list of sequence headers and we want only those sequences from a larger (unwrapped!) FASTA.

grep --no-group-separator -A1 -F -f headers.txt sequences.fasta > output.fasta

By default grep prints only lines that match, the -A1 adds one line after each match as well (this is why it must be unwrapped!). The -F means the search looks for fixed text strings, not RegEx, and the -f looks for a list of search terms supplied in a file.

We can of course pipe this with some of our previous examples. Note that the standard input to grep is the file to be searched, so we need to redirect standard input elsewhere:

grep "Coleoptera" data.csv | cut -d, -f1 | grep --no-group-separator -A1 -f -F /dev/stdin sequences.fasta > output.fasta