The grep command

The grep command is a very powerful tool for searching and subsetting a text file. The grep command performs searching using a standardised search language called Regular Expressions, or regex. Regular expressions are short code expressions that detail how to search for some text, generally in order to include varying or unknown sections of text.

Note

Learning Regular Expressions in detail is beyond the scope of these resources, but there are plenty of great tutorials on the web.

  • RegexOne provides a great interactive tutorial for learning the basics of Regular Expressions
  • regular-expressions.info is a good reference site to look up particular expressions
  • regex101.com is a really handy tool to test your RegEx

Here we’ll cover some basic ways we can use grep to help us with bioinformatics. There are two important facts to remember about how grep operates by default:

  • grep searches line by line for a match to the search term
  • any lines that match the term are returned

Data

If you want some FASTQ files to test the below commands on, you can use any of the files in the 0_rawsequences/ directory of the sectionA archive.

Finding a sequence motif

We can use grep to find instances of a particular sequence in a sequence file. For example, we can search these files for a particular index sequence, e.g. AACACC. Remember, replace file.fastq with the name of a FASTQ file.

grep "AACACC" file.fastq

This will print lots and lots of lines, because this is a very common motif. This is too much to handle, so let’s instead just count the number of lines that have a match:

grep -c "AACACC" file.fastq

In this specific case, this index is used for demultiplexing, so we might want to review some sequences to check that it’s in the right place. We’ll just look at the first three sequences (first 12 lines) using the head command, then send that to grep. We’re going to add |$ to our regular expression. | in regular expressions means “or”, and $ means “the end of a line”. So we’re searching for either our sequence of interest, or the end of a line. This means that every line will match (since every line ends), but only our sequenc of interest will be highlighted. The -E tells grep to use extended regular expressions, which is needed to make this search work.

head -n 12 file.fastq | grep -E "AACACC|$"

Note that the | between the head and grep commands means that the output of the head command is being piped (sent) to the input of the grep command. This is a completely different meaning to the use of | as “or” in the regular expression.

Counting sequences

We saw the -c option earlier - this is really useful for counting up the number of sequences in a FASTA or FASTQ file. This is because the structure of these files is very predictable. Each sequence will have one and only one header line, and this line will start with > for FASTA files and @ for FASTQ files. So for FASTA files, we can count the number of sequences by counting the number of lines that start with >:

grep -c "^>" file.fasta

The ^ symbol means “the start of a line”. So the regex ^> means “the start of a line immediately followed by >.

For FASTQ files it’s a little more tricky, because @ is also a symbol used to denote a base quality score, and it’s possible that the first base of a sequence has this score. So if we just searched for the number of lines starting with @, we’d risk overestimating because we’d be counting all of the header lines and all of the quality score lines starting with @. However, if the sequences you’re working with are sequence reads, it’s likely that they have very similar headers, because the sequence matchine has named them. Check this by looking at the first 20 lines of a FASTQ file: you’ll see that they probably all start with the same string, often a sequence machine identifier code.

head -20 file.fastq

You can then add this to the regular expression to search more accurately. For example, say our sequence headers all start with D00

grep -c "^@D00"

Note that this isn’t infallable - with millions of reads, it’s possible that you might have one that has a quality score starting with @D00 - all of these characters are quality score values. If you need to be 100% certain, it’s better to use a tool that explicitely parses FASTQ files.