Fundamentals: FASTQ files

Introduction

This page reviews a few basic points about sequence file structure and how to explore these files using the command line. If you have experience working with FASTQ files, you may want to move straight on to the demultiplexing section.

Data and software

This tutorial works with FASTQ format sequence data that contains indices at the beginning of the reads. The example data for this can be found in the 0_rawsequences directory within the sectionA archive. If you haven’t already, you should copy this directory over to your working directory as follows:

cp -r path/to/exampledata/sectionA/0_rawsequences/ ./

Exploring FASTQ files

Run the following to change into the directory containing the starting data and list its contents, showing sizes.

cd 0_rawsequences/
ls -lh

We can see how many lines are in each file using the word count command, wc, with the -l option to report the number of lines:

wc -l *.fastq

The *.fastq here means we want all of the files in the directory whose names end in .fastq. We could replace this with a single file name if we just wanted to count the lines of a single file.

To look at the contents of a file, use the head command, which prints the first lines of a file. Replace file in the following command with the name of a single FASTQ file.

head -n 10 file

You will see the FASTQ format, in which each read comprises a header, the sequence, and the quality scores. A useful point to note is that every header line in these files starts with @D00; we use this later to count reads. If the structure of this file is completely new to you, take a few minutes to read the first section of the FASTQ Wikipedia page.
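As a rough illustration of the record layout, the sketch below writes a tiny made-up FASTQ file to a temporary directory and labels each line by its role within a read. The file name demo.fastq and the read headers are invented for illustration, and the sketch assumes each read occupies exactly four lines (no wrapped sequences).

```shell
# Create a tiny, made-up FASTQ file (two reads) in a temporary directory.
# The file name and headers are hypothetical, not taken from the example data.
tmpdir=$(mktemp -d)
demo="$tmpdir/demo.fastq"
printf '%s\n' \
  '@D00000:1:DEMO:1:1101:1000:2000 1:N:0:1' \
  'ACGTACGT' \
  '+' \
  'IIIIHHGG' \
  '@D00000:1:DEMO:1:1101:1001:2001 1:N:0:1' \
  'TTGCATGC' \
  '+' \
  'IIHHGGFF' \
  > "$demo"

# Label each line by its position within the four-line record.
labelled=$(awk '{
  pos = (NR - 1) % 4
  if (pos == 0)      label = "header"
  else if (pos == 1) label = "sequence"
  else if (pos == 2) label = "separator"
  else               label = "quality"
  print label "\t" $0
}' "$demo")
echo "$labelled"

# Remove the temporary directory.
rm -r "$tmpdir"
```

The same awk command can be pointed at one of the real files, piped through head to keep the output short.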

To get specific lines from a file, use the sed command:

sed -n '4,8p' file     # prints lines 4-8

Use this to have a look at different parts of some of the other files.
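Because each read normally occupies four lines, the line range for the nth read can be calculated and passed to sed. The following is a minimal sketch on a made-up file (demo.fastq and its headers are hypothetical), assuming no read is wrapped over more than four lines.

```shell
# Create a tiny, made-up FASTQ file (two reads) in a temporary directory.
tmpdir=$(mktemp -d)
demo="$tmpdir/demo.fastq"
printf '%s\n' \
  '@D00000:1:DEMO:1:1101:1000:2000 1:N:0:1' \
  'ACGTACGT' \
  '+' \
  'IIIIHHGG' \
  '@D00000:1:DEMO:1:1101:1001:2001 1:N:0:1' \
  'TTGCATGC' \
  '+' \
  'IIHHGGFF' \
  > "$demo"

# Read number n occupies lines 4n-3 to 4n.
n=2
start=$(( (n - 1) * 4 + 1 ))
end=$(( n * 4 ))

# Double quotes let the shell substitute the line numbers into the sed script.
record=$(sed -n "${start},${end}p" "$demo")
echo "$record"

# Remove the temporary directory.
rm -r "$tmpdir"
```

Changing n selects a different read; with n=2 the range is lines 5 to 8.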

Note that the R1 and R2 files from the same library have the same read headers, apart from a 1 or 2 in the second part of the name. Reads with the same header were read from the same location on the sequencer, so they are assumed to be the forward and reverse read of the same fragment. Such reads are called paired reads, and each read is the mate of the other. Some later processing steps require that both the forward and reverse read of each fragment are present and at the same relative position in the paired files ("in sync"), so it is important to keep them that way.
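One way to check whether a pair of files is in sync is to compare the first part of every header line, in order, between the two files. The sketch below does this on two made-up files; the names demo_R1.fastq and demo_R2.fastq and the headers are hypothetical, and it assumes four lines per read and a header whose first space-separated field is shared between mates.

```shell
# Create a made-up pair of FASTQ files (two reads each) in a temporary directory.
tmpdir=$(mktemp -d)
r1="$tmpdir/demo_R1.fastq"
r2="$tmpdir/demo_R2.fastq"
printf '%s\n' \
  '@D00000:1:DEMO:1:1101:1000:2000 1:N:0:1' 'ACGTACGT' '+' 'IIIIHHGG' \
  '@D00000:1:DEMO:1:1101:1001:2001 1:N:0:1' 'TTGCATGC' '+' 'IIHHGGFF' \
  > "$r1"
printf '%s\n' \
  '@D00000:1:DEMO:1:1101:1000:2000 2:N:0:1' 'GGCATTAC' '+' 'IIIHHHGG' \
  '@D00000:1:DEMO:1:1101:1001:2001 2:N:0:1' 'CATGGTCA' '+' 'IIHHGGFF' \
  > "$r2"

# Take the first field of every header line (lines 1, 5, 9, ...).
awk 'NR % 4 == 1 { print $1 }' "$r1" > "$tmpdir/r1.ids"
awk 'NR % 4 == 1 { print $1 }' "$r2" > "$tmpdir/r2.ids"

# cmp -s is silent and succeeds only if the two lists are identical.
if cmp -s "$tmpdir/r1.ids" "$tmpdir/r2.ids"; then
  sync_status="in sync"
else
  sync_status="out of sync"
fi
echo "$sync_status"

# Remove the temporary directory.
rm -r "$tmpdir"
```

The comparison is order-sensitive on purpose: two files containing the same reads in a different order would be reported as out of sync.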

We can use grep with the -c option, which counts matching lines, to count the number of sequences in a file (again, replace file with the name of one of the files):

grep -c "^@D00" file

If you want to learn more about grep, see its manual page (run man grep).

As with wc -l above, we can run grep on all of our files at once to get the total number of reads in each of our libraries:

grep -c "^@D00" *.fastq

We can see that we’re dealing with about 10,000 - 12,000 reads per library.
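A quality line can, in principle, also begin with @D00, since @, D and 0 are all valid quality characters, so a pattern-based count may occasionally overcount. As a cross-check, the number of reads can be estimated as the line count divided by four. The sketch below compares the two approaches on a made-up file (demo.fastq and its contents are hypothetical, with a quality line deliberately chosen to start with @D00), assuming four lines per read.

```shell
# Create a tiny, made-up FASTQ file (two reads) in a temporary directory.
# The last quality line deliberately starts with "@D00".
tmpdir=$(mktemp -d)
demo="$tmpdir/demo.fastq"
printf '%s\n' \
  '@D00000:1:DEMO:1:1101:1000:2000 1:N:0:1' \
  'ACGTACGT' \
  '+' \
  'IIIIHHGG' \
  '@D00000:1:DEMO:1:1101:1001:2001 1:N:0:1' \
  'TTGCATGC' \
  '+' \
  '@D00IHGF' \
  > "$demo"

# Count by header pattern, as in the text.
grep_count=$(grep -c '^@D00' "$demo")

# Count by structure: total lines divided by four.
line_count=$(( $(wc -l < "$demo") / 4 ))

echo "grep -c:   $grep_count"
echo "lines / 4: $line_count"

# Remove the temporary directory.
rm -r "$tmpdir"
```

On this contrived file grep reports 3 and the line-based count reports 2. If the two numbers agree for a real file, the grep count can be trusted.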

Exercise

Does every library have the same number of reads in its R1 and R2 files?