Extension: Denoising by sample

Introduction

Here we will denoise each sample individually, rather than the entire dataset at once. This requires going back to the data generated in an earlier tutorial. We strongly recommend working in a separate directory for this experiment, and using clear file names so you don’t get the data mixed up!

Data and software

The input data for this tutorial is a directory containing one FASTQ or FASTA file per sample. If you’ve been following along step-by-step, this is the data produced in the pair merging tutorial in the previous section. Alternatively, the 3_merged directory within the sectionA archive can be used as example data.

This extension uses the VSEARCH software.

Denoising by sample

Exercise

If your input data is FASTQ files, quality filter each FASTQ file using the vsearch --fastx_filter... command from the quality filtering tutorial, running it on one sample at a time using a bash loop (hint: there’s a loop command at the end of that tutorial).
Run the dereplication command (vsearch --derep_fulllength...) from the dereplication tutorial on each sample individually using a bash loop.
Run the denoising command (vsearch --cluster_unoise...) from the denoising tutorial on each sample individually using a bash loop.
Concatenate the result files using sed in the same way as in the data concatenation tutorial.
Re-dereplicate the output file using the following command:

vsearch --derep_fulllength input.fasta --sizein --sizeout --relabel uniq --output output.fasta

Compare the total unique read numbers and size distributions of your output to the version produced by whole-dataset denoising.
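
As a starting point for the steps above, here is a minimal sketch of the per-sample loop and of the final comparison. It is one possible approach, not the tutorial's reference solution. The function names (denoise_by_sample, size_summary), the directory arguments and the .fasta extension are all hypothetical. The vsearch calls use default settings, so add whatever filtering and denoising options you used in the earlier tutorials.

```shell
# Hypothetical helper: dereplicate, then denoise, each FASTA file in a directory.
# Arguments: input directory, directory for dereplicated files, directory for denoised files.
# Assumes one .fasta file per sample; adjust the extension to match your data.
denoise_by_sample() {
    local indir="$1" derepdir="$2" outdir="$3"
    local f sample
    mkdir -p "$derepdir" "$outdir"
    for f in "$indir"/*.fasta; do
        [ -e "$f" ] || continue              # skip if the glob matched nothing
        sample="$(basename "$f" .fasta)"
        # Collapse identical reads within this sample and record their abundance
        vsearch --derep_fulllength "$f" --sizeout \
            --output "$derepdir/$sample.fasta"
        # Denoise this sample on its own, keeping abundance annotations
        vsearch --cluster_unoise "$derepdir/$sample.fasta" --sizein --sizeout \
            --centroids "$outdir/$sample.fasta"
    done
}

# Hypothetical helper: tabulate how many sequences carry each ;size=N annotation.
size_summary() {
    grep '^>' "$1" |
        sed -n 's/.*;size=\([0-9][0-9]*\).*/\1/p' |
        sort -n | uniq -c |
        awk '{ printf "size=%s\t%s sequences\n", $2, $1 }'
}
```

For example, denoise_by_sample 3_merged derep denoised would process every .fasta file in 3_merged. After concatenating and re-dereplicating, grep -c '^>' output.fasta counts the unique sequences, and size_summary output.fasta lists the size distribution; running both on the whole-dataset result gives the figures to compare against.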