3. Frequency filtering / denoising

Introduction

Because metabarcoding datasets are generated using PCR, each real variant of the amplicon of interest should be represented by plenty of copies, and thus by many reads - depending on sequencing depth, of course. We can therefore be relatively confident that ASVs occurring at very low frequencies in the dataset are more likely to be errors.

Software and data

The input data for this tutorial is a FASTA file comprising all of the unique sequences (ASVs) in your dataset. If you’re following along step-by-step, this was produced in the previous tutorial. Alternatively, the file 5_mbc_derep.fasta within the sectionB archive can be used as example data.

This tutorial uses the VSEARCH software.

Simple frequency filtering

We can apply a simple threshold, keeping only ASVs that are present at or above a given number of reads in the entire dataset. A standard threshold here would be 2, i.e. any singletons will be removed. This is a very straightforward filter to apply; once again, it could be performed by any number of different software tools, but we will use VSEARCH because we probably have it installed already. Run the below command, remembering to replace input.fasta with the input data (see the software and data box above) and output.fasta with a sensible name.

vsearch --fastx_filter input.fasta --minsize 2 --fastaout output.fasta
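
To see how many ASVs the filter removed, you can count the FASTA header lines in the input and output files. This is just a quick check, assuming one standard > header line per sequence:

grep -c ">" input.fasta output.fasta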

This is a straightforward filter, but it doesn't take into account all of the realities of amplification and sequencing. If an error arose in the later stages of PCR, or occurred during sequencing of a relatively rare ASV, then it is reasonable to expect that error to be infrequent; but if an error arose in the early stages of PCR, it would have been amplified many times and may be far from rare.

Denoising

A much more sophisticated approach to filtering errors is denoising. Denoising is crucial to modern metabarcoding pipelines, especially as increased work with metazoan communities shows the scale of erroneous ASVs present in datasets. Denoising algorithms use read frequency and sequence composition to infer likely sequencing errors: ASVs are compared against one another, and frequency thresholds are applied relative to the counts of similar (but more common) ASVs. Instead of doing the size filtering as part of dereplication, we will do it as part of a denoising command. We will use the UNOISE3 algorithm, again implemented in VSEARCH; this is just one way that denoising can be performed, albeit the most commonly used algorithm. Run the following command, remembering to replace input.fasta with the input data (see the software and data box above) and output.fasta with a sensible name.

vsearch --cluster_unoise input.fasta --minsize 4 --unoise_alpha 2 --centroids output.fasta

The key parameter here is the alpha parameter (--unoise_alpha), which determines the threshold level of dissimilarity between frequent and infrequent ASVs at which infrequent ASVs are excluded as likely errors. Note that we're using a less conservative minsize threshold than the default of 8 because of the smaller size of this example dataset.
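
For the curious: in the original UNOISE description (Edgar 2016), an infrequent sequence is classed as a likely error of a more abundant one when the ratio of their abundances falls below a threshold β(d) = 1/2^(αd + 1), where d is the number of differences between the two sequences and α is the alpha parameter. Lower alpha values raise this threshold, so more infrequent ASVs are classed as errors (stricter); higher alpha values retain more ASVs (more permissive).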

Exercise

  • How does the number of sequences filtered by this command compare with the output of the simple abundance threshold run before?
  • What is the effect on the number of sequences and size distribution of those sequences of varying the alpha parameter? (A loop sketch to help with this is given below.)

You can get the size distribution by running:

grep -oP "(?<=size=).*$" input.fasta | sort -n | uniq -c

In the output of this command, the second column is the count of reads, and the first column is the count of these counts. So for example, if a row says 24 12, there are 24 ASVs that have 12 reads each.
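
For the second exercise question, a small shell loop can save some typing. This is only a sketch: the alpha values and the denoised_alpha* output names are arbitrary choices, and input.fasta should again be replaced with your input data.

for alpha in 1 2 3 5; do
  # denoise with the current alpha value, writing to a distinctly named output
  vsearch --cluster_unoise input.fasta --minsize 4 --unoise_alpha $alpha --centroids denoised_alpha${alpha}.fasta
  # report how many ASVs were retained at this alpha
  echo "alpha=${alpha}: $(grep -c '>' denoised_alpha${alpha}.fasta) ASVs"
done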

Selecting denoising parameters

Denoising removed a large proportion of our ASVs, and it might seem like a large amount of data is being thrown away. However, while this removed a lot of unique sequences, the vast majority of them were rarer ASVs, and relative to the total number of reads in the dataset relatively few reads were discarded. Unfortunately, denoising is one of those steps where we cannot choose thresholds and parameters based on an a priori expectation; the settings must be tuned to the dataset and the research questions.

As a baseline, it is sensible to always discard singletons. Beyond this, if the dataset is large, the research questions are tolerant to the loss of rare real sequences and/or it is important that as high a proportion of ASVs as possible be valid, then stricter settings (higher --minsize, lower --unoise_alpha) would be sensible. If the dataset is small and the preservation of as many valid ASVs as possible is preferred despite the probable retention of errors, then less strict settings should be used. The important point is that the decision is justified by the research questions, and that the dataset is sequenced deeply enough to be resilient to the chosen settings.
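
To make this concrete, here are two illustrative extremes; the exact values are arbitrary examples for this discussion, not recommendations:

vsearch --cluster_unoise input.fasta --minsize 8 --unoise_alpha 1 --centroids strict_output.fasta

vsearch --cluster_unoise input.fasta --minsize 2 --unoise_alpha 3 --centroids permissive_output.fasta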

Next steps

The standard usage of UNOISE works on the entire dataset, and thus considers the frequency of ASVs relative to one another over the entire dataset. However, individual samples within a dataset are rarely evenly sequenced, so some ASVs may be rarer because they occur in undersequenced samples, not because they are errors. It may therefore be more appropriate to denoise at the level of individual samples. If you'd like to try this, see Extension: Denoising by sample. Note that it may take a little while, so you may choose to come back to it later.

You should have produced many denoised outputs while you were playing around with the thresholds. We suggest you clean these up and proceed using the output from the command given above, i.e. with --minsize 4 --unoise_alpha 2. This file contains a more filtered set of ASVs than we started with, but there are still other sources of error we can address, starting with 4. Indel filtering.
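
As for cleaning up those test outputs: if they used a consistent naming scheme (for example, the hypothetical denoised_alpha* names from the loop sketch above), the cleanup can be a one-liner; double-check that the glob matches only files you are happy to delete:

rm denoised_alpha*.fasta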