Comparing OTU delimitation methods
Introduction
We’ve reviewed three different methods of delimiting OTUs, each with a range of parameter values that could be applied, so we could have generated a lot of different outcomes. Selecting between these is not straightforward, but it can be substantially aided by a clear understanding not only of how many OTUs a given method and parameter set generates, but also of which specific sequences different methods produce. Here we’ll briefly cover one way you can get at this information, and then we’ll discuss how you might go about choosing which OTU delimitation method and parameters to use for your own project.
Data and software
This tutorial requires that you’ve performed OTU delimitation at least twice, using different methods and/or thresholds, and that for each OTU delimitation you have a FASTA file of OTUs and a text file recording ASV assignment to OTUs. If you’re following along step-by-step, these would have been produced in any of the other tutorials in this subsection.
Alternatively, you can use the following file pairs from the sectionC archive as example data:
- Greedy clustering: otus_greedy_0.97.fasta and asvgroups_greedy_0.97.uc
- Linkage delimitation: otus_linkage_13.fasta and asvgroups_linkage_13.txt
- Bayesian clustering: otus_crop_s.fasta and asvgroups_crop_s.cluster.list
- Phylogenetic delimitation: otus_bPTP.fasta and asvgroups_bPTP.txt
This tutorial uses the VSEARCH software.
Getting started
Create a new directory and place copies of all of the FASTA files you want to compare, and only these FASTA files, into this directory. I’m going to assume you’ve called this directory delim_outputs/.
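For example, if you’re working with the example files listed above and they’re in your current directory (adjust the file names to match your own outputs), the setup might look something like this:
mkdir delim_outputs
cp otus_greedy_0.97.fasta otus_linkage_13.fasta otus_crop_s.fasta otus_bPTP.fasta delim_outputs/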
We want to strip out any ;size= tags in these files, because they’re not necessary for what we’re doing.
for f in delim_outputs/*; do
sed -i -e "s/;size=.*$//" "$f"
done
The -i in the sed command modifies the file “in place”, meaning it saves over the existing version of the file. You should be careful with this option!
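If you’d rather not overwrite your only copies, one alternative (just a sketch, not part of the commands above) is to write cleaned copies into a separate directory and then use that directory in place of delim_outputs/ in the steps that follow:
# hypothetical alternative: keep the originals untouched and clean into delim_clean/
mkdir delim_clean
for f in delim_outputs/*; do
sed -e "s/;size=.*$//" "$f" > delim_clean/"${f#*/}"
done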
Tracking ASVs to OTUs
Next we will merge all of the OTUs into a single file, then search these sequences against the original ASVs. This will give us a table showing which ASVs were used as OTU centroids for each method/parameter combination.
First, concatenate all of your OTUs. You might notice we’re using a similar method to one from earlier in the pipeline - this is a great example of taking tools in our toolbox that we already understand and applying them to new use cases.
for f in delim_outputs/*; do n=${f#*/}; n=${n%.fasta}; sed -e "/^>/s/$/;sample=$n/" "$f" >> concatenatedOTUs.fasta; done
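It’s worth quickly checking that the sample tags were added as expected. The ASV labels will depend on your data, but each header line should now end in a ;sample= tag named after its source file:
grep "^>" concatenatedOTUs.fasta | head -n 5
# each header should end in something like ;sample=otus_greedy_0.97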
Next, we map these OTUs back to the ASVs. The database here (asvs.fasta in the command below) should be the ASV file that you used as the input for all of the OTU delimitation runs you did.
vsearch --search_exact concatenatedOTUs.fasta --db asvs.fasta --otutabout output.tsv
Here we simply search each of the OTU sequences against the ASVs to find which ASV each OTU is based upon. This is known as the OTU centroid sequence, even though not all of the methods necessarily use a centroid-based approach. Think of it as the “representative” sequence for an OTU: it’s generally the most frequent ASV and/or the sequence that is closest to the “average” ASV in an OTU.
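Before downloading the table, it can help to glance at its structure. Assuming VSEARCH has picked up the ;sample= tags we added, the first column should list ASV labels and the remaining columns should correspond to the different delimitation outputs:
head -n 5 output.tsv
# rows: ASVs that exactly matched an OTU sequence (i.e. centroids)
# columns: one per delimitation output, taken from the ;sample= tags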
You can now download this table and see which ASVs formed OTU centroids for which method and parameter combination. This isn’t the complete picture, of course, because it doesn’t record which ASVs merged into which OTUs for each different method, but it gives a broad overview of how similar different OTU methods are in terms of the OTUs they produce. If you wanted to look more closely at which ASVs formed each OTU in each different method, that’s possible but beyond the scope of these resources. We’d suggest mapping ASVs to OTUs as described within the different methods, and then bringing all of that data together with this table in R.
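For the first exercise question below, a quick way to see how many OTUs each method produced is simply to count the FASTA headers in each file:
grep -c "^>" delim_outputs/*.fasta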
Exercise
- How do the number of OTUs and the composition of centroid sequences vary between methods?
- Are there any patterns you can identify?
- Do approximately equivalent parameter settings produce similar results across methods? For example, 97% similarity for greedy clustering, 13 differences for SWARM and -s for Bayesian delimitation are roughly equivalent.
Selecting methods and parameters
There is rarely an easy choice of which method or parameter to use. Generally, we’d recommend four steps you can take to make a decision:
- Read the documentation and any papers about the algorithm carefully and try to get a good understanding of how it works, what its assumptions are and what the authors recommend or intend. If you know its assumptions, you can compare them against your dataset, and this may guide you. Remember that the authors didn’t necessarily design the method with your exact kind of data in mind, so their recommendations may not necessarily be valid for you.
- Read other papers that perform metabarcoding for similar research questions and/or similar taxonomic groups as your project, and see what they chose and why. If in doubt, going with the consensus is generally the most justifiable choice, but you should make it carefully: there is no guarantee that other authors have thought of all possibilities, and there’s a risk that a single method becomes the consensus simply because everyone else is using it.
- Try out different options! Hopefully we’ve shown you that it is possible to check lots of different methods and parameters, and that it is not too difficult to compare and contrast a) the number of OTUs and b) the centroid sequences selected. While this doesn’t necessarily help you make a final choice, you may find that multiple methods and/or parameters produce the same or very similar results, and this can help you narrow things down.
- Use positive controls. If you’re still in the earlier stages of your planning, you could make sure to include one or more positive control samples of known composition, ideally with a diversity similar to your study samples. Then, when you get to the OTU delimitation stage you can experiment to see what delimitation method gives you the number of OTUs you expect for this sample.
Next steps
Choose a set of OTUs that you think looks reasonable. If you’re stumped, just go with the consensus: the majority of metabarcoders use 97% similarity greedy clustering. Alternatively, if you think this whole OTU delimitation exercise is arbitrary and artificial, well, there’s certainly a trend towards this sort of thinking: you could just use the ASVs (just bear in mind that from now on we’ll refer to OTUs only, even if you’re working with ASVs). Whichever you choose, make sure you clearly name this file so you can find it later.
In the next subsection, we’re going to look at how you can generate the ecological data you’ve been waiting for, by finding how many reads of each OTU are present in each sample, and by trying some methods for taxonomic classification and identification of OTUs. This is in the mapping reads subsection.