Bayesian clustering

Introduction

The software CROP was originally designed for 16S and has not been updated in a while. Nonetheless, it uses an interesting algorithm based on assuming sequences are a gaussian mixture and models the process of clustering using Markov-chain Monte Carlo simulations. This approach probably bears more similarity to greedy clustering than to swarm, but is sufficiently different that it is interesting to compare the methods.

Data and software

The input data for this tutorial is a FASTA file comprising unique sequences (ASVs). If you’re following along step-by-step, this was produced at the end of the previous section. Alternatively, the file 9_mbc_final.fasta within the sectionC archive can be used as example data.

This tutorial uses the VSEARCH and CROP software.

Getting started

CROP needs every unique read for each ASV in the input dataset, it cannot read ;size= tags. We can generate this from our input ASVs quite easily though, using the --rereplicate function of VSEARCH.

vsearch --rereplicate input.fasta --output output.fasta

Running CROP

We then use this rereplicated FASTA file in CROP. Note it creates several outputs using output as a name base. You should of course replace this with a sensible name.

Try running the following command:

crop -b 40 -z 400 -s -r 0 -i input.fasta -o output

The options -b, -z and -r are for optimising the MCMC process. You can read more about these in their documentation. You can probably leave them alone unless you’re really interested.

Option ​-s specifies that we want the equivalent of 3% dissimilarity clustering. We could change this to -g which is equivalent to 5%. You can alternatively supply values to -l and -u , for example ​-l 1 -u 3 , which are the lower and upper bounds of similarity levels.

CROP creates a bunch of files, you may want to ​mv​ them to their own folder. You could create a folder for each CROP run you do and make sure you’re in that directory when you run it. The output.cluster.fasta file contains the centroid sequences for each cluster, i.e. the representative OTU sequences.

Exercise

  • Run CROP at the 5% level using the -g option in place of -s.
  • How does the number of OTUs change?
  • Run greedy clustering at the 3% and 5% dissimilarity levels. How do the number of OTUs compare against CROP?

Tracking our ASVs

The outname.cluster.list file that CROP produces is a text file that contains a list of the sequences included in each cluster. Run more to have a look at it. You’ll note that on each line, the same ASV is repeated multiple times. This is because we rereplicated our ASVs earlier, and CROP worked on these ASV reads, each of which had the same name. We’ll return to this file later when we want to map our reads.

Next steps

We’ve generated a set of OTUs using one method. If you haven’t already, try out other types of OTU clustering: linkage delimitation or greedy clustering. You could optionally also look at the extension: phylogenetic delimitation.

If you’ve now tried all of the methods, the comparing OTU delimitation methods tutorial outlines some ways to compare these methods and some hints for selecting a method and parameters.

Once you have a final set of OTUs and an associated record of ASVs within each OTU, you can proceed with these two files to the mapping reads subsection to learn about mapping your reads to your OTUs. Later on, we’ll also use the OTU sequences to generate a phylogeny of our OTUs in the Building OTU phylogeny section, and to retrieve taxonomic identifications and/or classifications for the OTU sequences in the Identifying OTU sequences section.