5. Point error filtering

Introduction

Filtering by length will remove sequences that have one or more PCR/sequencer-caused insertions or one or more deletions, however in some cases these errors may cancel one another out; or alternatively, PCR or sequencing may induce the equivalent of point mutations, where a single base is misread. Similarly, noncoding gene variants such as NUMTs or pseudogenes may actually have point mutations in comparison to the ‘true’ region.

We can identify some point errors because they will alter the translation of the genetic code in such a way that it becomes meaningless - if the barcode region is a coding region, of course. The obvious error is the introduction of stop codons into the translation. By translating all of our sequences and checking for stop codons, we can easily reject these errors or variants.

Data and software

The input data for this tutorial is a FASTA file comprising unique sequences (ASVs). If you’re following along step-by-step, this was produced in the previous tutorial. Alternatively, the file 7_mbc_indelfil.fasta within the sectionB archive can be used as example data.

This tutorial uses an accessory function, filtertranslate, from the metaMATE software.

Filtering by translation

Check the helpfile for this script by running:

filtertranslate --help

Exercise

  • Figure out what the command is to run filtertranslate with all of the following options:
  1. using automatic reading frame detection
  2. outputting both succeeding and failing sequences in separate files
  • Hint: check the usage line to figure out where some of the arguments go. Don’t forget, our samples are arthropods.

Solution

filtertranslate -i input.fasta -t 5 -y separate -o output

Exercise

  • Have a look at the failed file.
  • Go to an online amino acid translator (e.g. here) and paste in a sequence. Make sure to set the correct genetic code.
  • See what the translation looks like. Frame 2 is the correct frame.
  • Can you see the stop(s)?

Other ‘point errors’ that do not cause stops are harder to spot. Some will not affect coding at all, which is impossible to distinguish from natural variation. The majority will affect coding, but again distinguishing these from natural variation is very hard. One possibility is to look at the structure of the translated protein and see if it’s realistic, but there aren’t currently any tools that can do this…

Next steps

We now have a set of ASVs that are all of the correct length, with a lot of errors hopefully removed. The final thing to do is remove chimeras, which we will do in the next tutorial: 6. Chimera filtering.