4. Concatenating alignments

Introduction

For phylogenetic reconstruction we need to consolidate our genetic data into a single file in which each specimen has one sequence containing all of its data. We concatenate the 13 gene alignments to form a superalignment comprising all of our sequences. This is also known as a supermatrix: every sequence is now the same length, so the data form a very large table in which the rows are the source specimens, the columns are base positions and each cell is a single base. It is this data that phylogenetic reconstruction will work on.

Software and data

The input data for this tutorial is a directory of gene alignments in FASTA format, as produced in the previous tutorial.
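Before concatenating, it can help to see how many sequences each alignment contains, since the files may not all have the same number of taxa. The following is a minimal sketch, not part of catfasta2phyml.pl: count_taxa is a hypothetical helper name, and it assumes the directory contains only FASTA alignment files.

```shell
# Hypothetical helper: print the number of sequences (header lines
# starting with ">") in every file of a directory, one file per line,
# as "<count><TAB><file>".
# Assumption: the directory holds only FASTA alignments.
count_taxa() {
    for f in "$1"/*; do
        printf '%s\t%s\n' "$(grep -c '^>' "$f")" "$f"
    done
}

# Usage (replace dir with the name of your alignment directory):
#   count_taxa dir
```

If the counts differ between files, some taxa are missing from some genes.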

This section uses the catfasta2phyml.pl script.

Performing concatenation

We use the catfasta2phyml.pl command to concatenate the aligned files into a supermatrix.

Exercise

  • As always, check the help file before running. We want to force concatenation of all files even when the number of taxa differs between them, and we want the output in FASTA format.
  • See if you can figure out the command, then run it.

Solution

dir should of course be replaced with the name of the directory containing your alignments, and supermatrix.fasta with a sensible name.

catfasta2phyml.pl -c -fasta dir/* > supermatrix.fasta
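As a quick sanity check, every sequence in a supermatrix should be the same length. The sketch below is one way to confirm this with standard tools. fasta_lengths is a hypothetical helper name, not part of catfasta2phyml.pl, and it assumes an ordinary FASTA file whose sequences may be wrapped over several lines.

```shell
# Hypothetical helper: print each distinct sequence length found in a
# FASTA file, one per line. A valid supermatrix gives exactly one line.
fasta_lengths() {
    awk '
        # Header line: emit the length of the previous record, then reset.
        /^>/ { if (name != "") print len; name = $0; len = 0; next }
        # Sequence line: strip whitespace and add to the running length.
        { gsub(/[ \t\r]/, ""); len += length($0) }
        # Emit the length of the final record.
        END { if (name != "") print len }
    ' "$1" | sort -nu
}

# Usage (replace supermatrix.fasta with the name of your output file):
#   fasta_lengths supermatrix.fasta
```

More than one line of output would mean the sequences differ in length and the concatenation should be checked.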

The standard catfasta2phyml.pl command prints the partitions to the terminal rather than into the output file. To save them to a file, we redirect that second output stream with 2>, as in the code below.

Solution

Note that we print the partitions to the terminal after saving them. Any error messages are also written to partitions.txt, so we check its contents to make sure the command ran OK.

catfasta2phyml.pl -c -fasta dir/* > supermatrix.fasta 2> partitions.txt
cat partitions.txt

Next steps

Now that we have a supermatrix, we can finally move on to 5 Tree Building.