Preparing genomic data for phylogeny reconstruction

Creative Commons License: CC-BY Questions:
  • How do I find a set of common proteins (orthologs) across related species or strains?

  • How do I organize a set of orthologs to infer evolutionary relations between species or strains (phylogenetic reconstruction)?

  • Mask repetitive elements from a genome

  • Annotate (predict protein-coding genes) the genomes of the samples to compare

  • Find a set of common proteins across the samples (orthologs)

  • Align orthologs across samples

Time estimation: 3 hours
Supporting Materials:
Published: Oct 21, 2022
Last modification: Oct 28, 2022
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
purl PURL:
version Revision: 22

A robust and well-resolved phylogenetic classification is essential to understand genetic relationships within and between species and the evolution of their phenotypic diversity. In the last decade the genomic revolution has represented a drastic change in the amount of data used for phylogenetic inference. The single-gene approach using universal phylogenetic markers for the different lineages across the tree of life, is now being replaced by the assembly of taxon-rich and genome-scale data matrices, the so called phylogenomic approach.

Molecular sequence data can be used to construct a phylogeny by comparing differences between nucleotide or amino acid sequences across species or strains, a technique called phylogenomics. Young and Gillung 2019 have written a comprehensive review on the topic of phylogenomics.

In this tutorial we prepare genetic sequence data for phylogenetic reconstruction, using sequences from chromosome 5 of five strains of the yeast Saccharomyces cerevisiae.

To prepare the data, we will:

  1. Predict protein coding genes from the genome using Funannotate
  2. Find the proteins that are present in more than one genome, called orthologs, using Proteinortho and extract a subset with orthologs that are present in all samples.
  3. Align each set of orthologs using ClustalW.

The resulting dataset is ready to be used for phylogenetic reconstruction.

If you are starting from sequence reads, please follow An Introduction to Genome Assembly, and the appropriate genome assembly training for your sequencing technology from GTN’s Assembly section


In this tutorial, we will cover:

  1. Get data
  2. Protein coding gene prediction
    1. Mask repetitive sequences
    2. Annotate with Funannotate
    3. Extract ORFs into FASTA files
  3. Find orthologs
    1. Quality control with Busco
    2. Extract proteins with Proteinortho
  4. Align ortholog sequences
  5. Conclusion

Get data

For this training we will use a subset of the genome (chromosome 5) from four strains of S. cerevisiae. The GenBank annotated sequenced were produced using ‘funannotate predict annotation’ (Galaxy Version 1.8.9+galaxy2) on the nucleotide sequences sequences.

Hands-on: Data upload
  1. Create a new history for this tutorial
  2. Import the files from Zenodo or from the shared data library (GTN - Material -> ecology -> Preparing genomic data for phylogeny reconstruction):
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

  3. Optional: Rename each dataset to its accession number followed by ‘.genome.fasta’ or ‘.genbank’.
  4. Group the datasets into collections. These will ease data handling and help minimize the clutter in your history. Make a collection of nucleotide sequences and another of protein sequences.

    • Click on galaxy-selector Select Items at the top of the history panel Select Items button
    • Check all the datasets in your history you would like to include
    • Click n of N selected and choose Build Dataset List

      build list collection menu item

    • Enter a name for your collection
    • Click Create List to build your collection
    • Click on the checkmark icon at the top of your history again

    Creating a simple collection

Protein coding gene prediction

Mask repetitive sequences

Before we can annotate the genome, we will prepare the data by masking repetitive sequences in the genome. Repeat-rich regions can interfere with genome annotation tools. In this step we find and soft-mask repetitive regions in the genome. The annotation tool can then take this information into account (see RepeatMasker tutorial for more details). We use RepeatMasker Smit et al. 2013, a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. This program has historically made use of RepBase (Kohany et al. 2006), a service of the Genetic Information Research Institute, but this database in no longer open access. Instead, we will use Dfam (Storer et al. 2021) an open collection of Transposable Element DNA sequence alignments, HMMs derived from Repbase sequences and consensus sequences. For this reason, the annotation of repetitive sequences might be incomplete.

RepeatMasker will only accept compact fasta headers. Before we can mask repetitive regions with RepeatMasker we must trim the NCBI long header (BK006939.2 TPA_inf: Saccharomyces cerevisiae S288C chromosome V, complete sequence) to leave only the accession number (>BK006939.2) by using a regular expression.

Hands-on: Mask repetitive sequences
  1. Replace Text ( Galaxy version 1.1.2) with the following parameters:
    • param-collection “File to process”: output (Input dataset collection)
    • In “Replacement”:
      • param-repeat “Insert Replacement”
        • “Find pattern”: (>[^ ]+).+
        • “Replace with:”: \1


          \1 replaces the text in the header with the first matched text in the header, the accession number in this case.

  2. RepeatMasker ( Galaxy version 4.1.2-p1+galaxy0) with the following parameters:
    • param-file “Genomic DNA”: outfile (output of Replace Text tool)
    • “Repeat library source”: DFam (curated only, bundled with RepeatMasker)
      • “Select species name from a list?”: No
        • “Repeat source species”: "Saccharomyces cerevisiae"

    If you don’t select the species from the list, you must provide a species name (“Saccharomyces cerevisiae” in this example). In principal, all unique clade names occurring in NCBI taxonomy database can be used for species. Capitalization is ignored, multiple words need to bound by apostrophes. Not all “common” English names occur in the taxonomy database. Using Latin names is always safest.

  3. Inspect the ‘RepeatMasker masked sequence on data’ output file. Scroll down and you will find stretches of ‘N’ on the location of repetitive sequences. The file is now ready for annotation with Funannotate.

Annotate with Funannotate

We will predict protein-coding genes from genomic sequences using Funannotate (Young and Gillung 2019), which collects evidence from different ab-initio gene predictors as well as from RNA-seq or ESTs data. Funannotate has been developed for Fungi but it works with any Eukaryotic genome. The output of Funannotate is a list of ORFs and their translation in GenBank format.


If you would like to learn about genome annotation in more depth, the GTN has a section dedicated to training on genome annotation, including a hands-on tutorial on Funannotate.

Warning: Slow Step Ahead!

Even for a small dataset, Funannotate can take a very long time to run. You can skip this step and use the Genbank files downloaded from Zenodo for the following step. These were generated using Funannotate as described in the hands-on below.

Hands-on: Annotate genome
  1. Funannotate predict annotation ( Galaxy version 1.8.9+galaxy2) with the following parameters:
    • param-file “Assembly to annotate”: output_masked_genome (output of RepeatMasker tool)
    • In “Organism”:
      • “Name of the species to annotate”: Saccharomyces cerevisiae
      • “Is it a fungus species?”: Yes
      • “Ploidy of assembly”: 1
    • In “Evidences”:
      • “Select protein evidences”: Use UniProtKb/SwissProt (from selected Funannotate database)

    If available, include mRNA and/or ESTs in this evidence section to increase sensitivity of predictions.

    • In “Busco”:
      • “BUSCO models to align”: saccharomycetes
      • “Initial Augustus species training set for BUSCO alignment”: saccharomyces
    • In “Augustus settings (advanced)”:
      • “Minimum number of models to train Augustus”: 15

    When annotating full genomes, increase the number of ‘Minimum number of models to train Augustus’ to an appropriate value. For the small sample dataset used here, values larger than 15 will result in failure.

  2. Inspect the output GenBank file. The FEATURES section contains the genome annotation of protein predictions, their location and their translation. Each predicted protein is given a unique ID, which will become the FASTA header in the next step.

Extract ORFs into FASTA files

In this step, we extract the protein sequences from the GenBank files into multi-FASTA files. Additionally, we modify the headers to include the accession number of the sample. This creates an unique accession-proteinID header for each predicted protein, and will allow us to retrieve them after we combine all predicted proteins into one multi-FASTA file.

So far, we have kept sequence files in a collection. Now we will combine all predicted proteins from all samples into one big multi-fasta file. Later, we will retrieve ortholog sequences from this file.

Hands-on: Extract ORFs
  1. Extract ORF ( Galaxy version 3.02) with the following parameters:
    • param-file “gene bank file”: annot_gbk (output of Funannotate predict annotation tool - or the genbank files downloaded from Zenodo)
  2. Regex Find And Replace ( Galaxy version 1.0.1) with the following parameters:
    • param-file “Select lines from”: aa_output (output of Extract ORF tool)
    • In “Check”:
      • param-repeat “Insert Check”
        • “Find Regex”: >([^ ]+).+
        • “Replacement”: >#{input_name}_\1
  3. Collapse Collection ( Galaxy version 4.2) with the following parameters:
    • param-file “Collection of files to collapse into single dataset”: out_file1 (output of Regex Find And Replace tool)
    • “Prepend File name”: Yes
  1. Is the number of predicted ORFs the same across the samples?
  1. No, the number of predicted ORFs ranges from 193 to 199.

Find orthologs

Orthologs are genes in different species evolved from a common ancestral gene by a speciation (lineage-splitting) event and contain the information needed for building phylogenies. The input for this step is the collection of multi-FASTA extracted from the GenBank file and modified to have unique headers. The result file ‘orthology-groups’ contains one row per orthogroup and one column per sample.

Next, we select orthogroups where all the species are represented by only one protein, 1:1 single copy orthologs (a total of 4 proteins per orthogroup for this dataset).

Hands-on: Find and filter orthologs
  1. Proteinortho ( Galaxy version 6.0.14+galaxy2.9.1) with the following parameters:
    • param-file “Select the input fasta files (>2)”: out_file1 (output of Regex Find And Replace tool)
    • “Activate synteny feature (POFF)”: no
  2. Filter with the following parameters:
    • param-file “Filter”: proteinortho (output of Proteinortho tool)
    • “With following condition”: c1==4 and c2==4
  3. Inspect the ‘orthology-groups’ tabular file.
  1. Inspect the the ‘orthology-groups’ tabular file.

    1. Do any samples contain more than one gene for any given orthogroup?
    2. What is the name given to these homologous genes within the same genome?
    1. Yes, CM000925.1 contains two genes on the first orthogroup.
    2. Paralogs. These are gene copies created by a duplication event within the same genome.
  2. Look at the filtered orthogroups.

    1. How many orthogroups are represented once only in all four samples?
    1. 173

Quality control with Busco

Busco is a dataset of nearl-universal, single-copy orthologs. Here we use it for quality control of the assembly and annotation produced above. It outputs a proxy of completeness, duplication and fragmentation of the annotation (Busco can also be used to assess the completeness of a genome assembly).

Hands-on: QC with Busco
  1. Busco ( Galaxy version 4.1.4) with the following parameters:
    • param-file “Sequences to analyse”: out_file1 (output of Regex Find And Replace tool)
    • “Mode”: Proteome
    • “Lineage”: Saccharomycetes
    • In “Advanced Options”:
      • “Augustus species model”: Use the default species for selected lineage

    Make sure you select the ‘lineage’ closest to the species(s) you are analyzing.

  1. The Complete (‘C’) metric stands for complete Busco genes identified. Look at the Busco output file ‘Short summary’ for sample BK006939.2. What is the ‘C’ number?
  2. Why is it so low?
  1. 4.0%
  2. Our dataset contains only chromosome 5 of the yeast genome.

Extract proteins with Proteinortho

Next we extract 1:1 single copy orthologs and generate one multi fasta file per ortholog.

Hands-on: Extract protein sequences
  1. Proteinortho grab proteins ( Galaxy version 6.0.14+galaxy2.9.1) with the following parameters:
    • param-file “Select the input fasta files”: output (output of Collapse Collection tool)
    • “Query type”: orthology-groups output file
      • param-file “A orthology-groups file”: out_file1 (output of Filter tool)

The output is a collection of multi-fasta ortholog files. All species are represented in each file and are ready to be aligned.

Align ortholog sequences

Alignment of sequences allows cross-species (or other taxonomic level, strain, taxa) comparison. An alignment is a hypothesis of positional homology between bases/amino acids of different sequences. A correct alignment is crucial for phylogenetic inference. Here we use ClustaW, a well-established pairwise sequence aligner.

First we modify the headers of the multi-fasta file, such that only the sample name is retained. This is important for future conatenation of alignment into a supermatrix (see Maximum likelihood GTN training)

Hands-on: Align orthologs with ClustalW
  1. Regex Find And Replace ( Galaxy version 1.0.1) with the following parameters:
    • param-file “Select lines from”: listproteinorthograbproteins (output of Proteinortho grab proteins tool)
    • In “Check”:
      • param-repeat “Insert Check”
        • “Find Regex”: (>[^_]+).+
        • “Replacement”: \1
  2. ClustalW ( Galaxy version 2.1) with the following parameters:
    • param-file “FASTA file”: out_file1 (output of Regex Find And Replace tool)
    • “Data type”: Protein sequences
    • “Output alignment format”: FASTA format
    • “Output complete alignment (or specify part to output)”: Complete alignment
  1. Open the ClustalW output ‘queryOrthoGroup121.fasta’ and its corresponding multifasta input. You can compare them side by side activating the ‘Scratchbook’ on the top panel. What is the main difference between the sequences in the unaligned multifasta (input) and the ClustalW output multifasta?
  1. Two of the sequences from this orthogroup are truncated (early stop codon). The alignment program inserts ‘-‘ to represent indels in the alignment.


In this tutorial, you have prepared genome sequence data for phylogenetic analysis. First, you have extracted information from these in the form of predicted protein sequences. You then grouped the predicted proteins into orthogroups and aligned them. These aligned sequences can now be used for reconstructing a phylogeny and building a phylogenetic tree.