Phylogenetics - Back to Basics - Estimating trees from alignments

Contributors

Michael Charleston

last_modification Published: May 10, 2024

last_modification Last Updated: May 10, 2024

Likelihood

A way to select models, given data

Why Likelihood?

Statistical rigour! Likelihood requires a statistical model and gains scientific rigour to defend hypotheses made.
Comparing hypotheses! We can compare models of evolution and determine which is better supported by the data.
Accuracy! Likelihood is generally the most accurate method to estimate phylogenies.

Likelihood Definition

The likelihood of a model is a quantity that is proportional to the probability that the model gave you the data.

It is easy(ish) to calculate the probability that a given model yields a particular outcome – the data set.

Maximum Likelihood (ML) seeks to find the model for which this probability is the highest.

Dice example

Imagine I have a set of very strange 10-sided dice, where the sides of my .blue[blue] die are labelled (1,1,2,3,4,5,5,6,6,6), and the sides of the .red[red] die are labelled (1,1,2,2,3,3,4,4,5,6).

Both dice have the same range of possible outcomes, 1 through 6, but the relative probabilities are different.

score	P.red[red]	P.blue[blue]
1	0.2	0.2
2	0.1	0.2
3	0.1	0.2
4	0.1	0.2
5	0.2	0.1
6	0.3	0.1
Total	1.0	1.0

Dice example

Suppose I choose a die and roll it three times, getting values 3, 6, and 2.

Which die is most likely to have been used?

If it were the .blue[blue] die then the probability of this outcome is \(0.1\times 0.3 \times 0.1 = 0.003\).

If it were the .red[red] die then this probability is \(0.2\times 0.1 \times 0.2 = 0.004\).

Neither of these probabilities is high! But the larger of the two is for the .red[red] die: thus, given the data, we would say the maximum likelihood estimate of the “model” (here, which die was used) is the .red[red] die.

Slightly formal, with Bayes

We could write this example more formally as

Data \(D\) is the outcome \((3,6,2)\);
\(P(\color{blue}{B})\) and \(P(\color{red}{R})\) are the probabilities of choosing the blue and red dice (say 0.5 each);

\(P(D

\color{blue}{B})\) and \(P(D

\color{red}{R})\) are the probabilities of the data given the model “blue die” and “red die”;

\(P(\color{blue}{B} D)\) is the probability of the model given the data, and it can be written

\( P(\color{blue}{B}|D) = \frac{P(D|\color{blue}{B})P(\color{blue}{B})}{P(D)} = \frac{P(D|\color{blue}{B})P(\color{blue}{B})}{P(D|\color{blue}{B})P(\color{blue}{B})+P(D|\color{red}{R})P(\color{red}{R})} \)

– this is Bayes’ formula.

Similarly we can write \(P(\color{red}{R} D)\) as

\( P(\color{red}{R}|D) = \frac{P(D|\color{red}{R})P(\color{red}{R})}{P(D)} = \frac{P(D|\color{red}{R})P(\color{red}{R})}{P(D|\color{blue}{B})P(\color{blue}{B})+P(D|\color{red}{R})P(\color{red}{R})} \)

These conditional probabilities can be thought of as the relative likelihoods of the model, given the data.

Important Concepts learned

The dice example was very simple but it showed us important concepts:

We could turn around the probability of the data given a model to the probability of a model, given the data;
Of the models considered, it was possible to choose one that was more likely;
These likelihoods can be really small, and don’t have to sum to 1 (so they’re not probabilities really).

Likelihood values in practice

In phylogenetics the probability of an alignment given a tree and its probabilities become really tiny so in order to avoid underflow numerical errors we use the natural log of the probabilities, and call it the log-likelihood.

Probability of a substitution

Now suppose we have a probability matrix whose entries are the probabilities of nucleotide substitution in some fixed time, say 1 million years:

.image-50[ Formula where the probability of nucleotide substitutions P is expressed in a matrix with the nucleotides AGCT on both the x and y axis. The probability of changing from any nucleotide to another is 0.01. Probability of not changing is 0.97. Long description provided in the video recording at 11:20. ]

We could then calculate the probability of going from one nucleotide to another, say A to G, in 1M years by reading it from the graph:

\( Pr(A\to G) = P_{A,G} = 0.01. \)

Probability of sequences changing

To calculate the probability of multiple nucleotides changing we multiply the individual probabilities together – which is making direct use of the assumption that each nucleotide site evolves independently of the others.

We would then calculate the probability of going from sequence AAGT to AAGA as

\[\begin{align} prob & = P(A\to A)\times P(A\to G) \times P(G\to G) \times P(T\to A) \\\\ & = 0.97 \times 0.01 \times 0.97 \times 0.01 \\\\ & \approx 9.4\times 10^{-6} \end{align}\]

Changing the time

To calculate the probability of sequences changing in our toy example for say two million years, we would square the probability matrix \(P\):

.image-50[ Formula where the probability of sequences changing over two million years P2 is expressed in a matrix with the nucleotides AGCT on both the x and y axis. The probability of changing from any nucleotide to another is 0.0196. Probability of not changing is 0.9412.
Long description provided in the video recording at 13:12. ]

Notice that the probability of going from A to G is not quite twice what it was before, since now there is a small chance that A will change to G and then back again.

So if we have a probability matrix for a fixed time period we can convert it to a probability matrix for multiples of that time.

Probabilities to Rates

This doesn’t generalise to arbitrary time periods, so instead we use a rate matrix, like this:

.image-50[ Formula where the rate of nucleic acid subtitutions Q is expressed as a matrix with the nucleotides AGCT on both the x and y axis. The rate of change from any nucleotide to another is 0.01. The rate of not changing is -0.03. Long description provided in the video recording at 15:56 ]

Converting a rate matrix to a probability requires a matrix exponentiation, giving formulae like this:

\( P(t) = e^{Qt}. \)

We won’t worry here about technical details, but note that some classes of rate matrix are easier to work with than others, the easiest being the JC69 (one parameter) matrix.

Likelihood uses rate matrices

Many models of sequence evolution available, including different base (nucleotide) frequencies, and different sites being allowed different rates.
Likelihoods will be expressed as a (negative) log-likelihood, and these are used to compare models.
Typically these will be quite large (e.g., -37000, -80000), but it is the relative likelihoods that matter, so the difference between log likelihoods.

A huge range of models

More parameter-rich models fit better; we need to avoid over-fitting.
The (log)-likelihood value can be adjusted to penalise the model, based on the number of parameters.
Common penalties are the Akaike Information Criterion (AIC).
There is also a “corrected” AIC (AICc), and a “Bayesian Information Criterion” penalty (BIC).

Penalised likelihood

For a model with \(k\) free parameters and a log likelihood \(L\), here are the penalty functions:

AIC is the original information-theoretic correction: \( AIC = 2k-2\ln(L)\)

AICc is a correction for small sample sizes and is equal to \( AICc = AIC + \frac{2k^{2}+2k}{n-k-1} \)

BIC is defined by \( BIC = k\ln(n) - 2\ln(L)\) (where \(n\) is a measure of sample size).

All of these are in common use but the commonest is AIC, and that is the method we use in the tutorial.

Phylogenetic likelihood is mostly directionless

The likelihood calculations in this tutorial are based on reversible models: that means that there is no difference between the likelihood of going from sequence A to sequence B of from B to A.

This means that the output trees are unrooted.

(There are some asymmetric models included in IQTree but they are well beyond the scope of this tutorial and are not in common use.)

Likelihood of a tree and model

Now we have the huge task:

ML Search:
	for every tree $T$ in tree space {
		for each parameter in the substitution rate matrix {
			for each branch length {
				Calculate the (log-)likelihood of the alignment
			}
		}
	}

… and find the best combination!

All this means that Maximum Likelihood is slow.

Likelihood under JC69

.image-50[ Formula where rate of nucleic acid substitutions according to Jukes-Cantor 69 model QJC69 are expressed as a matrix with the nucleotides AGCT on both the x and y axis. The rate of change from any nucleotide to another is represented by the character alpha. The rate of not changing is represented by an asterisk. Long description provided in the video recording at 26:47 ]

This model has only one parameter.
Note: we cannot separate this from the overall rate (just as you can’t tell people how “far” somewhere is by assuming how fast they’ll travel).
Thus tree search is “only” over all trees and all branch lengths.

Likelihood under HKY85

The HKY85 model has four parameters: \(\pi_{A}\), \(\pi_{G}\), \(\pi_{C}\) (and this fixes \(\pi_{T}\)), and a transition/ transversion rate ratio \(\kappa\).
Can fix \(\pi\) or estimate it from the observed frequencies, or even estimate it by searching for the best values (which means it takes longer of course).

GTR

The General Time Reversible model has nine parameters: three from the base frequencies and six more from the parameters in the substitution rate matrix:

.image-50[ Formula where rate of nucleic acid substitutions according to the GTR model QGTR is expressed as a matrix with the nucleotides AGCT on both the x and y axis. The rate of change from any nucleotide to another incorporates nine parameters represented by Greek letters. The rate of not changing is represented by an asterisk. Long description provided in the video recording at 30:00 ]

(The Greek letters are not standardised: they just show the structure.)

Rates across sites

Some sites are non-coding: they evolve faster than others.
Third codon position sites are largely redundant: for most amino acids, the third nucleotide is free to vary.
Some are under strong selection and evolve slowly.
Some are so important that they must stay in the same state else a protein will not fold properly: these are fixed.

This means that even if the same rate matrix might be applicable at different sites, the overall substitution rate might be hundreds of times faster at one site than another.

Models that accommodate this are called “Rates Across Sites” or RAS models.

.blue[By default, IQTree includes RAS models.]

Rates across sites

Schematic representing differences in the rate of substitutions at different sites. A line with two dots on it represents two positions within a gene. Blue arrows connect each dot to separate rate matrices. The matrices are identical but the matrix on the right is annotated 5x indicating that there are 5 times more substitutions at this site. Described at 33:00

While the proportions of substitution types might be the same, the overall rate at these two sites can differ.

IQTree

IQTree is at the forefront of maximum likelihood phylogenetic tools. It is by far the most advanced such tool and is under active development, adding new models and features.

It is available from IQtree.org.

To cite IQTree2, please use

B.Q. Minh, H.A. Schmidt, O. Chernomor, D. Schrempf, M.D. Woodhams, A. von Haeseler, R. Lanfear (2020) IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol., 37:1530-1534. https://doi.org/10.1093/molbev/msaa015

Tree Space

A way to navigate among trees —

What is “tree space”?

whose nodes are trees, and
whose edges are adjacencies between trees, where
each node has some score or weight: here, it will be likelihood.

.left[ As we cannot check every tree we must wander through the set of trees in the hope of finding “good” ones, under the assumption that the best (most likely) trees will have a lot in common with each other.]

.left[ Fortunately this turns out to be the case! The best trees tend to clump together in tree space.]

Tree perturbations 1: NNI

Nearest Neighbour Interchange is a way to transform one tree into another – so we can search for the best tree.

An NNI move consists of effectively collapsing an internal branch of the tree and re-expanding it out in either of two possible ways.

For example, given the original tree

.image-25[ Schematic of an unrooted phylogenetic tree. There are three connected internal nodes x,y,z and five tips A, B, C, D, E. A and B are connected to x; C is connected to Y; E and D are connected to z. ]

there are two internal edges: \(x-y\) and \(y-z\). We can do an NNI move on each of them.

Tree perturbations 1: NNI

Schematic of an unrooted phylogenetic tree. There are three connected internal nodes x,y,z and five tips A, B, C, D, E. A and B are connected to x; C is connected to Y; E and D are connected to z. The edge connecting x and y is coloured red.

Select edge \(xy\) to collapse.

Tree perturbations 1: NNI

\(xy\) is collapsed.

Tree perturbations 1: NNI

Expand \(xy\) in either of two ways.

For an unrooted binary tree with \(n\) leaves, there are \(2(n-3)\) neighbouring trees under this move.

Simple example: five taxa

Fifteen different unrooted binary phylogenetic trees containing five taxa are arranged in a star shaped configuration. Blue lines connect the trees and represent NNI moves. Long description provided at 40:00 in the video recording.

Tree perturbations 2: SPR

Subtree Pruning and Regrafting works by choosing an edge, cutting the tree at that edge, and then reconnecting the smaller part anywhere else in the larger part.

Tree perturbations 2: SPR

Schematic of an unrooted phylogenetic tree. The tree has four connected internal nodes w,x,y,z and six tips A, B, C, D, E, F. A and B are connected to w; C is connected to x; D is connected to y; E and F are connected to z. There is a black dot on the edge connecting w and z.

Trim the root of the subtree…

Tree perturbations 2: SPR

Regraft to \(y-D\),

Tree perturbations 2: SPR

… or \(y-z\),

Tree perturbations 2: SPR

… or \(z-E\),

Tree perturbations 2: SPR

… or \(z-F\).

Under this move there are more neighbours: \(4(n-3)(n-2)\) possible moves.

Tree perturbations 3: TBR

Tree Bisection and Reconnection works by cutting the tree at any internal edge, then re-rooting each of the parts just formed, and reconnecting the roots:

Tree perturbations 3: TBR

Schematic of two phylogenetic trees. In the tree on the left node w is connected to nodes A, B and C. In the tree on the right node z is connected to nodes D, E and F. —

Tree perturbations 3: TBR

Schematic of two phylogenetic trees. In the tree on the left node w is connected to nodes A, B and C. In the tree on the right node z is connected to nodes D, E and F. Black dots are shown on the edges connecting A-w and D-z.

Selecting edges as roots of the new subtrees

Tree perturbations 3: TBR

And reconnect!

Note: There is no neat form for the number of neighbours of a given tree under the SPR moves, but it’s more than TBR.

Photograph of desert landscape with multiple hills. From https://en.wikipedia.org/wiki/File:Painted_Hills_2009.08.13.11.08.52.jpg

Painted Hills near Mitchell, Oregon.

This file is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.

Hill Climbing

Tree Search algorithms work essentially by

Starting with a reasonably good estimate, e.g., using Neighbour-Joining;
Checking all the neighbouring trees under whichever perturbation is being used for an acceptable new tree;
If an acceptable new tree is available, move to it! If not, then stop.
- Most simply, “acceptable” means “better than this one”.
  
  This kind of heuristic search is called hill climbing, from the obvious analogy.

Two networks containing 15 phylogenetic trees are shown. Lines connect the trees in the networks forming a star shaped pattern. The network on the right is additionally annotated with lines that trace over selected connections between trees multiple times. Long description in video recording at 47:17 Image from: https://doi.org/10.1101/746362

This figure from Cao et al. (2019) illustrates moving around in tree space.

Source: https://doi.org/10.1101/746362

Anolis phylogeny example

.pull-left[ Photograph of a green lizard ] .pull-right[ Anolis is a highly speciose genus of lizards – more than 425 species (unless as some would have it, many should be moved to another genus) – native to the Americas.

.reduce50[ Graphic: “Picture of a lizard (Anolis carolinensis) I (DanielCD) took in Atascocita, Texas (USA) on 4/22/05. Atascocita is in southeast Texas just north of Houston.”

This file is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license. ] ] —

`iqtree` output

mac@mac-pc-443 ~/N/p/t/2/AusBioCommons-Phylogenetics> iqtree2 -s anolis-aligned.fasta
IQ-TREE multicore version 2.1.3 COVID-edition for Linux 64-bit built Apr 21 2021
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host:    mac-pc-443 (AVX2, FMA3, 31 GB RAM)
Command: iqtree2 -s anolis-trimmed-aligned-clustal.fasta -redo
Seed:    225340 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Thu Dec 21 13:26:51 2023
Kernel:  AVX+FMA - 1 threads (8 CPU cores detected)

HINT: Use -nt option to specify number of threads because your CPU has 8 cores!
HINT: -nt AUTO will automatically determine the best number of threads to use.

Reading alignment file anolis-trimmed-aligned-clustal.fasta ... Fasta format detected
Alignment most likely contains DNA/RNA sequences
Alignment has 55 sequences with 1462 columns, 1138 distinct patterns
913 parsimony-informative, 172 singleton sites, 377 constant sites
                            Gap/Ambiguity  Composition  p-value
   1  Anolis.acutus                 3.56%    passed     50.37%
   2  A.aeneus                      5.54%    failed      2.25%
   3  A.agassizi                    3.01%    failed      1.82%
   4  A.ahli                        3.15%    passed     81.11%
   5  A.aliniger                    2.74%    passed     43.47%
   6  A.alutaceous                  2.94%    passed      9.34%
   7  A.angusticeps                 2.80%    passed     57.88%
   8  A.bahorucoensis               2.53%    passed     21.76%
   9  A.barahonae                   3.90%    failed      1.72%
  10  A.bartschi                    3.76%    passed     30.22%
  11  A.bimaculatus                 4.65%    passed     32.77%

] —

Create initial parsimony tree by phylogenetic likelihood library (PLL)... 0.010 seconds
Perform fast likelihood tree search using GTR+I+G model...
Estimate model parameters (epsilon = 5.000)
Perform nearest neighbor interchange...
Estimate model parameters (epsilon = 1.000)
1. Initial log-likelihood: -37100.382
2. Current log-likelihood: -37098.639
Optimal log-likelihood: -37097.515
Rate parameters:  A-C: 1.41395  A-G: 4.11730  A-T: 1.55773  C-G: 0.69506  C-T: 7.28507  G-T: 1.00000
Base frequencies:  A: 0.339  C: 0.261  G: 0.118  T: 0.282
Proportion of invariable sites: 0.186
Gamma shape alpha: 0.832
Parameters optimization took 2 rounds (0.144 sec)
Time for fast ML tree search: 0.993 seconds

] —

ModelFinder will test up to 286 DNA models (sample size: 1462) ...
 No. Model         -LnL         df  AIC          AICc         BIC
GTR+F         41407.688    115 83045.375    83065.197    83653.445
GTR+F+I       39252.694    116 78737.387    78757.569    79350.744
GTR+F+G4      37164.452    116 74560.904    74581.085    75174.261
GTR+F+I+G4    37096.959    117 74427.917    74448.462    75046.562
GTR+F+R2      37597.901    117 75429.802    75450.347    76048.447
GTR+F+R3      37179.249    119 74596.499    74617.780    75225.718
GTR+F+R4      37074.898    121 74391.795    74413.828    75031.590
GTR+F+R5      37037.210    123 74320.421    74343.219    74970.791
GTR+F+R6      37031.233    125 74312.465    74336.043    74973.410
SYM+R5        37262.481    120 74764.963    74786.618    75399.470
SYM+R6        37261.627    122 74767.254    74789.668    75412.336
TVM+F+R5      37097.645    122 74439.290    74461.704    75084.373
TVM+F+R6      37095.642    124 74439.285    74462.471    75094.942
TVMe+R5       37745.606    119 75729.213    75750.494    76358.432
TVMe+R6       37742.081    121 75726.162    75748.195    76365.957
TIM3+F+R5     37076.842    121 74395.685    74417.718    75035.480
TIM3+F+R6     37070.459    123 74386.918    74409.717    75037.288
TIM3e+R5      37747.554    118 75731.107    75752.019    76355.039
TIM3e+R6      37740.393    120 75720.786    75742.442    76355.293
TIM2+F+R5     37042.461    121 74326.921    74348.954    74966.716
TIM2+F+R6     37036.463    123 74318.927    74341.725    74969.297
TIM2e+R5      37274.306    118 74784.612    74805.524    75408.544
TIM2e+R6      37273.302    120 74786.604    74808.259    75421.111
TIM+F+R5      37081.068    121 74404.136    74426.169    75043.931
TIM+F+R6      37074.732    123 74395.464    74418.262    75045.834
TIMe+R5       37757.456    118 75750.913    75771.824    76374.845
TIMe+R6       37750.820    120 75741.641    75763.296    76376.148
TPM3u+F+R5    37129.051    120 74498.103    74519.758    75132.610

] —

Estimate model parameters (epsilon = 0.100)
1. Initial log-likelihood: -37042.461
Optimal log-likelihood: -37042.422
Rate parameters:  A-C: 1.77793  A-G: 5.08796  A-T: 1.77793  C-G: 1.00000  C-T: 8.61510  G-T: 1.00000
Base frequencies:  A: 0.339  C: 0.261  G: 0.118  T: 0.282
Site proportion and rates:  (0.321,0.030) (0.225,0.316) (0.232,1.114) (0.196,2.482) (0.026,6.784)
Parameters optimization took 1 rounds (0.163 sec)
Computing ML distances based on estimated model parameters...
Computing ML distances took 0.054833 sec (of wall-clock time) 0.054830 sec(of CPU time)
Computing RapidNJ tree took 0.000283 sec (of wall-clock time) 0.000282 sec (of CPU time)
Log-likelihood of RapidNJ tree: -37150.315

] —

-------------------------------------------------------------------
|             INITIALIZING CANDIDATE TREE SET                      |
--------------------------------------------------------------------
Generating 98 parsimony trees... 0.879 second
Computing log-likelihood of 98 initial trees ... 1.283 seconds
Current best score: -37042.422

Do NNI search on 20 best initial trees
Estimate model parameters (epsilon = 0.100)
BETTER TREE FOUND at iteration 1: -37042.003
Estimate model parameters (epsilon = 0.100)
BETTER TREE FOUND at iteration 2: -37036.294
Iteration 10 / LogL: -37042.212 / Time: 0h:0m:5s
Iteration 20 / LogL: -37058.987 / Time: 0h:0m:6s
Finish initializing candidate tree set (10)
Current best tree score: -37036.294 / CPU time: 6.458
Number of iterations: 20

]

--------------------------------------------------------------------
|               OPTIMIZING CANDIDATE TREE SET                      |
--------------------------------------------------------------------
Estimate model parameters (epsilon = 0.100)
BETTER TREE FOUND at iteration 27: -37036.066
Iteration 30 / LogL: -37044.404 / Time: 0h:0m:9s (0h:0m:31s left)
Iteration 40 / LogL: -37036.365 / Time: 0h:0m:11s (0h:0m:26s left)
Iteration 50 / LogL: -37036.142 / Time: 0h:0m:14s (0h:0m:22s left)
Iteration 60 / LogL: -37038.691 / Time: 0h:0m:16s (0h:0m:19s left)
Iteration 70 / LogL: -37039.589 / Time: 0h:0m:19s (0h:0m:15s left)
Iteration 80 / LogL: -37049.077 / Time: 0h:0m:21s (0h:0m:12s left)
Iteration 90 / LogL: -37036.399 / Time: 0h:0m:23s (0h:0m:9s left)
Iteration 100 / LogL: -37036.108 / Time: 0h:0m:26s (0h:0m:7s left)
Iteration 110 / LogL: -37036.508 / Time: 0h:0m:28s (0h:0m:4s left)
Iteration 120 / LogL: -37036.242 / Time: 0h:0m:31s (0h:0m:1s left)
TREE SEARCH COMPLETED AFTER 128 ITERATIONS / Time: 0h:0m:33s

]

--------------------------------------------------------------------
|                    FINALIZING TREE SEARCH                        |
--------------------------------------------------------------------
Performs final model parameters optimization
Estimate model parameters (epsilon = 0.010)
1. Initial log-likelihood: -37036.066
Optimal log-likelihood: -37036.065
Rate parameters:  A-C: 1.78140  A-G: 5.12638  A-T: 1.78140  C-G: 1.00000  C-T: 8.65177  G-T: 1.00000
Base frequencies:  A: 0.339  C: 0.261  G: 0.118  T: 0.282
Site proportion and rates:  (0.320,0.029) (0.222,0.311) (0.230,1.085) (0.200,2.454) (0.027,6.674)
Parameters optimization took 1 rounds (0.074 sec)
BEST SCORE FOUND : -37036.065
Total tree length: 10.598

Total number of iterations: 128
CPU time used for tree search: 32.804 sec (0h:0m:32s)
Wall-clock time used for tree search: 33.018 sec (0h:0m:33s)
Total CPU time used: 33.400 sec (0h:0m:33s)
Total wall-clock time used: 33.500 sec (0h:0m:33s)

Analysis results written to:
  IQ-TREE report:                anolis-aligned.fasta.iqtree
  Maximum-likelihood tree:       anolis-aligned.fasta.treefile
  Likelihood distances:          anolis-aligned.fasta.mldist
  Screen log file:               anolis-aligned.fasta.log

Date and Time: Thu Dec 21 13:28:52 2023

]

Screenshot of a phylogenetic tree of Anolis species. The root of the tree is at the left of the image and is connected to Anolis species on the right by multiple branches.

]

Beyond Maximum Likelihood Trees

Bayesian analyses with BEAST and MrBayes
Networks – next video!

Thank you!

Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors! page logo

Tutorial Content is licensed under Creative Commons Attribution 4.0 International License.

Phylogenetics - Back to Basics - Estimating trees from alignments

Contributors

Likelihood

Why Likelihood?

Likelihood Definition

Dice example

Dice example

Slightly formal, with Bayes

Important Concepts learned

Likelihood values in practice

Probability of a substitution

Probability of sequences changing

Changing the time

Probabilities to Rates

Likelihood uses rate matrices

A huge range of models

Penalised likelihood

Phylogenetic likelihood is mostly directionless

Likelihood of a tree and model

Likelihood under JC69

Likelihood under HKY85

GTR

Rates across sites

Rates across sites

IQTree

Tree Space

What is “tree space”?

Tree perturbations 1: NNI

Tree perturbations 1: NNI

Tree perturbations 1: NNI

Tree perturbations 1: NNI

Simple example: five taxa

Tree perturbations 2: SPR

Tree perturbations 2: SPR

Trim the root of the subtree…

Tree perturbations 2: SPR

Regraft to \(y-D\),

Tree perturbations 2: SPR

… or \(y-z\),

Tree perturbations 2: SPR

… or \(z-E\),

Tree perturbations 2: SPR

Tree perturbations 3: TBR

Tree perturbations 3: TBR

Tree perturbations 3: TBR

Selecting edges as roots of the new subtrees

Tree perturbations 3: TBR

Hill Climbing

Anolis phylogeny example

iqtree output

Beyond Maximum Likelihood Trees

Thank you!

Thank you!

`iqtree` output