View markdown source on GitHub

High Performance Computing for Pairwise Genome Comparison




last_modification Published: Feb 8, 2021
last_modification Last Updated: Jul 26, 2021

What is sequence comparison?

Why is it importantg?

What are the challenges?

Why is sequence comparison hard?

Sequence type Approx. length
Gene ~1,000 to 10,000 bp
Bacteria genome ~106 bp
Mammalian chr. ~108 bp
Mammalian genome ~2 * 109 bp
Plant genome ~3 * 109 bp
Large plant genomes ~2 * 1010 bp
Amoeba genome ~6 * 1011 bp

Exponential growth of the Sequence Read Archive

line chart going from 2009 to 2013 with number of bases growing from 10e11 to 10e15, and all bases slowly growing larger than open access, but open access still being the vast majority. A pie chart is overlaid showing the majority is Illumina GA, with SOLiD making up 10% and others a tiny slice.

Methods for sequence comparison

Dot-matrix methods

Montage of three dot plots

Comparing genomes with dotplots

A dotplot is shown, with lines going from bottom left to the top right. Part way through it is fractured and a line is seen in the opposite direction showing an inversion. There is noise scattered around the plot from accidental overlaps. An inset shows some repeats which appear as regular small diagonal lines, repeated in x and y axes.

How useful are raw dotplots?


.pull-right[.image-70[A tiny dot plot is shown, with two sequences and dots scattered across the plot. No pattern emerges due to repeating letters in some sequences.]]

How do we include alignments?



.pull-right[ .image-70[ A table with sequence across top left and right are shown, some boxes are filled with 4 and -4 according to match or mismatches. ] ]

How do we include alignments? Cont. I



.pull-right[ The same chart as before, but more cells are filled in, the numbers get mostly increasingly negative. ]

How do we include alignments? Cont. II

A filled out chart and then several boxes are highlighted showing a best scoring set of matches using a typical algorithm for sequence alignment. The alignment is shown at the left.

How do we include alignments? Cont. III

How do we include alignments? Cont. IV

A larger subset of a table is shown from rows 0 to 100 billion. The text grows quadratically is shown over the table.

How to speed up finding alignments?

How to speed up finding alignments? Cont. I

The same matrix is shown again, but now black diagonal lines appear, circled as k-mers of fixed size.

How to speed up finding alignments? Cont. II

How to speed up finding alignments? Cont. III

Multiple aligned k-mers are grouped into larger runs of sequence and labelled high-scoring pairs.

Recap on Sequence comparison



The GECKO algorithm

Sequence X and Y enter the pipeline and produce two dictionary and dictionary hashes. These are combined into a hits file which shows a dotplot with several circles in diagonals. These are then sorted, filtered, and produce a set of HSPs which show the same dots connected up into diagonal lines.

The GECKO algorithm (Cont. I)

Parallelising GECKO



.pull-right[ A genomes list enters the workflow and is mapped into a workload file which gives all combinations of G1 to GN. This is sent to a job runner and the individual pairwise comparisons are distributed across worker nodes. ] —

Parallelising GECKO (Cont. I)


.pull-right[ Two sequences, G1 and G2 enter, produce dictionaries, and hits which are then combined into sorted Hits and fragmented hits. ] —

GECKO-MGV: interactive sequence comparison

GECKO-MGV: interactive sequence comparison (Cont. I)

Multiple pairwise alignments are shown on the left, in the center is a large dot-plot, and below is a distribution matrix with a large scatter plot. On the right a minimap and various filtering options are shown.

GECKO-MGV: interactive sequence comparison (Cont. II)

GECKO-MGV: interactive sequence comparison (Cont. III)

The same application is shown with a set of repeats highlighted, and pulled out into a table and at the bottom a sequence logo is shown with the multiple sequence alignment.



.pull-right[ A large multi-genome dot-plot is shown. ]

CHROMEISTER: Motivation (Cont. I)


.pull-right[ An even larger, denser dot plot is shown with a very low signal to noise ratio. ]

CHROMEISTER: Motivation (Cont. II)

Two 3d plots are shown, the first resembles the previous dot plot with high noise, and lots of tiny peaks. An arrow points to the second set of "filtered seeds" with several diagonal sets of peaks highlighted.


CHROMEISTER: Methods (Cont. I)

Two sequences 1 and 2 are shown. Each has a single nucleotide change at the start and end.

CHROMEISTER: Methods (Cont. II)

\[d_{raw} = \sum_{i}^{l-1}taxicab(max(H_m(i), H_m(i+1)))\] \[d = \frac{d_{raw}}{l^2}\]


A low signal to noise ratio dot plot plot labelled Macaca Mulatta Chr X vs Homo Sapiens Chr X is reduced to a diagonal line via CHROMEISTER under 2 minutes. Below an Aegilops tauschii Chr.1 vs Triticum aestivum Chr 1 produced with NUCMER is reduced to a very clear dot plot with CHROMEISTER.

CHROMEISTER: Conclusions


.pull-right[ A multi-genome dot plot is shown comparing two genomes. It is very clean and low-noise. ]

Key Points

Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors! Galaxy Training Network Tutorial Content is licensed under Creative Commons Attribution 4.0 International License.