name: inverse layout: true class: center, middle, inverse
---
# High Performance Computing for Pairwise Genome Comparison
Esteban Perez-Wohlfeil
last_modification
Updated:
purl
PURL
:
gxy.io/GTN:S00068
text-document
Plain-text slides
|
Tip:
press
P
to view the presenter notes |
arrow-keys
Use arrow keys to move between slides
??? Presenter notes contain extra information which might be useful if you intend to use these slides for teaching. Press `P` again to switch presenter notes off Press `C` to create a new window where the same presentation will be displayed. This window is linked to the main window. Changing slides on one will cause the slide to change on the other. Useful when presenting. --- ## Requirements Before diving into this slide deck, we recommend you to have a look at: - [Introduction to Galaxy Analyses](/training-material/topics/introduction) --- ### <i class="far fa-question-circle" aria-hidden="true"></i><span class="visually-hidden">question</span> Questions - What is sequence comparison and why does it matter? - What are the challenges and how can we use High Performance Computing to overcome them? - How can we use Galaxy for large scale sequence comparison? --- ### <i class="fas fa-bullseye" aria-hidden="true"></i><span class="visually-hidden">objectives</span> Objectives - Learn the basics of genome comparison - Run sequence comparisons in Galaxy - Using GECKO for sequence comparison - Using CHROMEISTER for comparing large sequences such as plants and chromosomes --- ### What is sequence comparison? - Finding string matches between text paragraphs - Comparing biological sequences can be seen as an instance of common string matching problem but with some particularities - Restricted alphabets: *e.g.* DNA = \{A, C, G, T\} - It adds an additional dimension - instead of looking for only one pattern, we have to look for all possible patterns! --- ### Why is it importantg? - Sequence comparison is a core problem of bioinformatics that allows us to study: - Evolution (Where do we come from? And other species?) - Structure (What are we made of? How does our body work?) - Functionality (How do changes in our DNA affect us?) - Widely used in several fields! - Whole genome alignment, gene prediction, phylogenomics, pharmacogenetics, metagenomics, etc. --- ### What are the challenges? - A lot of progress has taken place since the beginning (around 1960s) - So what is the problem? - New technology produces huge amounts of data - Optimal sequence comparison has complexity O(n<sup>2</sup>), which is a lot when sequences are long - Several ways to compare sequences, each having its advantages: DP-based, gapped and ungapped, alignment-free - DNA contains lots of repeats and low complexity regions, which make it difficult to find the conserved signal - And also makes it very hard for computational algorithms --- ### Why is sequence comparison hard? - The complexity of the problem grows as the length of the sequences becomes larger - That means that for small sequences it is OK; but for large sequences the problem can take a lot of time - Computing power doubles every two years according to Moore’s law but sequence databases grow purely exponential every year Sequence type | Approx. length --- | --- Gene | ~1,000 to 10,000 bp Bacteria genome | ~10<sup>6</sup> bp Mammalian chr. | ~10<sup>8</sup> bp Mammalian genome | ~2 * 10<sup>9</sup> bp Plant genome | ~3 * 10<sup>9</sup> bp Large plant genomes | ~2 * 10<sup>10</sup> bp Amoeba genome | ~6 * 10<sup>11</sup> bp --- ### Exponential growth of the Sequence Read Archive  - Other databases have experienced similar growth too --- ### Methods for sequence comparison - Dot-matrix comparison (no alignments) - Comparing all letters with each other (extremely slow/noisy) - Alignment-free methods - Similar to the previous but typically based on k-mers (words) - Aligning methods - Exhaustive dynamic programming algorithms (Slow but optimal) - Local alignment - Global alignment - Seed-based heuristics - Words of fixed size are used to start alignments (Fast but less sensitivity). Can also include gaps. --- ### Dot-matrix methods - Takes each letter in both sequences and compare it to all the rest - Takes quadratic time and space! - Does not produce alignments, but rather just represents a comparison - The **x** and **y** axis represent the sequences being compared  --- ### Comparing genomes with dotplots  --- ### How useful are raw dotplots? .pull-left[ - Dot matrix methods do not take into account the global information or context - Noise - Where is the signal? ] .pull-right[.image-70[]] --- ### How do we include alignments? .pull-left[ - Alignments have to take into account their surroundings! - Use dynamic programming to compute an accumulated score - Match = 4 - Miss = -4 ] .pull-right[ .image-70[  ] ] --- ### How do we include alignments? Cont. I .pull-left[ - Alignments have to take into account their surroundings! - Use dynamic programming to compute an accumulated score - The score of each cell comes from either the diagonal (no penalty but match/mismatch score) or from a row or column, which is a gap in either the query or reference sequence ("jumping" gap penalty is included then) - Match = 4 - Miss = -4 - **Gap = -iG - eG*L** ] .pull-right[  ] --- ### How do we include alignments? Cont. II  --- ### How do we include alignments? Cont. III - But what if the two sequences are not toy examples? - Typically, a mammalian chromosome is around the 100 Mbp mark - Worst case scenario this algorithm takes quadratic time and space - Imagine comparing two amoeba genomes (Not your usual case, though) - Total number of cells: 10<sup>22</sup> \~the number of sand grains in the earth --- ### How do we include alignments? Cont. IV  --- ### How to speed up finding alignments? - DP algorithms are not feasible for long sequences - Use seeds of fixed size (k-mers) to find anchor points and then perform the alignment - Can be done in linear time - But careful, the number of common seeds can be huge and optimization is still needed --- ### How to speed up finding alignments? Cont. I  --- ### How to speed up finding alignments? Cont. II - Extend the common seeds identified previously to form larger fragments - These are called High-scoring Segment Pairs and do not typically include gaps - Some programs (such as BLAST) include small gaps - This approach also enables to compare distant sequences with many evolutionary rearrangments since fragments do not require to be close to each other - Similar result to the local dynamic programming algorithm (as opposed to the global approach where everything is connected) --- ### How to speed up finding alignments? Cont. III  --- ### Recap on Sequence comparison - Sequence comparison is inherently hard because it requires quadratic time and space to produce optimal alignments - Especially when there are more (and longer) genome sequences available everyday ([Ensembl blog](http://www.ensembl.info/category/01-release/)) - Heuristic approaches based on seeds are required to cope with the complexity of dynamic programming methods - Tools are needed to quickly compare sequences while providing insightful information --- ### Methods - Now that we have a general background on sequence alignment, lets jump to the software that we will use in this tutorial - GECKO: fine-grained sequence comparison that uses secondary memory to enable virtually comparisons of any size - GECKO Multi-Genome-Viewer: A tool for the interactive inspection of sequence comparisons - CHROMEISTER: Ultra-fast approach for the previsualization of sequence comparisons while dealing with noise and repeats, including full mammalian genomes and plants --- ### GECKO - Pairwise genome comparison software which reduces execution time in pairwise and multiple genome comparison studies - It uses an 'out of core' strategy, *i.e.* it runs on secondary memory (disk) as opposed to RAM - Can run anywhere, about ~4 GB of RAM and ~1 TB of disk required for the longest sequences - Faster than state-of-the-art software especially for longer sequences such as chromosomes --- ### The GECKO algorithm  --- ### The GECKO algorithm (Cont. I) - A dictionary is computed for each sequence containing the positional information of each word (possible seed) - Once each dictionary is sorted, perfect matches between words produce a set of seeds (alignment candidates) - Seeds are then sorted (by their diagonal position **x**<sub>start</sub> - **y**<sub>start</sub>) and filtered - Finally, the seeds are extended (up and downstream) to generate a set of High-scoring Segment Pairs (HSPs) --- ### Parallelising GECKO .pull-left[ - In a static distribution cores can become idle if they finish before other cores → resources are wasted - GECKO uses MPI to distribute the workload dynamically based on the length - When a core (worker) has finished executing a task (comparison) the master node broadcasts more work - This reduces overall makespan time because of workload balance ] .pull-right[  ] --- ### Parallelising GECKO (Cont. I) .pull-left[ - GECKO also employs a second level of parallelism besides using MPI to distribute data dynamically - It consists on sorting the initial seeds found (which is the most time-consuming step) using shared memory threads (pthreads and openMP) - Each core uses up to `t` threads to sort the dictionaries containing the seeds ] .pull-right[  ] --- ### GECKO-MGV: interactive sequence comparison - GeckoMGV is a Web-based application aimed to visualize results from multiple genome comparison allowing deep data analysis. It features: - User friendly interface - Multilayer for displaying and overlaying comparisons - Interactive zooming and filtering - External and proprietary post-processing services - Adaptable from equivalent software (*e.g.* MAUVE) - Dendrograms and Multiple Sequence Alignment visualization --- ### GECKO-MGV: interactive sequence comparison (Cont. I)  --- ### GECKO-MGV: interactive sequence comparison (Cont. II) - A typical exercise would include running a sequence comparison with GECKO - Visualizing the alignments with GECKO-MGV - Interactively selecting regions of interest and performing post-processing - Such as selecting repeats, extracting them and performing multiple sequence alignment to find out single point mutations --- ### GECKO-MGV: interactive sequence comparison (Cont. III)  --- ### CHROMEISTER: Motivation .pull-left[ - GECKO is suitable for chromosome comparisons - But all vs all chromosomes between any two species generate usually over 400 comparisons - This results in extremely large computation times - Do all of these even have any similarity signals? - Instead of runnning everything from scratch with GECKO, first use CHROMEISTER to detect significant comparisons ] .pull-right[  ] --- ### CHROMEISTER: Motivation (Cont. I) .pull-left[ - Comparisons tend to be full of repetitions in large mammalians, let alone plants! - State-of-the-art methods get stuck handling large number of repetitions or resulting dotplots are too noisy - Can we extract any information from raw seeds (hits) without having to generate fragments (slower)? ] .pull-right[  ] --- ### CHROMEISTER: Motivation (Cont. II)  --- ### CHROMEISTER: Methods - CHROMEISTER (CHROmosome MEISTER) is an ultra fast heuristic approach to computing extremely large pairwise genome comparison in desktop PCs - Hybrid indexing approach - Heuristically select long, unique and inexact hits! - Probabilistic filtering - Higher concentration of unique hits within HSPs - Apply a kernel to improve visualization --- ### CHROMEISTER: Methods (Cont. I) - Heuristically select unique and inexact seeds: - Hits with frequency above one are repetitions by definition (in a computational sense) so we remove them - Short k-mers produce repetitions and noise, so we have to use long k-mers - But long k-mers produce less seeds due to biological evolution, therefore a certain inexactitude is allowed - Are these two words equal? Under what considerations?  --- ### CHROMEISTER: Methods (Cont. II) - Probabilistic filtering - The probability of a sequence region containing more hits will be higher if there was an existing HSP! - The discrete dotplot is downsampled to a smaller representation to group hits into blocks (similar to HSPs) - Kernel and score calculation - A kernel is convoluted with the dotplot to improve visualization and remove single repetitions - The scoring distance is calculated between the peak hit points using taxicab distance: $$d\_{raw} = \sum\_{i}^{l-1}taxicab(max(H\_m(i), H\_m(i+1)))$$ $$d = \frac{d\_{raw}}{l^2}$$ --- ### CHROMEISTER: Examples  --- ### CHROMEISTER: Conclusions .pull-left[ - Produces a dotplot visualization in minutes using 1 core and 1 GB of RAM - It is used in the plants community - This is due mostly to its ability to filter repeats and to handle large sizes - Also used for analysis of contig/scaffold reordering - (Right) Pairwise genome comparison between Aegilops tauschii (3.2 Gbps) and Triticum aestivum (4.2 Gbps) in under 16 minutes using 1 core and 1 GB of RAM ] .pull-right[  ] --- ### <i class="fas fa-key" aria-hidden="true"></i><span class="visually-hidden">keypoints</span> Key points - Sequence comparison is difficult due to its computational complexity and the growth of databases - There are several approaches to sequencen comparison, and we need to know which one to use (DP-based, alignment-free, seeds, etc.) - We have learnt the internals regarding GECKO, GECKO-MGV and CHROMEISTER to work with single chromosome comparisons and/or exhaustive searchs, the post-processing and for larger/noisy experiments, respectively --- ## Thank You! This material is the result of a collaborative work. Thanks to the [Galaxy Training Network](https://training.galaxyproject.org) and all the contributors!
Author(s)
Esteban Perez-Wohlfeil
Reviewers
Tutorial Content is licensed under
Creative Commons Attribution 4.0 International License
.