
An Introduction to Genome Assembly



PURL: gxy.io/GTN:S00033

1 / 37

Requirements

Before diving into this slide deck, we recommend you to have a look at:

2 / 37

Questions

  • How do we perform a very basic genome assembly from short read data?
3 / 37

Objectives

  • Assemble some paired-end reads using Velvet

  • Examine the output of the assembly

4 / 37

De novo Genome Assembly

With thanks to T Seemann, D Bulach, I Cooke and Simon Gladman

5 / 37

De novo assembly

The process of reconstructing the original DNA sequence from the fragment reads alone.

  • Instinctively like a jigsaw puzzle

    • Find reads which "fit together" (overlap)
    • Could be missing pieces (sequencing bias)
    • Some pieces will be dirty (sequencing errors)

Graphic of a shattered, human-like egg sitting on a wall, dressed in a suit. Several men stand around him attempting to piece back together shattered fragments of the egg.

6 / 37

Another View

A stack of newspapers is labelled genomic DNA. A labelled line points to an image of a room full of shredded paper with people inside, labelled reads. The line then continues to a pile of newspaper clippings labelled draft genome sequence, and finally to a cartoon of a newspaper labelled closed genome sequence.

7 / 37

Assembly: An Example

8 / 37

A small "genome"

The text "Friends, Romans, countrymen, lend me your ears;". A small drawing of Shakespeare has a speech bubble reading "I'll return them tomorrow".

9 / 37

Shakespearomics

A set of reads is shown, with various subsets of the sentence like "ds, romans, count" and "friends, rom" and "crymen, lend me". The c in crymen is highlighted yellow, as it should be trymen (from countrymen.) The drawing of Shakespeare now says "Oops! I dropped them."

10 / 37

Shakespearomics

The reads are shown again, now with overlaps below, reconstructing the sentence from the fragments. Shakespeare says "I'm good with words." The reads "crymen" and "send me your ears" have their first letters highlighted in yellow because they are typos.

11 / 37

Shakespearomics

Finally, a "majority consensus" is shown below the overlaps. Two other reads contain "count" and "countrymen" in addition to our "crymen", so 2 of the 3 reads have the correct text and we go with the majority. The same is done for the other typo. Shakespeare says "We have a consensus!"
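The majority vote described above can be sketched in a few lines. This is a minimal illustration, not how a real assembler is implemented: it assumes the overlap step has already placed each read at a known offset in the sentence, and the offsets, the exact read texts, and the function name `majority_consensus` are made up for this example.

```python
from collections import Counter


def majority_consensus(placed_reads):
    """Return the per-position majority character of already-placed reads.

    placed_reads is a list of (offset, text) pairs; the offsets are assumed
    to come from an earlier overlap step that is not shown here.
    """
    length = max(offset + len(text) for offset, text in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, text in placed_reads:
        for i, char in enumerate(text):
            columns[offset + i][char] += 1
    # Positions with no coverage are reported as "?".
    return "".join(
        col.most_common(1)[0][0] if col else "?" for col in columns
    )


# Hypothetical reads: two of them carry a typo in their first letter.
reads = [
    (0, "friends, rom"),
    (5, "ds, romans, count"),
    (17, "countrymen, lend"),
    (21, "crymen, lend me"),     # "c" should be "t"
    (29, "send me your ears"),   # "s" should be "l"
    (34, "me your ears;"),
]
print(majority_consensus(reads))
```

Each typo is outvoted two to one by the reads that cover the same position, which is why sequencing depth matters for correcting errors.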

12 / 37

So far, so good!

13 / 37

The Awful Truth

A meme image showing Boromir from The Lord of the Rings. The text reads: "One does not simply assemble a genome."

"Genome assembly is impossible." - A/Prof. Mihai Pop

14 / 37

Why is it so hard?

  • Millions of pieces
    • Much, much shorter than the genome
    • Lots of them look similar
  • Missing pieces
    • Some parts can't be sequenced easily
  • Dirty Pieces
    • Lots of errors in reads

A picture of a jigsaw puzzle titled "The world's most difficult" and showing a field of small round candies. It boasts the same artwork on both sides.

15 / 37

Assembly recipe

  • Find all overlaps between reads
    • Hmm, sounds like a lot of work..
  • Build a graph
    • A picture of the read connections
  • Simplify the graph
    • Sequencing errors will mess it up a lot
  • Traverse the graph
    • Trace a sensible path to produce a consensus
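The recipe above can be tried on toy data. The sketch below finds all suffix-prefix overlaps, stores them as a graph (a dictionary of edges), and then repeatedly merges the pair of reads with the longest overlap. The greedy merge is a deliberate simplification swapped in for the graph traversal step, and all function names and the `min_len` parameter are assumptions for this example. It also assumes error-free reads from a single strand.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if too short)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """Edges (a, b) -> overlap length for every ordered pair of reads."""
    graph = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n > 0:
            graph[(a, b)] = n
    return graph


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    reads = list(reads)
    while len(reads) > 1:
        graph = overlap_graph(reads, min_len)  # all-vs-all: a lot of work
        if not graph:
            break
        (a, b), n = max(graph.items(), key=lambda edge: edge[1])
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[n:])
    return reads


toy_reads = [
    "countrymen, lend",
    "friends, rom",
    "lend me your ears;",
    "ds, romans, count",
]
print(greedy_assemble(toy_reads))
```

Recomputing every overlap after each merge is wasteful, which hints at why the first step "sounds like a lot of work" for millions of reads.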
16 / 37

Reads, drawn in the colours of the rainbow, are provided to the algorithm. Next, overlaps are identified and the reads fall into rainbow order. A subset is highlighted and expanded into reads connected by overlaps, with many arrows running between the highlighted blue-green fragments. This leads to the identified Hamiltonian path, with a re-run arrow in between indicating that some amount of backtracking is needed to find the best path. Finally, the Hamiltonian path produces a consensus sequence with the correct final ordering.

17 / 37

A more realistic graph

A graph showing maybe 500 nodes connected with messy lines, it is intentionally impossible to read and a mess to highlight the scope of the problem.

18 / 37

What ruins the graph?

The word "fun" with a strike through it.

  • Read errors

    • Introduces false edges and nodes
  • Non-haploid organisms

    • Heterozygosity causes lots of detours
  • Repeats

    • If they are longer than the read length
    • Cause nodes to be shared, so it is unclear where in the genome each copy belongs
19 / 37

Repeats

20 / 37

What is a repeat?

A segment of DNA which occurs more than once in the genome sequence

  • Very common
    • Transposons (self-replicating genes)
    • Satellites (repetitive adjacent patterns)
    • Gene duplications (paralogs)

Three human children wearing similar shirts. One reads I was planned, one I was not, and the third Me neither.

21 / 37

Effect on Assembly

A genome with a repeat in two distinct locations is shown. Arrows point to the repeats being collapsed, and then the in-between bits being cut out of the sequence completely.

22 / 37

The law of repeats

A picture of the ocean with text reading "repeat after me".

It is impossible to resolve repeats of length S unless you have reads longer than S
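A toy experiment makes this law concrete. The two made-up "genomes" below contain the same repeat three times but with the segments between the copies swapped. The sketch assumes perfect, error-free reads taken from every position on one strand; the sequences and the names `all_reads`, `genome_1`, and `genome_2` are illustrative only.

```python
from collections import Counter


def all_reads(genome, read_length):
    """Multiset of every error-free read of the given length (one strand only)."""
    return Counter(
        genome[i:i + read_length]
        for i in range(len(genome) - read_length + 1)
    )


repeat = "GATTACA"  # S = 7
# Same pieces, but the segments between the repeat copies are swapped.
genome_1 = "CC" + repeat + "TTG" + repeat + "AAC" + repeat + "GG"
genome_2 = "CC" + repeat + "AAC" + repeat + "TTG" + repeat + "GG"

# Reads no longer than the repeat: both genomes give exactly the same reads,
# so no assembler could tell which arrangement is the true one.
print(all_reads(genome_1, 7) == all_reads(genome_2, 7))

# Reads that span the whole repeat plus a base on each side differ.
print(all_reads(genome_1, 9) == all_reads(genome_2, 9))
```

With 7 bp reads the two read sets are identical, while 9 bp reads include fragments that bridge a repeat copy and reveal what lies on either side of it.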

23 / 37

Scaffolding

24 / 37

Beyond contigs

Contig sizes are limited by:

  • the length of the repeats in your genome
    • Can't change this
  • the length (or "span") of the reads
    • Use long read technology
    • Use tricks with other technology
25 / 37

Types of reads

Example fragment

atcgtatgatcttgagattctctcttcccttatagctgctata

"Single-end" read

atcgtatgatcttgagattctctcttcccttatagctgctata

sequence one end of the fragment

"Paired-end" read

atcgtatgatcttgagattctctcttcccttatagctgctata

sequence both ends of the same fragment

We can exploit this information!
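The difference between the two read types can be sketched on the example fragment. This assumes the usual Illumina convention that the second read of a pair comes from the opposite strand, so it is the reverse complement of the fragment's far end; the slide itself does not state this, and the function names and the 10-base read length are chosen only for illustration.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Reverse complement of a lowercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def single_end(fragment, read_length):
    """Sequence one end of the fragment."""
    return fragment[:read_length]


def paired_end(fragment, read_length):
    """Sequence both ends of the same fragment.

    Assumption: read 2 is reported from the opposite strand, pointing
    back towards read 1.
    """
    read_1 = fragment[:read_length]
    read_2 = reverse_complement(fragment[-read_length:])
    return read_1, read_2


fragment = "atcgtatgatcttgagattctctcttcccttatagctgctata"
read_1, read_2 = paired_end(fragment, 10)
print(single_end(fragment, 10))
print(read_1, read_2)
# The unsequenced middle of the fragment has a roughly known length:
print(len(fragment) - 2 * 10)
```

The pair carries extra information that a single read lacks: the two sequences are known to lie a roughly known distance apart and on opposite strands.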

26 / 37

Scaffolding

  • Paired end reads

    • Known sequences at each end of fragment
    • Roughly known fragment length
  • Most ends will occur in same contig

  • Some will occur in different contigs

    • evidence that these contigs are linked
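The linking evidence can be counted with a small sketch. It assumes each read pair has already been mapped and reduced to the names of the contigs its two ends landed in; the function name `contig_links`, the `min_links` threshold, and the toy placements are hypothetical, and real scaffolders also use orientation and fragment length.

```python
from collections import Counter


def contig_links(pair_placements, min_links=2):
    """Count read pairs whose two ends fall in different contigs.

    pair_placements is a list of (contig_of_read_1, contig_of_read_2).
    Only contig pairs supported by at least min_links read pairs are kept,
    since a single stray pair is weak evidence.
    """
    links = Counter()
    for contig_1, contig_2 in pair_placements:
        if contig_1 != contig_2:
            links[tuple(sorted((contig_1, contig_2)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_links}


placements = [
    ("contig_1", "contig_1"),  # most pairs: both ends in the same contig
    ("contig_1", "contig_2"),
    ("contig_2", "contig_1"),
    ("contig_2", "contig_3"),  # a single pair, below the threshold
    ("contig_3", "contig_3"),
]
print(contig_links(placements))
```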
27 / 37

Contigs to Scaffolds

A scaffold with gaps as yellow boxes is shown. Above is a set of contigs and paired-end reads shown bridging the gaps.

28 / 37

Assessing assemblies

  • We desire

    • Total length similar to genome size
    • Fewer, larger contigs
    • Correct contigs
  • Metrics

    • No generally useful measure. (No real prior information)
    • Longest contigs, total base pairs in contigs, N50, ...
29 / 37

The "N50"

The length of the contig at which the running total of contig lengths, summed from shortest to longest, first passes 50% of all assembled bases

  • Imagine we have 7 contigs with lengths:

    • 1, 1, 3, 5, 8, 12, 20
  • Total

    • 1+1+3+5+8+12+20 = 50
  • The "halfway sum" is 50 / 2 = 25

    • 1+1+3+5+8 = 18 (<25)
    • 1+1+3+5+8+12 = 30 (>25) so N50 is 12
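The calculation can be written as a short function. Note that this sketch sums from the longest contig down, which is the convention most tools use, whereas the slide sums from the shortest up; for the slide's example both give the same answer. The function name `n50` is chosen for this example.

```python
def n50(lengths):
    """Length of the contig at which the running total, taken from the
    longest contig down, first reaches half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


contig_lengths = [1, 1, 3, 5, 8, 12, 20]
print(sum(contig_lengths))  # total bases in contigs
print(max(contig_lengths))  # longest contig
print(n50(contig_lengths))
```

Here 20 alone covers only 20 of the 50 bases, while 20 + 12 = 32 passes the halfway sum of 25, so the function returns 12.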
30 / 37

2 levels of assembly

  • Draft assembly

    • Will contain a number of non-linked scaffolds with gaps of unknown sequence
    • Fairly easy to get to
  • Closed (finished) assembly

    • One sequence for each chromosome
    • Takes a lot more work
    • Small genomes are becoming easier with long read tech
    • Large genomes are the province of big consortia (e.g. Human Genome Consortium)
31 / 37

How do I do it?

32 / 37

Example

  • Culture your bacterium
  • Extract your genomic DNA
  • Send it to your sequencing centre for Illumina sequencing
    • 250 bp paired-end
  • Get back 2 files
    • MRSA_R1.fastq.gz
    • MRSA_R2.fastq.gz
  • Now what?
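A sensible first step, before any assembly, is to check that both files are intact and contain the same number of reads. The sketch below uses only the Python standard library and assumes the common four-lines-per-record FASTQ layout; the function names are made up for this example, and in the tutorial itself the files are uploaded to Galaxy instead.

```python
import gzip


def fastq_records(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines.

    Assumes four lines per record: @name, sequence, '+', quality.
    """
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = next(it).strip()
        next(it)  # the '+' separator line
        quality = next(it).strip()
        yield header.strip()[1:], sequence, quality


def count_read_pairs(r1_path, r2_path):
    """Count reads in two gzipped FASTQ files and check that they match."""
    with gzip.open(r1_path, "rt") as r1, gzip.open(r2_path, "rt") as r2:
        n1 = sum(1 for _ in fastq_records(r1))
        n2 = sum(1 for _ in fastq_records(r2))
    if n1 != n2:
        raise ValueError(f"R1 has {n1} reads but R2 has {n2}")
    return n1


# Usage with the files from the slide (they must exist locally):
# print(count_read_pairs("MRSA_R1.fastq.gz", "MRSA_R2.fastq.gz"))
```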
33 / 37

Assembly tools

  • Genome
    • Velvet, VelvetOptimiser, SPAdes, ABySS, MIRA, Newbler, SGA, ALLPATHS, Ray, SOAPdenovo, ...
  • Meta-genome
    • MetaVelvet, SGA, custom scripts + above
  • Transcriptome
    • Trinity, Oases, Trans-ABySS

And many, many others...

34 / 37

Assembly Exercise #1

  • We will do a simple assembly using Velvet in Galaxy
  • We can do a number of different assemblies and compare some assembly metrics.
35 / 37

Key points

  • We assembled some Illumina fastq reads into contigs using a short read assembler called Velvet

  • We showed what effect one of the key assembly parameters, the k-mer size, has on the assembly

  • It looks as though there are some exploitable patterns in the metric data vs the k-mer size.
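To see why the k-mer size matters, recall that Velvet builds a de Bruijn graph from the k-mers in the reads. The toy sketch below is not Velvet's implementation: it uses a single made-up error-free read, and the names `kmers`, `de_bruijn`, and `branching_nodes` are assumptions for this example. It shows a short repeat ("TGC") creating a branch in the graph at a small k and disappearing at a larger one.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one possible next node (ambiguous paths)."""
    return sorted(node for node, nxt in graph.items() if len(nxt) > 1)


toy_reads = ["ATGCGTGCA"]  # contains "TGC" twice
for k in (3, 5):
    graph = de_bruijn(toy_reads, k)
    print(k, len(graph), branching_nodes(graph))
```

A larger k resolves more repeats but needs longer error-free overlaps between reads, so there is a trade-off rather than a single best value.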

36 / 37

Thank You!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!

Author(s): Simon Gladman
Reviewers: Helena Rasche, Nicola Soranzo, Saskia Hiltemann, Cristóbal Gallardo, William Durand, Niall Beard, Bérénice Batut
Galaxy Training Network

Tutorial content is licensed under the Creative Commons Attribution 4.0 International License.

37 / 37
