VGP assembly pipeline: Step by Step

Overview
Creative Commons License: CC-BY Questions:
  • What combination of tools can produce the highest quality assembly of vertebrate genomes?

  • How can we evaluate the quality of the assembly in a reference-free way?

Objectives:
  • Learn the tools necessary to perform a de novo assembly of a vertebrate genome

  • Evaluate the quality of the assembly

Requirements:
Time estimation: 5 hours
Level: Intermediate Intermediate
Supporting Materials:
Published: Jun 4, 2021
Last modification: Feb 9, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
purl PURL: https://gxy.io/GTN:T00039
rating Rating: 5.0 (1 recent ratings, 3 all time)
version Revision: 177

Advances in sequencing technologies over the last few decades have revolutionized the field of genomics, allowing for a reduction in both the time and resources required to de novo genome assembly. Until recently, second-generation sequencing technologies (also known as next generation sequencing (NGS)) produced highly accurate but short (up to 800bp) reads. Those read extension was not long enough to cope with the difficulties associated with repetitive regions. Today, so-called third-generation sequencing (TGS) technologies, usually known as single-molecule real-time (SMRT) sequencing, have become dominant in de novo assembly of large genomes. TGS can use native DNA without amplification, reducing sequencing error and bias (Hon et al. 2020, Giani et al. 2020). Very recently, Pacific Biosciences introduced high fidelity (HiFi) sequencing, which produces reads 10-25 kbp in length with a minimum accuracy of 99% (Q20). In this tutorial you will use HiFi reads in combination with data from additional sequencing technologies to generate a high-quality genome assembly.

Deciphering the structural organization of complex vertebrate genomes is currently one of the largest challenges in genomics (Frenkel et al. 2012). Despite the significant progress made in recent years, a key question remains: what combination of data and tools can produce the highest quality assembly? In order to adequately answer this question, it is necessary to analyse two of the main factors that determine the difficulty of genome assembly processes: repetitive content and heterozygosity.

Repetitive elements can be grouped into two categories: interspersed repeats, such as transposable elements (TE) that occur at multiple loci throughout the genome, and tandem repeats (TR) that occur at a single locus (Tørresen et al. 2019). Repetitive elements are an important component of eukaryotic genomes, constituting over a third of the genome in the case of mammals (Sotero-Caio et al. 2017, Chalopin et al. 2015). In the case of tandem repeats, various estimates suggest that they are present in at least one third of human protein sequences (Marcotte et al. 1999). TE content is among the main factors contributing to the lack of continuity in the reconstruction of genomes, especially in the case of large ones, as TE content is highly correlated with genome size (Sotero-Caio et al. 2017). On the other hand, TR usually lead to local genome assembly collapse, especially when their length is close to that of the reads (Tørresen et al. 2019).

Heterozygosity is also an important factor impacting genome assembly. Haplotype phasing, the identification of alleles that are co-located on the same chromosome, has become a fundamental problem in heterozygous and polyploid genome assemblies (Zhang et al. 2020). When no reference sequence is available, the state-of-the-art strategy consists of constructing a string graph with vertices representing reads and edges representing consistent overlaps. In this kind of graph, after transitive reduction, heterozygous alleles in the string graph are represented by bubbles. When combined with all-versus-all chromatin conformation capture (Hi-C) data, this approach allows complete diploid reconstruction (Angel et al. 2018, Zhang et al. 2020, Dida and Yi 2021).

The Genome 10K consortium (G10K) launched the Vertebrate Genome Project (Vertebrate Genomes Project (VGP)), whose goal is generating high-quality, near-error-free, gap-free, chromosome-level, haplotype-phased, annotated reference genome assemblies for every vertebrate species (Rhie et al. 2020). This tutorial will guide you step by step to assemble a high-quality genome using the VGP assembly pipeline, including multiple Quality Control (QC) evaluations.

Warning: Your results may differ!

Some of your results may slightly differ from the results shown in this tutorial, depending on the versions of the tools used, since algorithms can change between versions.

Agenda

In this tutorial, we will cover:

  1. Important terms to know
  2. VGP assembly pipeline overview
  3. Get data
    1. Uploading fasta datasets from Zenodo
    2. Uploading fastqsanger.gz datasets from Zenodo
    3. Organizing the data
  4. HiFi reads preprocessing with cutadapt
  5. Genome profile analysis
    1. Generation of k-mer spectra with Meryl
    2. Genome profiling with GenomeScope2
  6. Assembly with hifiasm
    1. hifiasm assembly modes
    2. Assembly evaluation
    3. HiC-phased assembly with hifiasm
    4. Pseudohaplotype assembly with hifiasm
    5. Purging the primary and alternate assemblies
    6. Purging with purge_dups
  7. Scaffolding
    1. Hybrid scaffolding with Bionano optical maps
    2. Evaluating Bionano scaffolds
  8. Hi-C scaffolding
    1. Pre-processing Hi-C data
    2. Generate initial Hi-C contact map
    3. YaHS scaffolding
    4. Evaluate the final genome assembly with Pretext
  9. Conclusion

Important terms to know

Before getting into the thick of things, let’s go over some terms you will often hear when learning about genome assembly. These concepts will be used often throughout this tutorial as well, so please refer to this section as necessary to help your understanding.

Pseudohaplotype assembly: A genome assembly that consists of long-phased haplotype blocks separated by regions where the haplotype cannot be distinguished (often homozygous regions). This can result in “switch errors”, when the parental haplotypes alternate along the same sequence. These types of assemblies are usually represented by a primary assembly and an alternate assembly. (This definition is largely taken from the NCBI’s Genome Assembly Model.)

Primary assembly: The primary assembly is traditionally the more complete representation of an individual’s genome and consists of homozygous regions and one set of loci for heterozygous regions. Because the primary assembly contains both homo- and heterozygous regions, it is more complete than the alternate assambly which often reports only the other set of loci for heterozygous regions. Thus, the primary assembly is usually what one would use for downstream analyses.

Alternate assembly: The alternate assembly consists of the alternate loci not represented in the primary assembly (heterozygous loci from the other haplotype). These types of sequences are often referred to as haplotigs. Traditionally, the alternate assembly is less complete compared to the primary assembly since homozygous regions are not represented.

Phasing: Phasing aims to partition the contigs for an individual according to the haplotype they are derived from. When possible, this is done by identifying parental alleles using read data from the parents. Locally, this is achieved using linkage information in long read datasets. Recent approaches have managed to phase using long-range Hi-C linkage information from the same individual (Cheng et al. 2021).

Assembly graph: A representation of the genome inferred from sequencing reads. Sequencing captures the genome as many fragmented pieces, instead of whole entire chromosomes at once (we eagerly await the day when this statement will be outdated!). The start of the assembly process pieces together these genome fragments to generate an assembly graph, which is a representation of the sequences and their overlaps. Visualizing assembly graphs can show where homozygous regions branch off into alternate paths on different haplotypes.

Unitig: Usually the smallest unit of an assembly graph, consistent with all the available sequencing data. A unitig is often constructed from an unambiguous path in the assembly graph where all the vertices have exactly one incoming and one outgoing edge, except the first vertex can have any number of incoming edges, while the last vertex can have any number of outgoing edges (Rahman and Medvedev 2022). In other words, the internal vertices in the unitig path can only be walked one way, so unitigs represent a path of confident sequence. In the assembly graph, unitig nodes can then have overlap edges with other unitigs.

Contig: A contiguous (i.e., gapless) sequence in an assembly, usually inferred algorithmically from the unitig graph.

False duplications: Assembly errors that result in one region of the genome being represented twice in the same assembly as two separate regions. Not to be confused with optical or technical duplicates from PCR from short-read sequencing. False duplications can further be classified as either haplotypic duplications or overlaps.

Haplotypic duplication can happen when a region that is heterozygous in the individual has the two haplotypes showing enough divergence that the assembler fails to interpret them as homologous. For example, say an individual is heterozygous in the region Chr1[1:100] and has Haplotype A from their mother and Haplotype B from their father; a false duplication can arise when the haplotypes are not recognized as being from the same region, and the assembler ends up placing both haplotypes in the same assembly, resulting in Chr1[1:100] being represented twice in one assembly. Ideally, a properly phased assembly would have Haplotype A in one assembly, e.g., the primary, while Haplotype B is in the alternate.

False duplications via overlaps result from unresolved overlaps in the assembly graph. In this case, two different contigs end up representing the same sequence, effectively duplicating it. Overlaps happen when a branching point in an assembly graph is resolved such that the contig before the vertex and a contig after the vertex share the same overlapping sequence.

Types of false duplication.Open image in new tab

Figure 1: Schematic of types of false duplication. Image adapted from Rhie et al. 2020.

Purging: Purging aims to remove false duplications, collapsed repeats, and very low support/coverage regions from an assembly. When performed on a primary assembly, the haplotigs are retained and typically placed in the alternate assembly.

Scaffold: A scaffold refers to one or more contigs separated by gap (unknown) sequence. Contigs are usually generated with the aid of additional information, such as Bionano optical maps, linked reads, Hi-C chromatin information, etc. The regions between contigs are usually of unknown sequence, thus they are represented by sequences of N’s. Gaps length in the sequence can be sized or arbitrary, depending on the technology used for scaffolding (e.g., optical maps can introduce sized gaps).

For more about the specific scaffolding technologies used in the VGP pipeline (currently Bionano optical maps and Hi-C chromatin conformation data), please refer to those specific sections within this tutorial.

HiFi reads: PacBio HiFi reads are the focus of this tutorial. First described in 2019, they have revolutionized genome assembly by combining long (about 10-20 kbp) read lengths with high accuracy (>Q20) typically associated with short-read sequencing (Wenger et al. 2019). These higher read lengths enable HiFi reads to traverse some repeat regions that are problematic to assemble with short reads.

Ultra-long reads: Ultra-long reads are typically defined as reads of over 100 kbp, and are usually generated using Oxford Nanopore Technology. Read quality is often lower than HiFi or Illumina (i.e., have a higher error rate), but they are often significantly longer than any other current sequencing technology, and can help assembly algorithms walk complex repeat regions in the assembly graphs.

Manual curation: This term refers to manually evaluating and manipulating an assembly based on the raw supporting evidence (e.g., using Hi-C contact map information). The user takes into account the original sequencing data to resolve potential misassemblies and missed joins.

Misassembly: Misassemblies are a type of assembly error that usually refers to any structural error in the genome reconstruction, .e.g., sequences that are not adjacent in the genome being placed next to each other in the sequence. Misassemblies can be potentially identified and remedied by manual curation.

Missed join: A missed join happens when two sequences are adjacent to each other in the genome but are not represented contiguously in the final sequence. Missed joins can be identified and remedied in manual curation with Hi-C data.

Telomere-to-telomere assembly: Often abbreviated as “T2T”, this term refers to an assembly where each chromosome is completely gapless from telomere to telomere. The term usually refers to the recently completed CHM13 human genome (Nurk et al. 2022), though there is an increasing number of efforts to generate T2T genomes for other species.

VGP assembly pipeline overview

The VGP assembly pipeline has a modular organization, consisting of ten workflows (Fig. 1). It can used with the following types of input data:

Input data Assembly quality Analysis trajectory
(Fig. 2)
HiFi The minimum requirement A
HiFi + HiC Better haplotype resolution (less haplotype switches) B
HiFi + BioNano Better contiguity C
HiFi + Hi-C + BioNano Even better contiguity D
HiFi + parental data Properly phased E
HiFi + parental data + Hi-C Better haplotype resolution F
HiFi + parental + BioNano Properly phased with improved contiguity G
HiFi + parental data + Hi-C + BioNano Properly phased with even more improved contiguity H

In this table, HiFi and Hi-C refer to HiFi and Hi-C data derived from the individual whose genome is being assembled. This tutorial assumes you are assembling the genome of one individual; there are special considerations necessary for pooled data that are not covered in this tutorial. HiFi and Hi-C are derived from the individual whose genome is being assembled. (Note: you can use Hi-C data from another individual of the same species to scaffold, but you cannot use that data to phase the contigs in hifiasm.) Parental data is high-coverage Illumina data derived from the parents of the individual being assembled, and is the key component of trio-based genome assembly. Each combination of input datasets is demonstrated in Fig. 2 by an analysis trajectory: a combination of workflows designed for generating the best assembly given a particular combination of inputs. These trajectories are listed in the table above and shown in the figure below.

The nine workflows of Galaxy assembly pipeline. The nine workflows of Galaxy assembly pipeline.
Open image in new tab

Figure 2: Eight analysis trajectories are possible depending on the combination of input data. A decision on whether or not to invoke Workflow 6 is based on the analysis of QC output of workflows 3, 4, or 5. Thicker lines connecting Workflows 7, 8, and 9 represent the fact that these workflows are invoked separately for each phased assembly (once for maternal and once for paternal).


The first stage of the pipeline is the generation of k-mer profiles of the raw reads to estimate genome size, heterozygosity, repetitiveness, and error rate. This is useful for getting an idea of the genome that lies within your reads, and is also useful for necessary for parameterizing downstream workflows. The generation of k-mer counts can be done from HiFi data only (Workflow 1), or include data from parental reads for trio-based phasing (Workflow 2), if one wants to generate k-mer spectra for the individual’s parents, as well. The second stage is contig assembly. In addition to using only HiFi reads (Workflow 3), the contig building (contiging) step can leverage Hi-C (Workflow 4) or parental read data (Workflow 5) to produce fully-phased haplotypes (hap1/hap2 or parental/maternal assigned haplotypes), using hifiasm. The contiging workflows also produce a number of critical quality control (QC) metrics such as k-mer multiplicity profiles. Inspection of these profiles provides information to decide whether the third stage—purging of false duplication—is required. Purging (Workflow 6), using purge_dups identifies and resolves haplotype-specific assembly segments incorrectly labeled as primary contigs, as well as heterozygous contig overlaps. This increases contiguity and the quality of the final assembly. The purging stage is generally unnecessary for trio data, as haplotype resolution is attained using set operations done on parental k-mers. The fourth stage, scaffolding, produces chromosome-level scaffolds using information provided by Bionano (Workflow 7, with Bionano Solve (optional)) and Hi-C (Workflow 8, with YaHS algorithms). A final stage of decontamination (Workflow 9) removes non-target sequences (e.g., contamination as well as mitochondrial sequences) from the scaffolded assembly. A separate workflow (WF0) is used for mitochondrial assembly.

Comment: A note on data quality

For diploids, we suggest at least 30✕ PacBio HiFi coverage & around 60✕ Hi-C coverage, and up to 60✕ HiFi coverage to accurately assemble highly repetitive regions.

This training has been organized into four main sections: genome profile analysis, assembly of HiFi reads with hifiasm, scaffolding with Bionano optical maps, and scaffolding with Hi-C data. Additionally, the assembly with hifiasm section has two possible paths in this tutorial: solo contiging or solo w/HiC contiging.

Throughout this tutorial, there will be detail boxes with additional background information on the science behind the sequencing technologies and software we use in the pipeline. These boxes are minimized by default, but please expand them to learn more about the data we utilize in this pipeline.

Get data

In order to reduce computation time, we will assemble samples from the yeast Saccharomyces cerevisiae S288C, a widely used laboratory strain isolated in the 1950s by Robert Mortimer. Using S. cerevisae, one of the most intensively studied eukaryotic model organisms, has the additional advantage of allowing us to evaluate the final result of our assembly with great precision. For this tutorial, we generated a set of synthetic HiFi reads corresponding to a theoretical diploid genome.

The synthetic HiFi reads were generated by using the S288C assembly as reference genome. With this objective we used HIsim, a HiFi shotgun simulator. The commands used are detailed below:

./FastK -T8 -Nsynthetic -k40 -t4 ./hifi_reads.fastq.gz
./FastK -T8 -Nsynthetic -k40 -p:synthetic.ktab .hifi_reads.fastq.gz
./Histex -G synthetic > synthetic.histogram
./GeneScopeFK.R -i synthetic.histogram -o ./synthetic -k 40 -p 1
./HImodel -v -osynthetic -g10:55 -e5 -T8 ./synthetic
./HIsim ./GCF_000146045.2_R64_genomic.fasta ./synthetic.model -p.2,.2 -c30 -r1 -U -f -h -e

The selected mutation rate was 2%. Note that HIsim generates the synthetic reads in FASTA format. This is perfectly fine for illustrating the workflow, but you should be aware that usually you will work with HiFi reads in FASTQ format.

The first step is to get the datasets from Zenodo. Specifically, we will be uploading two datasets:

  1. A set of PacBio HiFi reads in fasta format
  2. A set of Illumina Hi-C reads in fastqsanger.gz format

Uploading fasta datasets from Zenodo

The following two steps demonstrate how to upload three PacBio HiFi datasets into your Galaxy history.

Hands-on: Uploading FASTA datasets from Zenodo

Step 1: Create a new history for this tutorial

Click the new-history icon at the top of the history panel.

If the new-history is missing:

  1. Click on the galaxy-gear icon (History options) on the top of the history panel
  2. Select the option Create New from the menu

Step 2: Copy the following URLs into the clipboard.

  • you can do this by clicking on copy button in the right upper corner of the box below. It will appear if you mouse over the box.
   https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_01.fasta
   https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_02.fasta
   https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_03.fasta

Step 3: Upload datasets into Galaxy.

  • set the datatype to fasta
  • Copy the link location
  • Click galaxy-upload Upload Data at the top of the tool panel

  • Select galaxy-wf-edit Paste/Fetch Data
  • Paste the link(s) into the text field

  • Change Type (set all): from “Auto-detect” to fasta

  • Press Start

  • Close the window

Uploading fasta or fasta.gz datasets via URL.

UploadAnimatedPng

Uploading fastqsanger.gz datasets from Zenodo

Illumina Hi-C data is uploaded in essentially the same way as shown in the following two steps.

Warning: DANGER: Make sure you choose the correct format!

When selecting datatype in “Type (set all)” drop-down, make sure you select fastaqsanger or fastqsanger.gz BUT NOT fastqcssanger or anything else!

Hands-on: Uploading fastqsanger.gz datasets from Zenodo

Step 1: Copy the following URLs into the clipboard. You can do this by clicking on copy button in the right upper corner of the box below. It will appear if you mouse over the box.

   https://zenodo.org/record/5550653/files/SRR7126301_1.fastq.gz
   https://zenodo.org/record/5550653/files/SRR7126301_2.fastq.gz

Step 2: Upload datasets into Galaxy and set the datatype to fastqsanger.gz

  • Copy the link location
  • Click galaxy-upload Upload Data at the top of the tool panel

  • Select galaxy-wf-edit Paste/Fetch Data
  • Paste the link(s) into the text field

  • Change Type (set all): from “Auto-detect” to fastqsanger.gz

  • Press Start

  • Close the window

Uploading fastqsanger or fastqsanger.gz datasets via URL.

  1. Click on Upload Data on the top of the left panel:

    UploadDataButton

  2. Click on Paste/Fetch:

    PasteFetchButton

  3. Paste URL into text box that would appear:

    PasteFetchModal

  4. Set Type (set all) to fastqsanger or, if your data is compressed as in URLs above (they have .gz extensions), to fastqsanger.gz

    ChangeTypeDropDown:

Warning: Danger: Make sure you choose corect format!

When selecting datatype in “Type (set all)” dropdown, make sure you select fastaqsanger or fastqsanger.gz BUT NOT fastqcssanger or anything else!

UploadAnimatedPng

Warning: These datasets are large!

Hi-C datasets are large. It will take some time (~15 min) for them to be fully uploaded. Please, be patient.

Organizing the data

If everything goes smoothly your history will look like shown in the figure below. The three HiFi fasta files are better represented as a collection: {collection}. Also, importantly, the workflow we will be using for the analysis of our data takes collection as input (it does not access individual datasets). So let’s create a collection using steps outlined in the Tip tip “Creating a dataset collection”:

  • Click on galaxy-selector Select Items at the top of the history panel Select Items button
  • Check all the datasets in your history you would like to include
  • Click n of N selected and choose Build Dataset List

    build list collection menu item

  • Enter a name for your collection
  • Click Create List to build your collection
  • Click on the checkmark icon at the top of your history again

Creating a simple collection

The view of your history should transition from what is shown in the left pane below to what looks like the right pane:

AfterUpload.
Open image in new tab

Figure 3: History after uploading HiFi and HiC data (left). Creation of a list (collection) combines all HiFi datasets into a single history item called 'HiFi data' (right). See below for instructions on how to make this collection.

You can obviously upload your own datasets via URLs as illustrated above or from your own computer. In addition, you can upload data from a major repository called GenomeArk. GenomeArk is integrated directly into Galaxy Upload. To use GenomeArk follow the steps in the Tip tip below:

  1. Open the file galaxy-upload upload menu
  2. Click on Choose remote files tab
  3. Click on the Genome Ark button and then click on species

You can find the data by following this path: /species/${Genus}_${species}/${specimen_code}/genomic_data. Inside a given datatype directory (e.g. pacbio), select all the relevant files individually until all the desired files are highlighted and click the Ok button. Note that there may be multiple pages of files listed. Also note that you may not want every file listed.

HiFi reads preprocessing with cutadapt

Adapter trimming usually means trimming the adapter sequence off the ends of reads, which is where the adapter sequence is usually located in NGS reads. However, due to the nature of SMRT sequencing technology, adapters do not have a specific, predictable location in HiFi reads. Additionally, the reads containing adapter sequences could be of generally lower quality compared to the rest of the reads. Thus, we will use cutadapt not to trim, but to remove the entire read if a read is found to have an adapter inside of it.

PacBio HiFi reads rely on SMRT sequencing technology. SMRT is based on real-time imaging of fluorescently tagged nucleotides as they are added to a newly synthesized DNA strand. HiFi further uses multiple subreads from the same circular template to produce one highly accurate consensus sequence (fig. 2).

PacBio HiFi technology. Open image in new tab

Figure 4: HiFi reads are produced by calling consensus from subreads generated by multiple passes of the enzyme around a circularized template. This results in a HiFi read that is both long and accurate.

This technology allows to generate long-read sequencing data with read lengths in the range of 10-25 kb and minimum read consensus accuracy greater than 99% (Q20).

Hands-on: Primer removal with Cutadapt

Step 1: Run Cutadapt ( Galaxy version 4.4+galaxy0) with the following parameters:

  1. “Single-end or Paired-end reads?”: Single-end
  2. param-collection “FASTQ/A file”: HiFi_collection
  3. In “Read 1 Options”:
    • In “5’ or 3’ (Anywhere) Adapters”:
      • param-repeat “Insert 5’ or 3’ (Anywhere) Adapters”
        • “Source”: Enter custom sequence
          • “Enter custom 5’ or 3’ adapter name”: First adapter
          • “Enter custom 5’ or 3’ adapter sequence”: ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT
      • param-repeat “Insert 5’ or 3’ (Anywhere) Adapters”
        • “Source”: Enter custom sequence
          • “Enter custom 5’ or 3’ adapter name”: Second adapter
          • “Enter custom 5’ or 3’ adapter sequence”: ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT
  4. In “Adapter Options”:
    • “Maximum error rate”: 0.1
    • “Minimum overlap length”: 35
    • “Look for adapters in the reverse complement”: Yes
  5. In “Filter Options”:
    • “Discard Trimmed Reads”: Yes
  1. Click on param-collection Dataset collection in front of the input parameter you want to supply the collection to.
  2. Select the collection you want to use from the list


Step 2: Rename the output file as HiFi_collection (trimmed).

  • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
  • In the central panel, change the Name field
  • Click the Save button

Genome profile analysis

Before starting a de novo genome assembly project, it is useful to collect metrics on the properties of the genome under consideration, such as the expected genome size, so that you know what to expect from your assembly. Traditionally, DNA flow cytometry was considered the golden standard for estimating the genome size. Nowadays, experimental methods have been replaced by computational approaches (Wang et al. 2020). One of the widely used genome profiling methods is based on the analysis of k-mer frequencies. It allows one to provide information not only about the genomic complexity, such as the genome size and levels of heterozygosity and repeat content, but also about the data quality.

K-mers are unique substrings of length k contained within a DNA sequence. For example, the DNA sequence TCGATCACA can be decomposed into five unique k-mers that have five bases long: TCGAT, CGATC, GATCA, ATCAC and TCACA. A sequence of length L will have L-k+1 k-mers. On the other hand, the number of possible k-mers can be calculated as nk, where n is the number of possible monomers and k is the k-mer size.

Bases k-mer size Total possible k-mers
4 1 4
4 2 16
4 3 64
4
4 10 1.048.576

Thus, the k-mer size is a key parameter, which must be large enough to map uniquely to the genome, but not too large, since it can lead to wasting computational resources. In the case of the human genome, k-mers of 31 bases in length lead to 96.96% of unique k-mers.

Each unique k-mer can be assigned a value for coverage based on the number of times it occurs in a sequence, whose distribution will approximate a Poisson distribution, with the peak corresponding to the average genome sequencing depth. From the genome coverage, the genome size can be easily computed.

Generation of k-mer spectra with Meryl

Meryl will allow us to generate the k-mer profile by decomposing the sequencing data into k-length substrings, counting the occurrence of each k-mer and determining its frequency. The original version of Meryl was developed for the Celera Assembler. The current Meryl version comprises three main modules: one for generating k-mer databases, one for filtering and combining databases, and one for searching databases. K-mers are stored in lexicographical order in the database, similar to words in a dictionary (Rhie et al. 2020).

Comment: k-mer size estimation

Given an estimated genome size (G) and a tolerable collision rate (p), an appropriate k can be computed as \(k = \log_4\left(\frac{G(1-p)}{p}\right)\) .

Hands-on: Generate k-mers count distribution

Step 1: Run Meryl ( Galaxy version 1.3+galaxy6) with the following parameters:

  1. “Operation type selector”: Count operations
  2. “Count operations”: Count: count the occurrences of canonical k-mers
  3. param-collection “Input sequences”: HiFi_collection (trim)
  4. “k-mer size selector”: Set a k-mer size
  5. k-mer size”: 31
Comment: Selection of k-mer size

We used 31 as k-mer size, as this length has demonstrated to be sufficiently long that most k-mers are not repetitive and is short enough to be more robust to sequencing errors. For very large (haploid size > 10 Gb) and/or very repetitive genomes, larger k-mer length is recommended to increase the number of unique k-mers.


Step 2: Rename output as meryldb

Step 3: Run Meryl ( Galaxy version 1.3+galaxy6) again with the following parameters:

  1. “Operation type selector”: Operations on sets of *k*-mers
  2. “Operations on sets of k-mers”: Union-sum: return k-mers that occur in any input, set the count to the sum of the counts
  3. param-file “Input meryldb”: Collection meryldb

Step 4: Rename it as Merged meryldb

Step 5: Run Meryl ( Galaxy version 1.3+galaxy6) for the third time with the following parameters:

  • “Operation type selector”: Generate histogram dataset
    • param-file “Input meryldb”: Merged meryldb

Step 6: Finally, rename it as meryldb histogram.

Genome profiling with GenomeScope2

The next step is to infer the genome properties from the k-mer histogram generated by Meryl, for which we will use GenomeScope2. GenomeScope2 relies on a nonlinear least-squares optimization to fit a mixture of negative binomial distributions, generating estimated values for genome size, repetitiveness, and heterozygosity rates (Ranallo-Benavidez et al. 2020).

Hands-on: Estimate genome properties

Run GenomeScope ( Galaxy version 2.0+galaxy2) with the following parameters:

  1. param-file “Input histogram file”: meryldb histogram
  2. Ploidy for model to use: 2
  3. “k-mer length used to calculate k-mer spectra”: 31
  4. In “Output options”: mark Summary of the analysis
  5. In “Advanced options”:
  6. “Create testing.tsv file with model parameters”: Yes

Genomescope will generate six outputs:

  • Plots:
    • Linear plot: k-mer spectra and fitted models: frequency (y-axis) versus coverage.
    • Log plot: logarithmic transformation of the previous plot.
    • Transformed linear plot: k-mer spectra and fitted models: frequency times coverage (y-axis) versus coverage (x-axis). This transformation increases the heights of higher-order peaks, overcoming the effect of high heterozygosity.
    • Transformed log plot: logarithmic transformation of the previous plot.
  • Model: this file includes a detailed report about the model fitting.
  • Summary: it includes the properties inferred from the model, such as genome haploid length and the percentage of heterozygosity.

Now, let’s analyze the k-mer profiles, fitted models and estimated parameters shown below:

Genomescope plot. Open image in new tab

Figure 5: GenomeScope2 31-mer profile. The first peak located at coverage 25✕ corresponds to the heterozygous peak. The second peak at coverage 50✕, corresponds to the homozygous peak. Estimate of the heterozygous portion is 0.576%. The plot also includes information about the inferred total genome length (len), genome unique length percent ('uniq'), overall heterozygosity rate ('ab'), mean *k*-mer coverage for heterozygous bases ('kcov'), read error rate ('err'), and average rate of read duplications ('dup'). It also reports the user-given parameters of *k*-mer size ('k') and ploidy ('p').

This distribution is the result of the Poisson process underlying the generation of sequencing reads. As we can see, the k-mer profile follows a bimodal distribution, indicative of a diploid genome. The distribution is consistent with the theoretical diploid model (model fit > 93%). Low frequency k-mers are the result of sequencing errors. GenomeScope2 estimated a haploid genome size is around 11.7 Mb, a value reasonably close to Saccharomyces genome size. Additionally, it revealed that the variation across the genomic sequences is 0.576%.

Comment: Are you expecting to purge your assembly?

This tutorial covers purging using the program purge_dups. purge_dups has some default options and can try to detect coverage-based cutoffs automatically, but the VGP pipeline prefers to define these cutoffs using parameters derived from the GenomeScope2 output.

If you expect you need to purge your genome, please see the solo contiging section of the tutorial for details on parsing the GenomeScope2 output for purging cutoffs.

Assembly with hifiasm

Once we have finished the genome profiling stage, we can start the genome assembly with hifiasm, a fast open-source de novo assembler specifically developed for PacBio HiFi reads. One of the key advantages of hifiasm is that it allows us to resolve near-identical, but not exactly identical, sequences, such as repeats and segmental duplications (Cheng et al. 2021).

By default hifiasm performs three rounds of haplotype-aware error correction to correct sequence errors but keeping heterozygous alleles. A position on the target read to be corrected is considered informative if there are two different nucleotides at that position in the alignment, and each allele is supported by at least three reads.

Hifiasm algorithm overview. Open image in new tab

Figure 6: Hifiasm algorithm overview. Orange and blue bars represent the reads with heterozygous alleles carrying local phasing information, while green bars come from the homozygous regions without any heterozygous alleles.

Then, hifiasm builds a phased assembly string graph with local phasing information from the corrected reads. Only the reads coming from the same haplotype are connected in the phased assembly graph. After transitive reduction, a pair of heterozygous alleles is represented by a bubble in the string graph. If there is no additional data, hifiasm arbitrarily selects one side of each bubble and outputs a primary assembly. In the case of a heterozygous genome, the primary assembly generated at this step may still retain haplotigs from the alternate allele.

The output of hifiasm will be GFA files. These differ from FASTA files in that they are a representation of the assembly graph instead of just linear sequences, so the GFA contains information about sequences, nodes, and edges (i.e., overlaps). This output preserves the most information about how the reads assemble in graph space, and is useful to visualize in tools such as Bandage; however, our QV tools will expect FASTA files, so we will cover the GFA to FASTA conversion step later.

hifiasm assembly modes

Hifiasm can be run in multiple modes depending on data availability

Solo mode

Solo: generates a pseudohaplotype assembly, resulting in a primary & an alternate assembly.

  • Input: PacBio HiFi reads
  • Output: scaffolded primary assembly, and alternate contigs
Diagram for hifiasm solo mode.Open image in new tab

Figure 7: The solo pipeline creates primary and alternate contigs, which then typically undergo purging with purge_dups to reconcile the haplotypes. During the purging process, haplotigs are removed from the primary assembly and added to the alternate assembly, which is then purged to generate the final alternate set of contigs. The purged primary contigs are then carried through scaffolding with Bionano and/or Hi-C data, resulting in one final draft primary assembly to be sent to manual curation.

Hi-C phased mode

Hi-C-phased: generates a hap1 assembly and a hap2 assembly, which are phased using the Hi-C reads from the same individual.

  • Input: PacBio HiFi & Illumina HiC reads
  • Output: scaffolded hap1 assembly, and scaffolded hap2 assembly (assuming you run the scaffolding on both haplotypes)
Diagram for hifiasm hic mode.Open image in new tab

Figure 8: The Hi-C-phased mode produces hap1 and hap2 contigs, which have been phased using the HiC information as described in Cheng et al. 2021. Typically, these assemblies do not need to undergo purging, but you should always look at your assemblies' QC to make sure. These contigs are then scaffolded separately using Bionano and/or Hi-C workflows, resulting in two scaffolded assemblies.

Trio mode

Trio: generates a maternal assembly and a paternal assembly, which are phased using reads from the parents.

  • Input: PacBio HiFi reads from child, Illumina reads from both parents.
  • Output: scaffolded maternal assembly, and scaffolded paternal assembly (assuming you run the scaffolding on both haplotypes)
Diagram for hifiasm trio mode.Open image in new tab

Figure 9: The trio mode produces maternal and paternal contigs, which have been phased using paternal short read data. Typically, these assemblies do not need to undergo purging, but you should always look at your assemblies' QC to make sure. These contigs are then scaffolded separately using Bionano and/or Hi-C workflows, resulting in two scaffolded assemblies.

No matter which way you run hifiasm, you will have to evaluate the assemblies’ QC to ensure your genome is in good shape. The VGP pipeline features several reference-free ways of evaluating assembly quality, all of which are automatically generated with our workflows; however, we will run them manually in this tutorial so we can familiarize ourselves with how each QC metric captures a different aspect of assembly quality.

Assembly evaluation

There are several tools for assessing various aspects of assembly quality:

  • gfastats: manipulation & evaluation of assembly graphs and FASTA files, particularly used for summary statistics (e.g., contig count, N50, NG50, etc.) (Formenti et al. 2022).
Schematic of N50 calculation.Open image in new tab

Figure 10: N50 is a commonly reported statistic used to represent genome contiguity. N50 is calculated by sorting contigs according to their lengths, and then taking the halfway point of the total genome length. The size of the contig at that halfway point is the N50 value. In the pictured example, the total genome length is 400 bp, so the N50 value is 60 because the contig at the halfway point is 60 bp long. N50 can be interpreted as the value where >50% of an assembly's contigs are at that value or higher. Image adapted from Elin Videvall at The Molecular Ecologist.
  • Benchmarking Universal Single-Copy Orthologs (BUSCO): assesses completeness of a genome from an evolutionarily informed functional point of view. BUSCO genes are genes that are expected to be present at single-copy in one haplotype for a certain clade, so their presence, absence, or duplication can inform scientists about if an assembly is likely missing important regions, or if it has multiple copies of them, which can indicate a need for purging (Simão et al. 2015).
  • Merqury: reference-free assessment of assembly completeness and phasing based on k-mers. Merqury compares k-mers in the reads to the k-mers found in the assemblies, as well as the copy number (CN) of each k-mer in the assemblies (Rhie et al. 2020).
Hands-on: Choose Your Own Tutorial

This is a "Choose Your Own Tutorial" section, where you can select between multiple paths. Click one of the buttons below to select how you want to follow the tutorial

Use the following buttons to switch between contigging approaches. If you are assembling with only HiFi reads for an individual, then click solo. If you have HiC reads for the same indiviudal, then click hic. NOTE: If you want to learn more about purging, then please check out the solo tutorial for details on purging false duplications.

HiC-phased assembly with hifiasm

If you have the Hi-C data for the individual you are assembling with HiFi reads, then you can use that information to phase the {contigs}.

Hands-on: Hi-C-phased assembly with hifiasm

Step 1: Run Hifiasm ( Galaxy version 0.19.8+galaxy0) with the following parameters:

  1. “Assembly mode”: Standard
  2. param-file “Input reads”: HiFi_collection (trim) (output of Cutadapt tool)
  3. In “Options for Hi-C-partition” select Specify
    • “Hi-C R1 reads”: Hi-C_dataset_F
    • “Hi-C R2 reads”: Hi-C_dataset_R

Step 2:. After the tool has finished running, rename its outputs as follows:

  1. Rename the Hi-C hap1 balanced contig graph as Hap1 contigs graph and add a #hap1 tag
  2. Rename the Hi-C hap2 balanced contig graph as Hap2 contigs graph and add a #hap2 tag

We have obtained the fully phased contig graphs (as Graphical Fragment Assembly (GFA) files) of hap1 and hap2, but these must be converted to FASTA format for subsequent steps. We will use a tool developed from the VGP: gfastats. gfastats is a tool suite that allows for manipulation and evaluation of FASTA and GFA files, but in this instance we will use it to convert our GFAs to FASTA files. Later on, we will use it to generate standard summary statistics for our assemblies.

Hands-on: GFA to FASTA conversion for hifiasm Hi-C assembly

Step 1: Run gfastats ( Galaxy version 1.3.6+galaxy0) with the following parameters:

  1. param-files “Input GFA file”: select Hap1 contigs graph and the Hap2 contigs graph datasets
  1. Click on param-files Multiple datasets
  2. Select several files by keeping the Ctrl (or COMMAND) key pressed and clicking on the files of interest
  1. “Tool mode”: Genome assembly manipulation
  2. “Output format”: FASTA
  3. “Generates the initial set of paths”: toggle to yes

Step 2: Rename the outputs as Hap1 contigs FASTA and Hap2 contigs FASTA

Comment: Summary statistics

gfastats will provide us with the following statistics:

  • No. of contigs: The total number of contigs in the assembly.
  • Largest contig: The length of the largest contig in the assembly.
  • Total length: The total number of bases in the assembly.
  • Nx: The largest contig length, L, such that contigs of length > L account for at least x% of the bases of the assembly.
  • NGx: The contig length such that using equal or longer length contigs produces x% of the length of the reference genome, rather than x% of the assembly length.
  • GC content: the percentage of nitrogenous bases which are either guanine or cytosine.

Let’s use gfastats to get a basic idea of what our assembly looks like. We’ll run gfastats on the GFA files because gfastats can report graph-specific statistics as well. After generating the stats, we’ll be doing some text manipulations to get the stats side-by-side in a pretty fashion.

Hands-on: Assembly evaluation with gfastats

Step 1: Run assembly statistics generation with gfastats ( Galaxy version 1.3.6+galaxy0) using the following parameters:

  1. param-files “Input file”: select Hap1 contigs graph and the Hap2 contigs graph datasets
  2. *“Tool mode”: Summary statistics generation
  3. “Expected genome size”: 11747160 (remember we calculated this value earlier using GenomeScope2. It is contained within GenomeScope2 Summary output that should be in your history!)
  4. “Thousands separator in output”: Set to “No”


Step 2: Rename outputs of gfastats step to as Hap1 stats and Hap2 stats

This would generate summary files that look like this (only the first six rows are shown):

Expected genome size    11747160
# scaffolds                    0
Total scaffold length          0
Average scaffold length      nan
Scaffold N50                   0
Scaffold auN                0.00

Because we ran gfastats on hap1 and hap2 outputs of hifiasm we need to join the two outputs together for easier interpretation:


Step 3: Run Column join ( Galaxy version 0.0.3) with the following parameters:

param-files “Input file”: select Hap1 stats and the Hap2 stats datasets. Keep all other settings as they are.


Step 4: Rename the output as gfastats on hap1 and hap2 (full)

This would generate a joined summary file that looks like this (only the first five rows are shown):

# gaps               0  0
# gaps in scaffolds  0  0
# paths              0  0
# segments          17 16

Now let’s extract only relevant information by excluding all lines containing the word scaffold since there are no scaffolds at this stage of the assembly process (only contigs):


Step 5: Run Search in textfiles ( Galaxy version 1.1.1) with the following parameters:

  1. param-files “Input file”: select gfastats on hap1 and hap2 (full)
  2. “that”: Don't Match
  3. “Type of regex”: Basic
  4. “Regular Expression”: enter the word scaffold
  5. “Match type”: leave as case insensitive


Step 6: Rename the output as gfastats on hap1 and hap2 contigs

Take a look at the gfastats on hap1 and hap2 contigs output — it has three columns:

  1. Name of statistic
  2. Value for haplotype 1 (hap1)
  3. Value for haplotype 2 (hap2)

According to the report, both assemblies are quite similar; the hap1 assembly includes 16 contigs, totalling ~11.3Mbp of sequence (the Total contig length statistic), while the hap2 assembly includes 17 contigs, whose total length is ~12.2Mbp. (NB: Your values may differ slightly, or be reversed between the two haplotypes!)

Question
  1. What is the length of the longest contigs in the assemblies?
  2. What are the N50 values of the two assemblies? Are they very different from each other?
  1. One assembly’s longest contig is 1,532,843 bp, and the other one’s is 1,531,728 bp.
  2. One assembly has a N50 of 922,430 and the other’s is 923,452. These are pretty close to each other!

Next, we will use BUSCO, which will provide quantitative assessment of the completeness of a genome assembly in terms of expected gene content. It relies on the analysis of genes that should be present only once in a complete assembly or gene set, while allowing for rare gene duplications or losses (Simão et al. 2015).

Hands-on: Assessing assembly completeness with BUSCO

Step 1: Run Busco ( Galaxy version 5.5.0+galaxy0) with the following parameters:

  1. param-files “Sequences to analyze”: Hap1 contigs FASTA and Hap2 contigs FASTA
  2. “Lineage data source”: Use cached lineage data
  3. “Cached database with lineage”: Busco v5 Lineage Datasets
  4. “Mode”: Genome assemblies (DNA)
  5. “Use Augustus instead of Metaeuk”: Use Metaeuk
  6. “Auto-detect or select lineage?”: Select lineage
  7. “Lineage”: Saccharomycetes
  8. “Which outputs should be generated”: short summary text and summary image
Comment

Remember to modify the “Lineage” option if you are working with vertebrate genomes.


Step 2: Rename the outputs as BUSCO hap1 and BUSCO hap2.

We have asked BUSCO to generate two particular outputs: the short summary, and a summary image.

BUSCO for hap1 & hap2.BUSCO for hap1 & hap2.
Open image in new tab

Figure 11: BUSCO results for hap1 and hap2. Each haplotype is showing the summary image output as well as the short summary output. The summary image gives a good overall idea of the status of BUSCO genes within the assembly, while the short summary lists these as percentages as well. In our case, neither assembly seems to have duplicated BUSCO genes (there is a very low amount of dark blue in the summary images).
Question
  1. How many complete and single copy BUSCO genes have been identified in the hap1 assembly? What percentage of the total BUSCO gene set is that?
  2. How many BUSCO genes are absent in the hap1 assembly?
  1. Hap1 has 2,047 complete and single-copy BUSCO genes, which is 95.8% of the gene set.
  2. 29 BUSCO genes are missing.

Despite BUSCO being robust for species that have been widely studied, it can be inaccurate when the newly assembled genome belongs to a taxonomic group that is not well represented in OrthoDB. Merqury provides a complementary approach for assessing genome assembly quality metrics in a reference-free manner via k-mer copy number analysis. Let’s run Merqury evaluation as shown below.

Hands-on: k-mer based evaluation with Merqury

Run Merqury ( Galaxy version 1.3+galaxy3) with the following parameters:

  1. “Evaluation mode”: Default mode
  2. param-file “k-mer counts database”: Merged meryldb
  3. “Number of assemblies”: Two assemblies
  4. param-file “First genome assembly”: Hap1 contigs FASTA
  5. param-file “Second genome assembly”: Hap2 contigs FASTA

By default, Merqury generates three collections as output: stats, plots and consensus accuracy quality value (QV) stats. The “stats” collection contains the completeness statistics, while the “QV stats” collection contains the quality value statistics. Let’s have a look at the assembly CN spectrum plot, known as the spectra-cn.fl plot:

Merqury spectra-cn plot for the hap1/hap2 assemblies.Open image in new tab

Figure 12: Merqury CN plot. This plot tracks the multiplicity of each k-mer found in the HiFi read set and colors it by the number of times it is found in a given assembly. Merqury connects the midpoint of each histogram bin with a line, giving the illusion of a smooth curve.

The grey region in the left side corresponds to k-mers found only in the read set; it is usually indicative of sequencing error in the read set, although it can also be a result of missing sequences in the assembly. The red area represents one-copy k-mers in the genome, while the blue area represents two-copy k-mers originating from homozygous sequences or haplotype-specific duplications. From this figure, we can state that the diploid sequencing coverage is around 50✕, which we also know from the GenomeScope2 plot we looked at earlier.

To get an idea of how the k-mers have been distributed between our hap1 and hap2 assemblies, we should look at the spectra-asm.fl output of Merqury.

Merqury spectra-asm plot for the hap1/hap2 assemblies.Open image in new tab

Figure 13: Merqury ASM plot. This plot tracks the multiplicity of each k-mer found in the HiFi read set and colors it according to which assemblies contain those k-mers. This can tell you which k-mers are found in only one assembly or shared between them.

The large green peak is centered at 50✕ coverage (remember that’s our diploid coverage!), indicating that k-mers suggested by the reads to be from diploid regions are in fact shared between the two assemblies, as they should be if they are from homozygous regions. The haploid coverage k-mers (around ~25✕ coverage) are split between hap1 and hap2 assemblies, somewhat unevenly but still not as bad as it would be in an assembly without phasing data.

Pseudohaplotype assembly with hifiasm

When hifiasm is run without any additional phasing data, it will do its best to generate a pseudohaplotype primary/alternate set of assemblies. These assemblies will typically contain more contigs that switch between parental blocks. Because of this, the primary assembly generated with this method can have a higher \(N50\) value than an assembly generated with haplotype-phasing, but the contigs will contain more switch errors.

Hands-on: Pseudohaplotype assembly with hifiasm

Step 1: Run Hifiasm ( Galaxy version 0.19.8+galaxy0) with the following parameters:

  1. “Assembly mode”: Standard
  2. param-file “Input reads”: HiFi_collection (trim) (output of Cutadapt tool)
  3. “Options for purging duplicates”: Specify
  4. “Purge level”: Light (1)
Comment: A note on hifiasm purging level

hifiasm has an internal purging function, which we have set to Light here. The VGP pipeline currently disables the hifiasm internal purging, in favor of using the purge_dups suite after the fact in order to have more control over the parameters used for purging.


Step 2: After the tool has finished running, rename its outputs as follows:

  1. Rename the primary assembly contig graph for pseudohaplotype assembly as Primary contigs graph and add a #pri tag
  2. Rename the alternate assembly contig graph for pseudohaplotype assembly as Alternate contigs graph and add a #alt tag

We have obtained the primary and alternate contig graphs (as GFA files), but these must be converted to FASTA format for subsequent steps. We will use a tool developed from the VGP: gfastats. gfastats is a tool suite that allows for manipulation and evaluation of FASTA and GFA files, but in this instance we will use it to convert our GFAs to FASTA files. Later on we will use it to generate standard summary statistics for our assemblies.

Hands-on: convert GFA to FASTA

Below we start two gfastats jobs by selecting two input datasets. To do this:

  1. Click on param-files Multiple datasets
  2. Select several files by keeping the Ctrl (or COMMAND) key pressed and clicking on the files of interest


Step 1: Run gfastats ( Galaxy version 1.3.6+galaxy0) with the following parameters:

  1. param-files “Input GFA file”: select Primary contigs graph and the Alternate contigs graph datasets
  2. “Tool mode”: Genome assembly manipulation
  3. “Output format”: FASTA
  4. “Generates the initial set of paths”: toggle to yes

Step 2: Rename the outputs as Primary contigs FASTA and Alternate contigs FASTA

Comment: Summary statistics

gfastats will provide us with the following statistics:

  • No. of contigs: The total number of contigs in the assembly.
  • Largest contig: The length of the largest contig in the assembly.
  • Total length: The total number of bases in the assembly.
  • Nx: The largest contig length, L, such that contigs of length >= L account for at least x% of the bases of the assembly.
  • NGx: The contig length such that using equal or longer length contigs produces x% of the length of the reference genome, rather than x% of the assembly length.
  • GC content: the percentage of nitrogenous bases which are either guanine or cytosine.

Let’s use gfastats to get a basic idea of what our assembly looks like. We’ll run gfastats on the GFA files because gfastats can report graph-specific statistics as well. After generating the stats, we’ll be doing some text manipulation to get the stats side-by-side in a pretty fashion.

Hands-on: Assembly evaluation with gfastats

Step 1: Run assembly statistics generation with gfastats ( Galaxy version 1.3.6+galaxy0) using the following parameters:

  1. param-files “Input file”: select Primary contigs graph and the Alternate contigs graph datasets
  2. *“Tool mode”: Summary statistics generation
  3. “Expected genome size”: 11747160 (remember we calculated this value earlier using GenomeScope2. It is contained within GenomeScope2 Summary output that should be in your history!)
  4. “Thousands separator in output”: Set to “No”
  5. “Generates the initial set of paths”: toggle to yes


Step 2: Rename outputs of gfastats step to as Primary stats and Alternate stats

This would generate summary files that look like this (only the first six rows are shown):

Expected genome size     11747160
# scaffolds                    25
Total scaffold length    18519764
Average scaffold length    740790.56
Scaffold N50               813311
Scaffold auN               913050.77

Because we ran gfastats on Primary and Alternate outputs of hifiasm we need to join the two outputs together for easier interpretation:


Step 3: Run Column join ( Galaxy version 0.0.3) with the following parameters:

param-files “Input file”: select Primary stats and the Alternate stats datasets (these are from Step 2 above). Keep all other setting as they are.


Step 4: Rename the output as gfastats on Pri and Alt (full)

This would generate a joined summary file that looks like this (only five rows are shown):

# contigs                 25  10
# dead ends                .  16
# disconnected components  .   7
# edges                    .   6
# gaps                     0   0

Now let’s extract only relevant information by excluding all lines containing the word scaffold since there are no scaffolds at this stage of the assembly process (only contigs):


Step 5: Run Search in textfiles ( Galaxy version 1.1.1) with the following parameters:

  1. param-files “Input file”: select gfastats on Pri and Alt (full)
  2. “that”: Don't Match
  3. “Type of regex”: Basic
  4. “Regular Expression”: enter the word scaffold
  5. “Match type”: leave as case insensitive


Step 6: Rename the output as gfastats on Pri and Alt contigs

Take a look at the gfastats on Pri and Alt contigs output — it has three columns:

  1. Name of statistic
  2. Value for haplotype 1 (Pri)
  3. Value for haplotype 2 (Alt)

The report makes it clear that the two assemblies are markedly uneven: the primary assembly has 25 contigs totalling ~18.5 Mbp, while the alternate assembly has 8 contigs totalling only about 4.95 Mbp. If you’ll remember that our estimated genome size is ~11.7 Mbp, then you’ll see that the primary assembly has almost 2/3 more sequence than expected for a haploid representation of the genome! This is because a lot of heterozygous regions have had both copies of those loci placed into the primary assembly, as a result of incomplete purging. The presence of false duplications can be confirmed by looking at BUSCO and Merqury results.

Question
  1. What is the length of the longest contigs in the assemblies?
  2. What are the N50 values of the two assemblies? Are they very different from each other? What might explain the difference between the two?
  1. The primary’s longest contig is 1,532,843 bp while the alternate’s longest contig is 1,090,521.
  2. The primary assembly has a N50 of 813,311, while the alternate’s is 1,077,964. These are not that far apart. The alternate might have a bigger N50 due to having fewer contigs.

Next, we will use BUSCO, which will provide quantitative assessment of the completeness of a genome assembly in terms of expected gene content. It relies on the analysis of genes that should be present only once in a complete assembly or gene set, while allowing for rare gene duplications or losses (Simão et al. 2015).

Hands-on: Assessing assembly completeness with BUSCO

Step 1: Run Busco ( Galaxy version 5.5.0+galaxy0) with the following parameters:

  1. param-files “Sequences to analyze”: Primary contigs FASTA and Alternate contigs FASTA
  2. “Lineage data source”: Use cached lineage data
  3. “Cached database with lineage”: Busco v5 Lineage Datasets
  4. “Mode”: Genome assemblies (DNA)
  5. “Use Augustus instead of Metaeuk”: Use Metaeuk
  6. “Auto-detect or select lineage?”: Select lineage
  7. “Lineage”: Saccharomycetes
  8. “Which outputs should be generated”: short summary text and summary image
Comment

Remember to modify the “Lineage” option if you are working with vertebrate genomes.


Step 2: Rename the outputs as BUSCO Pri and BUSCO Alt.

We have asked BUSCO to generate two particular outputs: the short summary, and a summary image.

BUSCO for primary contigs.
Open image in new tab

Figure 14: BUSCO results for primary and alternate contigs. The summary image (left) gives a good overall idea of the status of BUSCO genes within the assembly, while the short summary (right) lists these as percentages as well. In this case, this primary assembly seems to have a large amount of duplicated BUSCO genes, but is otherwise complete (i.e., not much missing content).

The BUSCO results support our hypothesis that the primary assembly is so much larger than expected due to improper purging, resulting in false duplications.

Question
  1. How many complete and single copy BUSCO genes have been identified in the pri assembly? What percentage of the total BUSCO gene set is that?
  2. How many complete and duplicated BUSCO genes have been identified in the pri assembly? What percentage of the total BUSCO gene set is that?
  3. What is the percentage of complete BUSCO genes for the primary assembly?
  1. The primary assembly has 1159 complete and single-copy BUSCO genes, which is 54.2% of the gene set.
  2. The primary assembly has 927 complete and duplicated BUSCO genes, which is 43.4% of the gene set.
  3. 97.6% of BUSCOs are complete in the assembly.

Despite BUSCO being robust for species that have been widely studied, it can be inaccurate when the newly assembled genome belongs to a taxonomic group that is not well represented in OrthoDB. Merqury provides a complementary approach for assessing genome assembly quality metrics in a reference-free manner via k-mer copy number analysis.

Hands-on: k-mer based evaluation with Merqury

Run Merqury ( Galaxy version 1.3+galaxy3) with the following parameters:

  1. “Evaluation mode”: Default mode
  2. param-file “k-mer counts database”: Merged meryldb
  3. “Number of assemblies”: Two assemblies
  4. param-file “First genome assembly”: Primary contigs FASTA
  5. param-file “Second genome assembly”: Alternate contigs FASTA
    (REMINDER: Primary contigs FASTA and Alternate contigs FASTA were generated earlier)

By default, Merqury generates three collections as output: stats, plots and QV stats. The “stats” collection contains the completeness statistics, while the “QV stats” collection contains the quality value statistics. Let’s have a look at the assembly CN spectrum plot, known as the spectra-cn plot:

Merqury spectra-cn plot for the pri/alt assemblies.Open image in new tab

Figure 15: Merqury CN plot. This plot tracks the multiplicity of each k-mer found in the Hi-Fi read set and colors it by the number of times it is found in a given assembly. Merqury connects the midpoint of each histogram bin with a line, giving the illusion of a smooth curve.

The black region in the left side corresponds to k-mers found only in the read set; it is usually indicative of sequencing error in the read set, although it can also be indicative of missing sequences in the assembly. The red area represents one-copy k-mers in the genome, while the blue area represents two-copy k-mers originating from homozygous sequences or haplotype-specific duplications. From this figure, we can state that the diploid sequencing coverage is around 50✕, which we also know from the GenomeScope2 plot we looked at earlier.

To get an idea of how the k-mers have been distributed between our Primary and Alternate assemblies, we should look at the spectra-asm output of Merqury.

Merqury spectra-asm plot for the hap1/hap2 assemblies.Open image in new tab

Figure 16: Merqury ASM plot. This plot tracks the multiplicity of each k-mer found in the HiFi read set and colors it according to which assemblies contain those k-mers. This can tell you which k-mers are found in only one assembly or shared between them.

For an idea of what a properly phased spectra-asm plot would look like, please go over to the Hi-C phasing version of this tutorial. A properly phased spectra-asm plot should have a large green peak centered around the point of diploid coverage (here ~50✕), and the two assembly-specific peaks should be centered around the point of haploid coverage (here ~25✕) and resembling each other in size.

The spectra-asm plot we have for our primary & alternate assemblies here does not resemble one that is properly phased. There is a peak of green (shared) k-mers around diploid coverage, indicating that some homozygous regions have been properly split between the primary and alternate assemblies; however, there is still a large red peak of primary-assembly-only k-mers at that coverage value, too, which means that some homozygous regions are being represented twice in the primary assembly, instead of once in the primary and once in the alternate. Additionally, for the haploid peaks, the primary-only peak (in red) is much larger than the alternate-only peak (in blue), indicating that a lot of heterozygous regions might have both their alternate alleles represented in the primary assembly, which is false duplication.

For further confirmation, we can also look at the individual, assembly-specific CN plots. In the Merqury outputs, the output_merqury.assembly_01.spectra-cn.fl is a CN spectra with k-mers colored according to their copy number in the primary assembly:

Merqury spectra-cn plot for the pri assembly only.Open image in new tab

Figure 17: Merqury CN plot for the primary assembly only. This plot colors k-mers according to their copy number in the primary assembly. k-mers that are present in the reads but not the primary assembly are labelled 'read-only'.

In the primary-only CN plot, we observe a large 2-copy (colored blue) peak at diploid coverage. Ideally, this would not be here, because these diploid regions would be 1-copy in both assemblies. Purging this assembly should reconcile this by removing one copy of false duplicates, making these 2-copy k-mers 1-copy. You might notice the ‘read-only’ peak at haploid coverage — this is actually expected, because ‘read-only’ here just means that the k-mer in question is not seen in this specific assembly while it was in the original readset. Often, these ‘read-only’ k-mers are actually present as alternate loci in the other assembly.

Now that we have looked at our primary assembly with multiple QC metrics, we know that it should undergo purging. The VGP pipeline uses purge_dups to remove false duplications from the primary assembly and put them in the alternate assembly to reconcile the haplotypes. Additionally, purge_dups can also find collapsed repeats and regions of suspiciously low coverage.

Purging the primary and alternate assemblies

Before proceeding to purging, we need to carry out some text manipulation operations on the output generated by GenomeScope2 to make it compatible with downstream tools. The goal is to extract some parameters which at a later stage will be used by purge_dups.

Parsing purge_dups cutoffs from GenomeScope2 output

The first relevant parameter is the estimated genome size.

Hands-on: Get estimated genome size

Step 1: Open Replace parts of text ( Galaxy version 1.1.4)
Step 2: Scroll down to find ”+ Insert Find and Replace” button and click it.
Step 3: Scroll down again to find ”+ Insert Find and Replace” button and click it again. After this you should have “Find and Replace” panel repeated three times: “1: Find and Replace”, “2: Find and Replace”, and “3: Find and Replace”.
Step 4: In param-file “File to process”: Select GenomeScope summary output (generated during k-mer profiling step). The input file should have content that looks like this (it may not be exactly like this):

GenomeScope version 2.0
input file = ....
output directory = .
p = 2
k = 31
TESTING set to TRUE

property                      min               max               
Homozygous (aa)               99.4165%          99.4241%          
Heterozygous (ab)             0.575891%         0.583546%         
Genome Haploid Length         11,739,321 bp     11,747,160 bp     
Genome Repeat Length          722,921 bp        723,404 bp        
Genome Unique Length          11,016,399 bp     11,023,755 bp     
Model Fit                     92.5159%          96.5191%          
Read Error Rate               0.000943206%      0.000943206%   

Step 5: In the first Find and Replace panel “1: Find and Replace” set the following parameters:

  1. “Find pattern”: ^(?!Genome Haploid Length).*\n
  2. “Find-Pattern is a regular expression”: Toggle to Yes


Step 6: In the second Find and Replace panel “2: Find and Replace” set the following parameters:

  1. “Find pattern”: Genome Haploid Length\s+(\d{1,3}(?:,\d{3})*\s+bp)\s+(\d{1,3}(?:,\d{3})*)\s+bp
  2. “Replace with”: $2
  3. “Find-Pattern is a regular expression”: Toggle to Yes


Step 7: In the third Find and Replace panel “3: Find and Replace” set the following parameters:
“Find pattern”: , (Yes, just a comma)


Step 8: Rename the output as Estimated genome size.

Question

What is the estimated genome size?

The estimated genome size is 11,747,160 bp.

Now let’s parse the transition between haploid & diploid and upper bound for the read depth estimation parameters. The transition between haploid & diploid represents the coverage value halfway between haploid and diploid coverage, and helps purge_dups identify haplotigs. The upper bound parameter will be used by purge_dups as high read depth cutoff to identify collapsed repeats. When repeats are collapsed in an assembly, they are not as long as they actually are in the genome. This results in a pileup of reads at the collapsed region when mapping the reads back to the assembly.

Hands-on: Get maximum read depth

Step 1: Run Compute on rows ( Galaxy version 2.0) with the following parameters:

  1. param-file “Input file”: model_params (output of GenomeScope tool)
  2. For “1: Expressions”:
    • “Add expression”: round(1.5*c3)
    • “Mode of the operation”: Append
  3. Click galaxy-wf-new Insert Expressions
  4. For “2: Expressions”:
    • “Add expression”: 3*c7
    • “Mode of the operation”: Append

Step 2: Rename it as Parsing purge parameters

Step 3: Run Advanced Cut ( Galaxy version 1.1.0) with the following parameters:

  1. param-file “File to cut”: Parsing purge parameters
  2. “Cut by”: fields
  3. “List of Fields”: Column: 8

Step 4: Rename the output as Maximum depth

Question
  1. What is the estimated maximum depth?
  2. What does this parameter represent?
  1. The estimated maximum depth is 114 reads.
  2. The maximum depth indicates the maximum number of sequencing reads that align to specific positions in the genome.

Now let’s get the transition parameter.

Step 5: Run Advanced Cut ( Galaxy version 1.1.0) with the following parameters:

  1. param-file “File to cut”: Parsing purge parameters
  2. “Cut by”: fields
  3. “List of Fields”: Column: 7

Step 6: Rename the output as Transition parameter

Question

What is the estimated transition parameter?

The estimated transition parameter is 38 reads.

Purging with purge_dups

An ideal haploid representation would consist of one allelic copy of all heterozygous regions in the two haplomes, as well as all hemizygous regions from both haplomes (Guan et al. 2019). However, in highly heterozygous genomes, assembly algorithms are frequently not able to identify the highly divergent allelic sequences as belonging to the same region, resulting in the assembly of those regions as separate contigs. This can lead to issues in downstream analysis, such as scaffolding, gene annotation and read mapping in general (Small et al. 2007, Guan et al. 2019, Roach et al. 2018). In order to solve this problem, we are going to use purge_dups; this tool will allow us to identify and reassign allelic contigs.

This stage consists of three substages: read-depth analysis, generation of all versus all self-alignment and resolution of haplotigs and overlaps (fig. 8).

Post-processing with purge_dups. Open image in new tab

Figure 18: Purge_dups pipeline. Adapted from github.com/dfguan/purge_dups. Purge_dups is integrated in a multi-step pipeline consisting in three main substages. Red indicates the steps which require to use Minimap2.

Read-depth analysis

Initially, we need to collapse our HiFi trimmed reads collection into a single dataset.

Hands-on: Collapse the collection

Step 1: Run Collapse Collection ( Galaxy version 4.2) with the following parameters:

  • param-collection “Collection of files to collapse into single dataset”:HiFi_collection (trim)

Step 2: Rename the output as HiFi reads collapsed

Now, we will map the reads against the primary assembly by using Minimap2 (Li 2018), an alignment program designed to map long sequences.

Hands-on: Map the reads to contigs with Minimap2

*Step 1: Run Map with minimap2 ( Galaxy version 2.17+galaxy4) with the following parameters:

  1. “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
  2. param-file “Use the following dataset as the reference sequence”: Primary contigs FASTA
  3. “Single or Paired-end reads”: Single
  4. param-collection “Select fastq dataset”: HiFi reads collapsed
  5. “Select a profile of preset options”: Long assembly to reference mapping (-k19 -w19 -A1 -B19 -O39,81 -E3,1 -s200 -z200 --min-occ-floor=100). Typically, the alignment will not extend to regions with 5% or higher sequence divergence. Only use this preset if the average divergence is far below 5%. (asm5)
  6. In “Set advanced output options” set “Select an output format”: PAF

Step 2: Rename the output as Reads mapped to contigs

Finally, we will use the Reads mapped to contigs pairwise mapping format (PAF) file for calculating some statistics required in a later stage. In this step, purge_dups (listed as Purge overlaps in Galaxy tool panel) initially produces a read-depth histogram from base-level coverages. This information is used for estimating the coverage cutoffs, taking into account that collapsed haplotype contigs will lead to reads from both alleles mapping to those contigs, whereas if the alleles have assembled as separate contigs, then the reads will be split over the two contigs, resulting in half the read-depth (Roach et al. 2018).

Hands-on: Read-depth analisys

Step 1: Run Purge overlaps ( Galaxy version 1.2.6+galaxy0) with the following parameters:

  1. “Function mode”: Calculate coverage cutoff, base-level read depth and create read depth histogram for PacBio data (calcuts+pbcstats)
  2. param-file “PAF input file”: Reads mapped to contigs
  3. In “Calcuts options”:
    • “Transition between haploid and diploid”: 38
    • “Upper bound for read depth”: 114 (the previously estimated maximum depth)
    • “Ploidy”: Diploid

Step 2: Rename the outputs as PBCSTAT base coverage primary, Histogram plot primary and Calcuts cutoff primary.

Purge overlaps (purge_dups) generates three outputs:

  • PBCSTAT base coverage: it contains the base-level coverage information.
  • Calcuts-cutoff: it includes the thresholds calculated by purge_dups.
  • Histogram plot.

Generation of all versus all self-alignment

Now, we will segment the draft assembly into contigs by cutting at blocks of Ns, and use minimap2 to generate an all by all self-alignment.

Hands-on: purge_dups pipeline

Step 1: Run Purge overlaps ( Galaxy version 1.2.6+galaxy0) with the following parameters:

  1. “Function mode”: split assembly FASTA file by 'N's (split_fa)
  2. param-file “Assembly FASTA file”: Primary contigs FASTA

Step 2: Rename the output as Split FASTA

Step 3: Run Map with minimap2 ( Galaxy version 2.17+galaxy4) with the following parameters:

  1. “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
  2. param-file “Use the following dataset as the reference sequence”: Split FASTA
  3. “Single or Paired-end reads”: Single
  4. param-file “Select fastq dataset”: Split FASTA
  5. “Select a profile of preset options”: Construct a self-homology map - use the same genome as query and reference (-DP -k19 -w 19 -m200) (self-homology)
  6. In “Set advanced output options”: set “Select an output format” to PAF

Step 4: Rename the output as Self-homology map primary

Resolution of haplotigs and overlaps

During the final step of the purge_dups pipeline, it will use the self alignments and the cutoffs for identifying the haplotypic duplications.

In order to identify the haplotypic duplications, purge_dups uses the base-level coverage information to flag the contigs according to the following criteria:

  • If more than 80% bases of a contig are above the high read depth cutoff or below the noise cutoff, it is discarded.
  • If more than 80% bases are in the diploid depth interval, it is labeled as a primary contig, otherwise it is considered further as a possible haplotig.

Contigs that were flagged for further analysis according to read-depth are then evaluated to attempt to identify synteny with its allelic companion contig. In this step, purge_dups uses the information contained in the self alignments:

  • If the alignment score is larger than the cutoff s (default 70), the contig is marked for reassignment as haplotig. Contigs marked for reassignment with a maximum match score greater than the cutoff m (default 200) are further flagged as repetitive regions.

  • Otherwise contigs are considered as a candidate primary contig.

Once all matches associated with haplotigs have been removed from the self-alignment set, purge_dups ties consistent matches between the remaining candidates to find collinear matches, filtering all the matches whose score is less than the minimum chaining score l.

Finally, purge_dups calculates the average coverage of the matching intervals for each overlap, and mark an unambiguous overlap as heterozygous when the average coverage on both contigs is less than the read-depth cutoff, removing the sequences corresponding to the matching interval in the shorter contig.

Hands-on: Resolution of haplotigs and overlaps

Step 1: Purge overlaps ( Galaxy version 1.2.6+galaxy0) with the following parameters:

  1. “Select the purge_dups function”: Purge haplotigs and overlaps for an assembly (purge_dups)
  2. param-file “PAF input file”: Self-homology map primary
  3. param-file “Base-level coverage file”: PBCSTAT base coverage primary
  4. param-file “Cutoffs file”: calcuts cutoff primary

Step 2: Rename the output as purge_dups BED

Step 3: Purge overlaps ( Galaxy version 1.2.6+galaxy0) with the following parameters:

  1. “Select the purge_dups function”: Obtain sequences after purging (get_seqs)
  2. param-file “Assembly FASTA file”: Primary contigs FASTA
  3. param-file “BED input file”: purge_dups BED (output of the previous step)

Step 4: Rename the output get_seq purged sequences as Primary contigs purged and the get_seq haplotype file as Alternate haplotype contigs

Process the alternate assembly

Now we should repeat the same procedure with the alternate contigs generated by hifiasm. In that case, we should start by merging the Alternate haplotype contigs generated in the previous step and the Alternate contigs FASTA file.

Hands-on: Merge the purged sequences and the Alternate contigs

Step 1: Concatenate datasets with the following parameters:

  1. param-file “Concatenate Dataset”: Alternate contigs FASTA
  2. In “Dataset”:
  3. param-repeat “Insert Dataset”
  4. param-file “Select”: Alternate haplotype contigs
Comment

Remember that the Alternate haplotype contigs file contains those contigs that were considered to be haplotypic duplications of the primary contigs.

Step 2: Rename the output as Alternate contigs full

Once we have merged the files, we should run the purge_dups pipeline again, but using the Alternate contigs full file as input.

Hands-on: Process the alternate assembly with purge_dups

Step 1: Run Map with minimap2 ( Galaxy version 2.17+galaxy4) with the following parameters:

  1. “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
  2. param-file “Use the following dataset as the reference sequence”: Alternate contigs full
  3. “Single or Paired-end reads”: Single
  4. param-collection “Select fastq dataset”: HiFi reads collapsed
  5. “Select a profile of preset options”: Long assembly to reference mapping (-k19 -w19 -A1 -B19 -O39,81 -E3,1 -s200 -z200 --min-occ-floor=100). Typically, the alignment will not extend to regions with 5% or higher sequence divergence. Only use this preset if the average divergence is far below 5%. (asm5) (Note asm5 at the end!)
  6. In “Set advanced output options” set “Select an output format” to PAF

Step 2: Rename the output as Reads mapped to contigs alternate

Step 3: Purge overlaps ( Galaxy version 1.2.6+galaxy0) with the following parameters:

  1. “Function mode”: Calculate coverage cutoff, base-level read depth and create read depth histogram for PacBio data (calcuts+pbcstats)
  2. param-file “PAF input file”: Reads mapped to contigs alternate
  3. In “Calcuts options”:
  4. “Upper bound for read depth”: 114
  5. “Ploidy”: Diploid

Step 4: Rename the outputs as PBCSTAT base coverage alternate, Histogram plot alternate and Calcuts cutoff alternate.

Step 5: Run Purge overlaps ( Galaxy version 1.2.6+galaxy0) with the following parameters:

  1. “Function mode”: split assembly FASTA file by 'N's (split_fa)
  2. param-file “Assembly FASTA file”: Alternate contigs full

Step 5: Rename the output as Split FASTA alternate

Step 6: Run Map with minimap2 ( Galaxy version 2.17+galaxy4) with the following parameters:

  1. “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
  2. param-file “Use the following dataset as the reference sequence”: Split FASTA alternate
  3. “Single or Paired-end reads”: Single
  4. param-file “Select fastq dataset”: Split FASTA alternate
  5. “Select a profile of preset options”: Construct a self-homology map - use the same genome as query and reference (-DP -k19 -w 19 -m200) (self-homology)
  6. In “Set advanced output options” set “Select an output format” to PAF

Step 7: Rename the output as Self-homology map alternate

Step 8: Run Purge overlaps ( Galaxy version 1.2.6+galaxy0) with the following parameters:

  1. “Select the purge_dups function”: Purge haplotigs and overlaps for an assembly (purge_dups)
  2. param-file “PAF input file”: Self-homology map alternate
  3. param-file “Base-level coverage file”: PBCSTAT base coverage alternate
  4. param-file “Cutoffs file”: calcuts cutoff alternate

Step 9: Rename the output as purge_dups BED alternate

Step 10: Run Purge overlaps ( Galaxy version 1.2.6+galaxy0) with the following parameters:

  1. “Select the purge_dups function”: Obtain sequences after purging (get_seqs)
  2. param-file “Assembly FASTA file”: Alternate contigs full
  3. param-file “BED input file”: purge_dups BED alternate

Step 11: Rename the outputs as Alternate contigs purged and Alternate haplotype contigs.

Post-purge quality control

Recall that, prior to purging, our primary assembly showed it needed to be purged due to several quality metrics: there was a large amount of BUSCO duplicated genes, many 2-copy k-mers that were primary-only instead of shared, and the primary was ~2/3 times larger than the expected genome size. Let’s check if these metrics changed after purging.

Hands-on: Evaluating the purged assemblies

Step 1: Run gfastats ( Galaxy version 1.3.6+galaxy0) with the following parameters:

  1. param-files “Input file”: select Primary contigs purged and the Alternate contigs purged datasets
  2. “Expected genome size”: 11747160 (remember we calculated this value earlier, so it should be in your history!)

Step 2:. Rename the outputs as Primary purged stats and Alternate purged stats

Step 3: Run Column join ( Galaxy version 0.0.3) with the following parameters:

  • param-files “Input file”: select Primary purged stats and the Alternate purged stats datasets

Step 4: Rename the output as gfastats on purged pri and alt (full)

Step 5: Search in textfiles ( Galaxy version 1.1.1) with the following parameters:

  1. param-files “Input file”: select gfastats on purged pri and alt (full)
  2. “that”: Don't Match
  3. “Type of regex”: Basic
  4. “Regular Expression”: [Ss]caffold

Step 6: Rename the output as gfastats on purged pri and alt contigs

Step 7: Busco ( Galaxy version 5.5.0+galaxy0) with the following parameters:

  1. param-files “Sequences to analyze”: Primary contigs purged
  2. “Mode”: Genome assemblies (DNA)
  3. “Use Augustus instead of Metaeuk”: Use Metaeuk
  4. “Auto-detect or select lineage?”: Select lineage
  5. “Lineage”: Saccharomycetes
  6. “Which outputs should be generated”: short summary text and summary image

Step 8: Merqury ( Galaxy version 1.3+galaxy3) with the following parameters:

  1. “Evaluation mode”: Default mode
  2. param-file “k-mer counts database”: Merged meryldb
  3. “Number of assemblies”: `Two assemblies
  4. param-file “First genome assembly”: Primary contigs purged
  5. param-file “Second genome assembly”: Alternate contigs purged

The summary statistics indicate that both assemblies are now of a similar size that is much closer to the expected genome size: the purged primary assembly is 17 contigs totalling ~12.1 Mbp, while the purged alternate assembly is 16 contigs that amount to ~11.3 Mbp. This makes sense since, during purging, we removed sequences from the primary assembly and added them to the alternate assembly. But did we remove the right sequences? Remember, the sequences we are aiming to re-distribute are the false duplications that were flagged by BUSCO.

BUSCO for primary assembly after purging.Open image in new tab

Figure 19: BUSCO for the primary assembly after purging.

The BUSCO results for the purged primary assembly look much better, since we no longer have the large amount of duplicate BUSCOs that we previously had. Additionally, there is no large increase in missing BUSCOs, indicating that we have not over-purged the primary assembly.

The previous metrics tell us that the primary is likely fixed after purging, but what about the previously incomplete alternate assembly? Let’s see if the Merqury spectra plots show any change in how k-mers are split up between the two assemblies:

Merqury spectra-asm plot after purging.Open image in new tab

Figure 20: Merqury ASM plot after purging.

This looks a lot better! The diploid regions are all shared between the two assemblies (the large green peak centered at 50x, the diploid coverage value), and the haplotypic variation is shared between the primary and alternate assemblies (the red and blue peaks centered around 25✕, the haploid coverage value).

Merqury spectra-cn plot for primary assembly after purging.Open image in new tab

Figure 21: Merqury CN plot for the primary assembly only after purging.

Additionally, when we look at the primary-only CN plot, we see that the large peak of 2-copy k-mers is now gone, since those regions are now represented at 1-copy only in the primary assembly, as they should be.

Scaffolding

At this point, we have a set of contigs, which may or may not be fully phased, depending on how we ran hifiasm. Next, the contigs will be assembled into scaffolds, i.e., sequences of contigs interspaced with gaps. The VGP pipeline currently scaffolds using two additional technologies: Bionano optical maps and Hi-C data.

Comment: What assembly am I scaffolding??

For the purposes of this tutorial, the scaffolding hands-on exercises will be referring to a Hap1 assembly produced with Hi-C mode of hifiasm. If you have hap1 contigs or hap2 contigs, then you can also follow along just using Primary purged contigs or Alternate purged contigs. Wherever the tutorial refers to primary contigs, just replace it with whichever haplotype you are scaffolding.

Schematic of scaffolding contigs to generate a draft assembly, and then curating the draft scaffolds to generate a curated assembly.Open image in new tab

Figure 22: Scaffolding uses additional information to join together contigs. Optical maps, such as those produced using the Bionano Saphyr, as well as Hi-C data such as those generated by Arima Hi-C or Dovetail OmniC kits, are used to re-order and orient contigs according to long-range information. This generates the draft scaffold-level genome. This draft genome then undergoes manual curation, which uses information such as coverage tracks or repeat annotations in order to further orient scaffolds and correct misassemblies such as mis-joins and missed joins. Image adapted from Rhie et al. 2020.

Hybrid scaffolding with Bionano optical maps

In this step, the linkage information provided by optical maps is integrated with primary assembly sequences, and the overlaps are used to orient and order the contigs, resolve chimeric joins, and estimate the length of gaps between adjacent contigs. One of the advantages of optical maps is that they can easily span genomic regions that are difficult to resolve using DNA sequencing technologies (Savara et al. 2021, Yuan et al. 2020).

Bionano technology relies on the isolation of kilobase-long DNA fragments, which are labeled at specific sequence motifs with a fluorescent dye, resulting in a unique fluorescent pattern for each genome. DNA molecules are stretched into nanoscale channels and imaged with a high-resolution camera, allowing us to build optical maps that include the physical locations of labels rather than base-level information (Lam et al. 2012, Giani et al. 2020, Savara et al. 2021).

Bionano optical maps. Open image in new tab

Figure 23: Bionano optical maps. Source: https://bionanogenomics.com

The average optical map molecule length, around 225 kbp, is substantially larger than the PacBio HiFi reads, with read lengths averaging 10-25 kbp.

The Bionano Hybrid Scaffold tool automates the scaffolding process, which includes five main steps:

  1. Generate in silico maps for sequence assembly.
  2. Align in silico sequence maps against Bionano genome maps to identify and resolve potential conflicts.
  3. Merge the non-conflicting maps into hybrid scaffolds.
  4. Align sequence maps to the hybrid scaffolds
  5. Generate AGP and FASTA files for the scaffolds.

Before we begin, we need to upload BioNano data:

Hands-on: Uploading BioNano datasets from Zenodo

Step 1: Copy the following URLs into clipboard. You can do this by clicking on copy button in the right upper corner of the box below. It will appear if you mouse over the box.

   https://zenodo.org/records/5887339/files/bionano.cmap

Step 2: Upload datasets into Galaxy

  • set the datatype to cmap

The box below explains how to upload data if you forgot. Just make sure you set dataset type to cmap.

  • Copy the link location
  • Click galaxy-upload Upload Data at the top of the tool panel

  • Select galaxy-wf-edit Paste/Fetch Data
  • Paste the link(s) into the text field

  • Change Type (set all): from “Auto-detect” to fasta

  • Press Start

  • Close the window

Hands-on: Bionano hybrid scaffolding

Step1: Run Bionano Hybrid Scaffold ( Galaxy version 3.7.0+galaxy3) with the following parameters:

  1. param-file “NGS FASTA”: Hap1 contigs FASTA generated during hifiasm contigging.
  2. param-file “BioNano CMAP”: Bionano_dataset we just uploaded
  3. “Configuration mode”: VGP mode
  4. “Genome maps conflict filter”: Cut contig at conflict
  5. “Sequences conflict filter”: Cut contig at conflict
Comment

If your data are not associated with VGP, make sure that the configuration mode fits with your samples.

Step 2: Run Concatenate datasets with the following parameters:

  1. param-file “Concatenate Dataset”: NGScontigs scaffold NCBI trimmed (output of Bionano Hybrid Scaffold tool)
  2. param-repeat “Insert Dataset”
  3. param-file “Select”: NGScontigs not scaffolded trimmed (output of Bionano Hybrid Scaffold tool)

Step 3: Rename the output as Hap1 assembly bionano

Evaluating Bionano scaffolds

Let’s evaluate our scaffolds to see the impact of scaffolding on some key assembly statistics.

Hands-on: Bionano assembly evaluation gfastats
  1. gfastats ( Galaxy version 1.3.6+galaxy0) with the following parameters:
    • param-files “Input file”: select Primary assembly bionano
    • “Expected genome size”: 11747160 (remember we calculated this value earlier, so it should be in your history!)
  2. Rename the outputs as Bionano stats.
Question
  1. How many scaffolds are in the primary assembly after the hybrid scaffolding?
  2. What is the size of the largest scaffold? Has improved with respect to the previous evaluation?
  1. The number of contigs is 17.
  2. The largest contig is 1,531,728 bp long. This value hasn’t changed. This is likely due to the fact that this is a downsampled training dataset.

Hi-C scaffolding

Hi-C is a sequencing-based molecular assay designed to identify regions of frequent physical interaction in the genome by measuring the contact frequency between all pairs of loci, allowing us to provide an insight into the three-dimensional organization of a genome (Dixon et al. 2012, Lieberman-Aiden et al. 2009). In this final stage, we will exploit the fact that the contact frequency between a pair of loci strongly correlates with the one-dimensional distance between them with the objective of linking the Bionano scaffolds to a chromosome scale.

The high-throughput chromosome conformation capture (Hi-C) technology is based on the capture of the chromatin in three-dimensional space. During Hi-C library preparation, DNA is crosslinked in its 3D conformation. Then, the DNA is digested using restriction enzymes, and the digested ends are filled with biotinylated nucleotides. The biotinylated nucleotides enable the specific purification of the ligation junctions, preventing the sequencing of DNA molecules that do not contain such junctions which are thus mostly uninformative (Lajoie et al. 2015).

Hi-C protocol. Open image in new tab

Figure 24: Hi-C protocol. Adapted from Rao et al. 2014

Next, the blunt ends of the spatially proximal digested end are ligated. Each DNA fragment is then sequenced from each end of this artificial junction, generating read pairs. This provides contact information that can be used to reconstruct the proximity of genomic sequences belonging to the same chromosome (Giani et al. 2020). Hi-C data are in the form of two-dimensional matrices (contact maps) whose entries quantify the intensity of the physical interaction between genome regions.

Hi-C scaffolding and haplotyping logic.Open image in new tab

Figure 25: Schematic of how Hi-C information is used to scaffold and orient contigs along chromosomes using long-range information, as well as link separated haplotype blocks into chromosome-scale haplotypes Korbel and Lee 2013.

Pre-processing Hi-C data

Despite Hi-C generating paired-end reads, we need to map each read separately. This is because most aligners assume that the distance between paired-end reads fit a known distribution, but in Hi-C data the insert size of the ligation product can vary between one base pair to hundreds of megabases (Lajoie et al. 2015).

Hands-on: Mapping Hi-C reads

Step 1: Run BWA-MEM2 ( Galaxy version 2.2.1+galaxy1) with the following parameters:

  1. “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
  2. param-file “Use the following dataset as the reference sequence”: Hap1 assembly bionano
  3. “Single or Paired-end reads”: Single
  4. param-file “Select fastq dataset”: Hi-C_dataset_F
  5. “Set read groups information?”: Do not set
  6. “Select analysis mode”: 1.Simple Illumina mode
  7. “BAM sorting mode”: Sort by read names (i.e., the QNAME field)

Step 2: Rename the output as BAM forward

Now let’s do the same for reverse Hi-C reads:

Step 3: Run BWA-MEM2 ( Galaxy version 2.2.1+galaxy1) with the following parameters:

  1. “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
  2. param-file “Use the following dataset as the reference sequence”: Hap1 assembly bionano
  3. “Single or Paired-end reads”: Single
  4. param-file “Select fastq dataset”: Hi-C_dataset_R
  5. “Set read groups information?”: Do not set
  6. “Select analysis mode”: 1.Simple Illumina mode
  7. “BAM sorting mode”: Sort by read names (i.e., the QNAME field)

Step 4: Rename the output as BAM reverse

  1. Filter and merge ( Galaxy version 1.0+galaxy1) chimeric reads from Arima Genomics with the following parameters:
    • param-file “First set of reads”: BAM forward
    • param-file “Second set of reads”: BAM reverse
  2. Rename it as BAM Hi-C reads

Generate initial Hi-C contact map

After mapping the Hi-C reads, the next step is to generate an initial Hi-C contact map, which will allow us to compare the Hi-C contact maps before and after using the Hi-C for scaffolding.

Comment: Biological basis of Hi-C contact maps

Hi-C contact maps reflect the interaction frequency between genomic loci. In order to understand the Hi-C contact maps, it is necessary to take into account two factors: the higher interaction frequency between loci that reside in the same chromosome (i.e., in cis), and the distance-dependent decay of interaction frequency (Lajoie et al. 2015).

The higher interaction between cis regions can be explained, at least in part, by the territorial organization of chromosomes in interphase (chromosome territories), and in a genome-wide contact map, this pattern appears as blocks of high interaction centered along the diagonal and matching individual chromosomes (Cremer and Cremer 2010, Lajoie et al. 2015):

Hi-C map. Open image in new tab

Figure 26: An example of a Hi-C map. Genomic regions are arranged along the X and Y axes, and contacts are colored on the matrix like a heat map; here darker color indicates greater interaction frequency.

On the other hand, the distance-dependent decay may be due to random movement of the chromosomes, and in the contact map appears as a gradual decrease of the interaction frequency the farther away from the diagonal it moves (Lajoie et al. 2015).

Hands-on: Generate a contact map with PretextMap and Pretext Snapshot

Step 1: Run PretextMap ( Galaxy version 0.1.9+galaxy0) with the following parameters:

  1. param-file “Input dataset in SAM or BAM format”: BAM Hi-C reads
  2. “Sort by”: Don't sort

Step 2: Rename the output as PretextMap output

Step 3: Run Pretext Snapshot ( Galaxy version 0.0.3+galaxy1) with the following parameters:

  1. param-file “Input Pretext map file”: PretextMap output
  2. “Output image format”: png
  3. “Show grid?”: Yes

Let’s have a look at the Hi-C contact maps generated by Pretext Snapshot.

Pretext optical map. Open image in new tab

Figure 27: Hi-C map generated by Pretext. Primary assembly full contact map generated in this training (a) Hi-C map representative of a typical missasembly (b).

In the contact generated from the Bionano-scaffolded assembly can be identified 17 scaffolds, representing each of the haploid chromosomes of our genome (panel a above). The fact that all the contact signals are found around the diagonal suggest that the contigs were scaffolded in the right order. However, during the assembly of complex genomes, it is common to find in the contact maps indicators of errors during the scaffolding process, as shown in panel b. In that case, a contig belonging to the second chromosome has been misplaced as part of the fourth chromosome. We can also note that the final portion of the second chromosome should be placed at the beginning, as the off-diagonal contact signal suggests.

Once we have evaluated the quality of the scaffolded genome assembly, the next step consists in integrating the information contained in the HiC reads into our assembly, so that any errors identified can be resolved. For this purpose we will use YaHS (Zhou et al. 2022).

YaHS scaffolding

YaHS is an open source software that makes use of Hi-C to linearly orient and order assembled contigs along entire chromosomes (Zhou et al. 2022). One of the advantages of Yahs with respect to most existing Hi-C scaffolding tools is that it doesn’t require the estimated number of chromosomes.

Comment: YAHS algorithm overview

YaHS takes the alignment between Hi-C reads and contigs, breaking contigs at positions lacking Hi-C coverage to address potential assembly errors.

Scaffolding occurs in multiple rounds, with YaHS creating a contact matrix for each contig by splitting it into chunks. Hi-C contact signals are assigned to cells within contigs (intra-cells) and between contigs (inter-cells; see figure below). Joining scores for contig pairs are calculated based on normalized contact frequencies, aiming for similar inter-cell frequencies between neighboring contigs.

Optionally considering Hi-C library restriction enzymes, YaHS normalizes contact frequencies by the corresponding number of cutting sites. A scaffolding graph is constructed with contigs as nodes and contig joins as weighted edges. The graph undergoes simplification steps, including edge filtering, tip and blunt end trimming, repeat resolution, transitive edge removal, bubble and ambiguous orientation resolution, weak edge trimming, and ambiguous edge removal.

Finally, the graph is traversed to assemble scaffolds along continuous paths. Optionally, a second step of assembly error correction breaks scaffolds at positions lacking sufficient Hi-C coverage. YaHS employs a hierarchical joining process with multiple rounds of scaffolding at decreasing resolutions (increasing chunk sizes), using previous round scaffolds as input.

 An overview of YaHS. Open image in new tab

Figure 28: Schematic of how Hi-C information is used to scaffold and orient contigs along chromosomes using long-range information, as well as link separated haplotype blocks into chromosome-scale haplotypes (from Zhou et al. 2022

See Zhou et al. 2022 for exact details.

Now we can launch YaHS in order to generate the hybrid scaffolding based on the Hi-C data.

Hands-on: YaHS scaffolding

Step 1: Run YaHS ( Galaxy version 1.2a.2+galaxy1) with the following parameters:

  1. param-file “Input contig sequences”: Hap1 assembly bionano
  2. param-file “Alignment file of Hi-C reads to contigs“*: BAM Hi-C reads
  3. “Restriction enzyme used in Hi-C experiment”: set to Enter a specific sequence
  4. “Restriction enzyme sequence(s)”: Enter CTTAAG

Step 2: Rename YAHS on data NNN and data NNN: Final scaffolds fasta output as YaHS Scaffolds FASTA

Evaluate the final genome assembly with Pretext

Finally, we should repeat the procedure described previously for generating the contact maps, but in that case, we will use the scaffold generated by YaHS.

Hands-on: Mapping reads against the scaffold

Step 1: Run BWA-MEM2 ( Galaxy version 2.2.1+galaxy1) with the following parameters:

  1. “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
  2. param-file “Use the following dataset as the reference sequence”: YaHS Scaffolds FASTA
  3. “Single or Paired-end reads”: Single
  4. param-file “Select fastq dataset”: Hi-C_dataset_F
  5. “Set read groups information?”: Do not set
  6. “Select analysis mode”: 1.Simple Illumina mode
  7. “BAM sorting mode”: Sort by read names (i.e., the QNAME field)

Step 2: Rename the output as BAM forward YaHS

Step 3: Run BWA-MEM2 ( Galaxy version 2.2.1+galaxy1) with the following parameters:

  1. “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index
  2. param-file “Use the following dataset as the reference sequence”: YaHS Scaffolds FASTA
  3. “Single or Paired-end reads”: Single
  4. param-file “Select fastq dataset”: Hi-C_dataset_R
  5. “Set read groups information?”: Do not set
  6. “Select analysis mode”: 1.Simple Illumina mode
  7. “BAM sorting mode”: Sort by read names (i.e., the QNAME field)

Step 4: Rename the output as BAM reverse YaHS

Step 5: Run Filter and merge ( Galaxy version 1.0+galaxy1) chimeric reads from Arima Genomics with the following parameters:

  1. param-file “First set of reads”: BAM forward YaHS
  2. param-file “Second set of reads”: BAM reverse YaHS

Step 6: Rename the output as BAM Hi-C reads SALSA2

Step 7: Run PretextMap ( Galaxy version 0.1.9+galaxy0) with the following parameters:

  1. param-file “Input dataset in SAM or BAM format”: BAM Hi-C reads YaHS
  2. “Sort by”: Don't sort

Step 8: Rename the output as PretextMap output YaHS

Step 9: Run Pretext Snapshot ( Galaxy version 0.0.3+galaxy1) with the following parameters:

  1. param-file “Input Pretext map file”: PretextMap output YaHS
  2. “Output image format”: png
  3. “Show grid?”: Yes

In order to evaluate the Hi-C hybrid scaffolding, we are going to compare the contact maps before and after running YaHS:

Pretext final contact map. Open image in new tab

Figure 29: Hi-C map generated by Pretext after the hybrid scaffolding based on Hi-C data. The red circles indicate the differences between the contact map generated after (a) and before (b) Hi-C hybrid scaffolding.

Among the most notable differences that can be identified between the contact maps, it can be highlighted the regions marked with red circles, where inversion can be identified.

Conclusion

To sum up, it is worthwhile to compare the final assembly with the S. cerevisiae_ S288C reference genome.

Table 1: Final stats. Open image in new tab

Figure 30: Comparison between the final assembly generated in this training and the reference genome. Contiguity plot using the reference genome size (a). Assembly statistics (b).

With respect to the total sequence length, we can conclude that the size of our genome assembly is almost identical to the reference genome (figure above). Regarding the number of scaffolds, the obtained value is similar to the reference assembly, which consists of 16 chromosomes plus the mitochondrial DNA, which consists of 85,779 bp. The remaining statistics exhibit very similar values (panel b above).

Comparison reference genome. Open image in new tab

Figure 31: Comparison between contact maps generated by using the final assembly (a) and the reference genome (b).

If we compare the contact map of our assembled genome (panel a above) with the reference assembly (panel b above), we can see that the two are essentially identical. This means that we have achieved an almost perfect assembly at the chromosome level.