Assembly

Introduction

At this time, DNA sequencing technology cannot read the complete nucleotide sequence of an entire chromosomal DNA strand from a sample of target organism DNA. Chromosomes are simply too large, so the DNA strands must first be physically broken into pieces, the pieces sequenced, and the reference chromosome sequences reconstructed from the resulting reads.

Genome assemblers are complex software workflows that require high-performance computing resources, significant operator skill, and real experience in optimizing the genome reconstruction process, which can be very different depending upon the genome complexity, heterozygosity and ploidy of a given species.

At this train station, we discuss the process of putting end sequences from chromosomal DNA fragments back together into a reference genome assembly. It should be noted that no reference genome is perfect: it will have gaps and misassemblies. These experimental errors can sometimes be detected and corrected (i.e. finished) using bioinformatic and molecular biology approaches.

Given that no genome assembly is perfect, it is up to the genome assembly team to decide what level of experimental error is acceptable for the requirements of the experimental design. Thus, we will also discuss objective tools for checking that a genome assembly is “good enough”, rather than assuming it is of sufficient quality simply because it contains “a lot of data”.

Tools

Here are some useful software tools for genome assembly:

Overlap-layout-consensus assemblers. These assemblers use alignments between reads to generate a consensus sequence and are best suited to longer-read technologies. Examples include:

Celera wgs-assembler

Newbler (Roche)

String Graph Assembler
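The overlap idea behind these assemblers can be illustrated with a toy sketch. The code below is not how any of the assemblers listed above is implemented: it assumes error-free reads, uses exact suffix-to-prefix matching instead of real alignment, and merges reads greedily. The function names (`overlap_length`, `greedy_assemble`) and the `min_overlap` parameter are hypothetical and chosen only for illustration.

```python
from typing import List


def overlap_length(a: str, b: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `a` that exactly matches
    a prefix of `b`, or 0 if no overlap of at least `min_overlap` exists."""
    max_len = min(len(a), len(b))
    for length in range(max_len, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Repeatedly merge the pair of reads with the longest exact overlap.

    Toy assumption: reads are error-free and all from the same strand.
    Returns the remaining sequences (contigs) once no overlaps are left.
    """
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, -1, -1
        # Overlap step: compare every ordered pair of reads.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap_length(a, b, min_overlap)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps; stop merging
        # Layout/consensus step (trivial here): join the best pair.
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    example_reads = ["ACGTTGCA", "TGCAGGTC", "GGTCAAAT"]
    print(greedy_assemble(example_reads))
```

Comparing every pair of reads scales quadratically with the number of reads, which is one reason this style of approach is usually associated with smaller numbers of longer reads.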

Graph-based assemblers. These assemblers split the sequence reads into k-mers and derive a consensus sequence from weighted graphs; they are best suited to highly accurate short-read technologies. Examples include:

Velvet

ABySS

SOAPdenovo

ALLPATHS
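The k-mer splitting step can be sketched as follows. This is a minimal illustration, not the implementation used by any of the assemblers above: it assumes a de Bruijn-style graph in which each k-mer becomes an edge between its (k-1)-base prefix and suffix, with the edge weight equal to the number of times that k-mer was seen in the reads. The function names `kmers` and `build_weighted_graph` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def kmers(read: str, k: int) -> List[str]:
    """Return every substring of length k in the read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_weighted_graph(reads: List[str], k: int) -> Dict[Tuple[str, str], int]:
    """Build a de Bruijn-style graph from the reads.

    Each k-mer contributes one edge from its first k-1 bases to its
    last k-1 bases. The edge weight counts how often the k-mer occurs.
    """
    edges: Counter = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            edges[(kmer[:-1], kmer[1:])] += 1
    return dict(edges)


if __name__ == "__main__":
    graph = build_weighted_graph(["ACGTA", "CGTAC"], k=3)
    for (src, dst), weight in sorted(graph.items()):
        print(f"{src} -> {dst} (weight {weight})")
```

Edges with high weights are supported by many reads, while low-weight edges are more likely to come from sequencing errors. This is one reason the approach suits accurate, high-coverage short-read data.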

Notes. Available computational resources also need to be considered when choosing an assembler (for example, ABySS was designed for distributed computing systems, while SOAPdenovo runs best on single servers with multiple processors).

In recent years, there has been increasing interaction between experimental design and the development of assembly approaches (for example, the ALLPATHS-LG assembler is designed to assemble a particular combination of fragment and mate-pair libraries; Gnerre et al., 2011). For an overview of performing assemblies, see Nagarajan and Pop (2013).

Training materials