We are searching data for your request:

**Forums and discussions:**

**Manuals and reference books:**

**Data from registers:**

**Wait the end of the search in all databases.**

Upon completion, a link will appear to access the found materials.

Upon completion, a link will appear to access the found materials.

I implemented own generic de Bruijn graph, which I use for DNA assembly (alphabet: A, C, G, T). I try to find purpose of de Bruijn graph for another bioinformatics problems, if some exists. I want use/modify my implementation to solve this. Are there another problems in bioinformatics which we can solve with de Bruijn graphs?

I found only RNA assembly, which is similar to DNA assembly.

## Assembly of long error-prone reads using de Bruijn graphs

When the long reads generated using single-molecule se-quencing (SMS) technology were made available, most researchers were skeptical about the ability of existing algorithms to generate high-quality assemblies from long error-prone reads. Nevertheless, recent algorithmic breakthroughs resulted in many successful SMS sequencing projects. However, as the recent assemblies of important plant pathogens illustrate, the problem of assembling long error-prone reads is far from being resolved even in the case of relatively short bacterial genomes. We propose an algorithmic approach for assembling long error-prone reads and describe the ABruijn assembler, which results in accurate genome reconstructions.

Euler, L. *Commentarii Academiae Scientiarum Petropolitanae* **8**, 128–140 (1741).

Skiena, S. *The Algorithm Design Manual* (Springer, Berlin, 2008).

Lander, E. et al. *Nature* **409**, 860–921 (2001).

Venter, J.C. et al. *Science* **291**, 1304–1351 (2001).

Kececioglu, J. & Myers, E. *Algorithmica* **13**, 7–51 (1995).

Adams, M. et al. *Science* **287**, 2185–2195 (2000).

Fleischmann, R. et al. *Science* **269**, 496–512 (1995).

Schatz, M., Delcher, A. & Salzberg, S. *Genome Res.* **20**, 1165–1173 (2010).

Bandeira, N., Pham, V., Pevzner, P., Arnott, D. & Lill, J. *Nat. Biotechnol.* **26**, 1336–1338 (2008).

Pham, S. & Pevzner, P.A. *Bioinformatics* **26**, 2509–2516 (2010).

Grabherr, M. et al. *Nat. Biotechnol.* **29**, 644–652 (2011).

de Bruijn, N. *Proc. Nederl. Akad. Wetensch.* **49**, 758–764 (1946).

Idury, R. & Waterman, M. *J. Comput. Biol.* **2**, 291–306 (1995).

Pevzner, P.A., Tang, H. & Waterman, M. *Proc. Natl. Acad. Sci. USA* **98**, 9748–9753 (2001).

Pevzner, P.A., Tang, H. & Tesler, G. *Genome Res.* **14**, 1786–1796 (2004).

Chaisson, M. & Pevzner, P.A. *Genome Res.* **18**, 324–330 (2008).

Zerbino, D. & Birney, E. *Genome Res.* **18**, 821–829 (2008).

Butler, J. et al. *Genome Res.* **18**, 810–820 (2008).

Simpson, J. et al. *Genome Res.* **19**, 1117–1123 (2009).

Li, R. et al. *Genome Res.* **20**, 265–272 (2010).

Paszkiewicz, K. & Studholme, D. *Brief. Bioinform.* **11**, 457–472 (2010).

Miller, J., Koren, S. & Sutton, G. *Genomics* **95**, 315–327 (2010).

Drmanac, R., Labat, I., Brukner, I. & Crkvenjakov, R. *Genomics* **4**, 114–128 (1989).

Southern, E. United Kingdom patent application gb8810400 (1988).

Lysov, Y. et al. *Doklady Academy Nauk USSR* **303**, 1508–1511 (1988).

Pevzner, P.A. *J. Biomol. Struct. Dyn.* **7**, 63–73 (1989).

## De-Bruijn graph with MapReduce framework towards metagenomic data classification

Metagenomic gene classifications are significant in bioinformatics and computational biology research. There are huge interrelated datasets that deal with human characteristics, diseases and molecular functionalities. Analysis of metagenomic reorganization is a challenging issue due to their diversity and efficient classification tools. Graph based MapReducing approach can easily handle the genomic diversity. MapReduce has two parts such as mapping and reducing. In mapping phase, a recursive naive algorithm is used for generating K-mers. De-Bruijn graph is a compact representation of k-mers and finds out an optimal path (solution) for genome assembly. Similarity metrics have been utilized for finding similarity among the De-Oxy Ribonucleic Acid (DNA) sequences. In reducing side, Jaccard similarity and purity of clustering are used as datasets classifier to classify the sequences based on their similarity. Reducing phase can easily classify the DNA sequences from large database. Extensive experimental analysis has demonstrated that graph based MapReduce analysis generate optimal solutions. Remarkable improvements in time and space have recorded and observed. The results established that proposed framework performed faster than SSMA-SFSD when classified elements are increased. It provided better accuracy for metagenomic data clustering.

There are two types of algorithms that are commonly utilized by these assemblers: greedy, which aim for local optima, and graph method algorithms, which aim for global optima. Different assemblers are tailored for particular needs, such as the assembly of (small) bacterial genomes, (large) eukaryotic genomes, or transcriptomes.

**Greedy algorithm assemblers** are assemblers that find local optima in alignments of smaller reads. Greedy algorithm assemblers typically feature several steps: 1) pairwise distance calculation of reads, 2) clustering of reads with greatest overlap, 3) assembly of overlapping reads into larger contigs, and 4) repeat. These algorithms typically do not work well for larger read sets, as they do not easily reach a global optimum in the assembly, and do not perform well on read sets that contain repeat regions. [1] Early de novo sequence assemblers, such as SEQAID [2] (1984) and CAP [3] (1992), used greedy algorithms, such as overlap-layout-consensus (OLC) algorithms. These algorithms find overlap between all reads, use the overlap to determine a layout (or tiling) of the reads, and then produce a consensus sequence. Some programs that used OLC algorithms featured filtration (to remove read pairs that will not overlap) and heuristic methods to increase speed of the analyses.

**Graph method assemblers** [4] come in two varieties: string and De Bruijn. String graph and De Bruijn graph method assemblers were introduced at a DIMACS [5] workshop in 1994 by Waterman [6] and Gene Myers. [7] These methods represented an important step forward in sequence assembly, as they both use algorithms to reach a global optimum instead of a local optimum. While both of these methods made progress towards better assemblies, the De Bruijn graph method has become the most popular in the age of next-generation sequencing. During the assembly of the De Bruijn graph, reads are broken into smaller fragments of a specified size, k. The k-mers are then used as nodes in the graph assembly. Nodes that overlap by some amount (generally, k-1) are then connect by an edge. The assembler will then construct sequences based on the De Bruijn graph. De Bruijn graph assemblers typically perform better on larger read sets than greedy algorithm assemblers (especially when they contain repeat regions).

Technologies Author Presented /

Different assemblers are designed for different type of read technologies. Reads from second generation technologies (called short read technologies) like Illumina are typically short (with lengths of the order of 50-200 base pairs) and have error rates of around 0.5-2%, with the errors chiefly being substitution errors. However, reads from third generation technologies like PacBio and fourth generation technologies like Oxford Nanopore (called long read technologies) are longer with read lengths typically in the thousands or tens of thousands and have much higher error rates of around 10-20% with errors being chiefly insertions and deletions. This necessitates different algorithms for assembly from short and long read technologies.

There are numerous programs for de novo sequence assembly and many have been compared in the Assemblathon. The Assemblathon is a periodic, collaborative effort to test and improve the numerous assemblers available. Thus far, two assemblathons have been completed (2011 and 2013) and a third is in progress (as of April 2017). Teams of researchers from across the world choose a program and assemble simulated genomes (Assemblathon 1) and the genomes of model organisms whose that have been previously assembled and annotated (Assemblathon 2). The assemblies are then compared and evaluated using numerous metrics.

### Assemblathon 1 Edit

Assemblathon 1 [22] was conducted in 2011 and featured 59 assemblies from 17 different groups and the organizers. The goal of this Assembalthon was to most accurately and completely assemble a genome that consisted of two haplotypes (each with three chromosomes of 76.3, 18.5, and 17.7 Mb, respectively) that was generated using Evolver. Numerous metrics were used to assess the assemblies, including: NG50 (point at which 50% of the total genome size is reached when scaffold lengths are summed from the longest to the shortest), LG50 (number of scaffolds that are greater than, or equal to, the N50 length), genome coverage, and substitution error rate.

- Software compared: ABySS, Phusion2, phrap, Velvet, SOAPdenovo, PRICE, ALLPATHS-LG
- N50 analysis: assemblies by the Plant Genome Assembly Group (using the assembler Meraculous) and ALLPATHS, Broad Institute, USA (using ALLPATHS-LG) performed the best in this category, by an order of magnitude over other groups. These assemblies scored an N50 of >8,000,000 bases.
- Coverage of genome by assembly: for this metric, BGI's assembly via SOAPdenovo performed best, with 98.8% of the total genome being covered. All assemblers performed relatively well in this category, with all but three groups having coverage of 90% and higher, and the lowest total coverage being 78.5% (Dept. of Comp. Sci., University of Chicago, USA via Kiki).
- Substitution errors: the assembly with the lowest substitution error rate was submitted by the Wellcome Trust Sanger Institute, UK team using the software SGA.
- Overall: No one assembler performed significantly better in others in all categories. While some assemblers excelled in one category, they did not in others, suggesting that there is still much room for improvement in assembler software quality.

### Assemblathon 2 Edit

Assemblathon 2 [23] improved on Assemblathon 1 by incorporating the genomes of multiples vertebrates (a bird (*Melopsittacus undulatus*), a fish (*Maylandia zebra*), and a snake (*Boa constrictor constrictor*)) with genomes estimated to be 1.2, 1.0, and 1.6Gbp in length) and assessment by over 100 metrics. Each team was given four months to assemble their genome from Next-Generation Sequence (NGS) data, including Illumina and Roche 454 sequence data.

- Software compared: ABySS, ALLPATHS-LG, PRICE, Ray, and SOAPdenovo
- N50 analysis: for the assembly of the bird genome, the Baylor College of Medicine Human Genome Sequencing Center and ALLPATHS teams had the highest NG50s, at over 16,000,000 and over 14,000,000 bp, respectively.
- Presence of core genes: Most assemblies performed well in this category (

## Additional marking structure for graph traversal

Many NGS applications, e.g. *de novo* assembly of genomes [15] and transcriptomes [2], and *de novo* variant detection [6], rely on (i) simplifying and (ii) traversing the de Bruijn graph. However, the graph as represented in the previous section neither supports (i) simplifications (as it is immutable) nor (ii) traversals (as the Bloom filter cannot store an additional visited bit per node). To address the former issue, we argue that the simplification step can be avoided by designing a slightly more complex traversal procedure [16].

We introduce a novel, lightweight mechanism to record which portions of the graph have already been visited. The idea behind this mechanism is that not every node needs to be marked. Specifically, nodes that are inside simple paths (i.e nodes having an in-degree of 1 and an out-degree of 1) will either be all marked or all unmarked. We will refer to nodes having their in-degree or out-degree different to 1 as *complex* nodes. We propose to store marking information of complex nodes, by explicitly storing complex nodes in a separate hash table. In de Bruijn graphs of genomes, the complete set of nodes dwarfs the set of complex nodes, however the ratio depends on the genome complexity [17]. The memory usage of the marking structure is *n*_{c}*C*, where *n*_{c} is the number of complex nodes in the graph and *C* is the memory usage of each entry in the hash table (*C*≈2*k*+8).

## Byte Size Biology

I recently applied for a Moore Foundation grant in Data Science for the biological sciences. As part of the pre-application, I was asked to choose the top 5 works in data science in my field. Not so sure about data science, so I picked what I think are the most influential works in Bioinformatics, which is what my proposal was about. Anyhow, the choice was tough, and I came up with the following. The order in which I list the works is chronological, as I make no attempt to rank them. If you ask me in the comments “How could you choose X over Y?” my reply would probably be: “I didn’t”.

**Dayhoff , M.O., Eck RV, and Eck CM. 1972. A model of evolutionary change in proteins. Pp. 89-99 in** **Atlas of protein sequence and structure, vol. 5, National Biomedical Research Foundation,** **Washington D.C**

Summary: this is the introduction of the PAM matrix, the paper that set the stage for our understanding of molecular evolution at the protein level, sequence alignment, and the BLASTing we all do. The question the asked: how can we quantify the changes between protein sequences? How can we develop a system that tells us, over time, the way proteins evolve? Dayhoff developed an elegant statistical method do so, which she named PAM, “Accepted Point Mutations”. She aligned hundreds of proteins and derived the frequency with which the different amino acids substitute each other. Dayhoff introduced a more robust version [PDF] in 1978, once the number of proteins she could use was enlarged for her to count a large number of substitutions.

**Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.**

BLAST, Basic Local Alignment Search Tool is the go-to computational workhorse in molecular biology. It is the most cited paper in life sciences, so probably the most influential paper in biology today. For the uninitiated: BLAST allows you to take a sequence of protein or DNA, and quickly search for similar sequences in a database containing millions. The search using one sequence takes seconds, or a few minutes at best. BLAST was actually introduced in another paper in 1990. However, the heuristics developed here allowed for the gapped alignment of sequences, and for searching for sequences which are less similar, with statistical robustness. BLAST changed everything in molecular biology, and moved biology to the data-rich sciences. If ever there was a case for giving the Nobel in Physiology or Medicine to a computational person, BLAST is it.

**Durbin R., Eddy S., Krogh A and Mitchison G Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids Cambridge University Press 1998**

The Moore Foundation solicitation asked for “works” rather than just “research papers”. If there is anything common to all bioinformatics labs, it’s this book. An overview of the basic sequence analysis methods. This books summarizes the pre-2000 foundation upon which almost all our knowledge is currently built: pairwise alignment, Markov Models, multiple sequence alignment, profiles, PSSMs, and phylogenetics.

**Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium (2000) Nature Genetics 25: 25-29**

Not a research paper, and not a book, but a “commentary”. This work popularized to the use of ontologies in bioinformatics and cemented GO as the main ontology we use.

**Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA. 2001 Aug 1498(17):9748-53**.

Sequence assembly using de-Bruijn graphs, making the assembly tractable for a large number of sequences. At the time, shotgun sequences produced by by Sanger sequencing could still be assembled in a finite time solving for a Hamiltonian path . Once next-generation sequencing data started pouring in, the use of de-Bruijn graphs and a Eulerian path became essential. For a great explanation of the methodological transition see this article in *Nature Biotechnolog*y

Yes, I know there are many deserving works not in here. When boiling down to five, the choice is almost arbitrary. If you feel offended that a work you like is not here, then I’m sorry.

## Compressed de Bruijn graph

Given a string *S* of length *n* and a natural number *k*, the de Bruijn graph of *S* contains a node for each distinct length *k* substring of *S*, called a *k*-mer. Two nodes *u* and *v* are connected by a directed edge (*u*, *v*) if *u* and *v* occur consecutively in *S*, i.e., (u = S[i..i+k-1]) and (v = S[i+1..i+k]) . Fig. 1 shows an example. Clearly, the graph contains at most *n* nodes and *n* edges. By construction, adjacent nodes will overlap by (k-1) characters, and the graph can include *multiple* edges connecting the same pair of nodes or self-loops representing overlapping repeats. For every node, except for the start node (containing the first *k* characters of *S*) and the stop node (containing the last *k* characters of *S*), the in-degree coincides with the out-degree. A de Bruijn graph can be “compressed” by merging non-branching chains of nodes into a single node with a longer string. More precisely, if node *u* is the only predecessor of node *v* and *v* is the only successor of *u* (but there may be multiple edges (*u*, *v*)), then *u* and *v* can be merged into a single node that has the predecessors of *u* and the successors of *v*. After maximally compressing the graph, every node (apart from possibly the start node) has at least two different predecessors or its single predecessor has at least two different successors and every node (apart from the stop node) has at least two different successors or its single successor has at least two different predecessors see Fig. 1. Of course, the compressed de Bruijn graph can be built from its uncompressed counterpart (a much larger graph), but this is disadvantageous because of the huge space consumption. That is why we will build it directly.

Explicit representation of the compressed de Bruijn graph from Fig. 1

Figure 2 shows how splitMEM represents the compressed de Bruijn graph *G* for (k=3) and the string (S=) ACTACGTACGTACG$. Each node corresponds to a substring (omega ) of *S* and consists of the components (*id*, *len*, *posList*, *adjList*), where *id* is a natural number that uniquely identifies the node, *len* is the length (|omega |) of (omega ) , *posList* is the list of positions at which (omega ) occurs in *S* (sorted in ascending order), and *adjList* is the list of the successors of the node (sorted in such a way that the walk through *G* that gives *S* is induced by the adjacency lists: if node *G*[*id*] is visited for the *i*-th time, then its successor is the node that can be found at position *i* in the adjacency list of *G*[*id*]).

The nodes in the compressed de Bruijn graph of a pan-genome can be categorized as follows:

A uniqueNode represents a unique substring in the pan-genome and has a single start position (i.e., *posList* contains just one element)

A repeatNode represents a substring that occurs at least twice in the pan-genome, either as a repeat in a single genome or as a segment shared by multiple genomes.

In pan-genome analysis, *S* is the concatenation of multiple genomic sequences, where the different sequences are separated by a special symbol (#) . (In theory, one could use pairwise different symbols to separate the sequences, but in practice this would blow up the alphabet.) This has the effect that (#) may be part of a repeat. In contrast to splitMEM, our algorithm treats the different occurrences of (#) as if they were different characters. Consequently, (#) will not be a part of a repeat. In our approach, each occurrence of (#) will be the end of a stop node (i.e., there is a stop node for each sequence).

According to [5], the compressed de Bruijn graph is most suitable for pan-genome analysis: “This way the complete pan-genome will be represented in a compact graphical representation such that the shared/strain-specific status of any substring is immediately identifiable, along with the context of the flanking sequences. This strategy also enables powerful topological analysis of the pan-genome not possible from a linear representation.” It has one defect though: it is not possible to search efficiently for certain nodes and then to explore the graph in the vicinity of these nodes. A user might, for example, want to search for a certain allele in the pan-genome and—if it is present—to examine the neighborhood of that allele in the graph. Here, we propose a new space-efficient representation of the compressed de Bruijn graph that adds exactly this functionality.

We store the graph in an array *G* of length *N*, where *N* is the number of nodes in the compressed de Bruijn graph. Moreover, we assign to each node a unique identifier (id in <1,dots ,N>) . A node *G*[*id*] now has the form ((len,lb,size,< suffix>\_lb)) , where

The variable *len* is the length of the string (omega = S[mathsf *id*

([lb..lb+size-1]) is the (omega ) -interval, *lb* is its left boundary, and *size* is its size

([< suffix>\_lb..< suffix>\_lb+size-1]) is the interval of the *k* length suffix of (omega )

There is one exception though: the sentinel $ and each occurrence of the separator (#) will be the end of a stop node. Clearly, the suffix $ of *S* appears at index 1 in the suffix array because $ is the smallest character in the alphabet. The suffix array interval of $ is [1..1], so we set (< suffix>\_lb= 1) . Analogously, a suffix of *S* that starts with (#) appears at an index (j in <2,dots ,d>) in the suffix array (where (d) is the number of sequences in *S*) because (#) is the second smallest character in the alphabet, so we set (< suffix>\_lb= j) .

Implicit representation of the compressed de Bruijn graph from Fig. 1

Figure 3 shows an example. Henceforth this representation will be called implicit representation, while the representation from Fig. 2 will be called explicit representation. It is clear that in the implicit representation the list of all positions at which (omega ) occurs in *S* can be computed as follows: ([mathsf

## What is the another use of de Bruijn graphs in bioinformatics except DNA assembly? - Biology

Bioinformatics, 32, 2016, i201–i208 doi: 10.1093/bioinformatics/btw279 ISMB 2016

Compacting de Bruijn graphs from sequencing data quickly and in low memory Rayan Chikhi1,*, Antoine Limasset2 and Paul Medvedev3,4,5 1

CNRS, CRIStAL, Lille, France, 2ENS Cachan Brittany, Bruz, France, 3Department of Computer Science and Engineering, The Pennsylvania State University, USA, 4Department of Biochemistry and Molecular Biology, The Pennsylvania State University, USA and 5Genome Sciences Institute of the Huck, The Pennsylvania State University, USA *To whom correspondence should be addressed.

Abstract Motivation: As the quantity of data per sequencing experiment increases, the challenges of fragment assembly are becoming increasingly computational. The de Bruijn graph is a widely used data structure in fragment assembly algorithms, used to represent the information from a set of reads. Compaction is an important data reduction step in most de Bruijn graph based algorithms where long simple paths are compacted into single vertices. Compaction has recently become the bottleneck in assembly pipelines, and improving its running time and memory usage is an important problem. Results: We present an algorithm and a tool BCALM 2 for the compaction of de Bruijn graphs. BCALM 2 is a parallel algorithm that distributes the input based on a minimizer hashing technique, allowing for good balance of memory usage throughout its execution. For human sequencing data, BCALM 2 reduces the computational burden of compacting the de Bruijn graph to roughly an hour and 3 GB of memory. We also applied BCALM 2 to the 22 Gbp loblolly pine and 20 Gbp white spruce sequencing datasets. Compacted graphs were constructed from raw reads in less than 2 days and 40 GB of memory on a single machine. Hence, BCALM 2 is at least an order of magnitude more efficient than other available methods. Availability and Implementation: Source code of BCALM 2 is freely available at: https://github.com/ GATB/bcalm Contact: [email protected]

1 Introduction Modern sequencing technology can generate billions of reads from a sample, whether it is RNA, genomic DNA, or a metagenome. In some applications, a reference genome can allow for the mapping of these reads however, in many others, the goal is to reconstruct long contigs. This problem is known as fragment assembly and continues to be one of the most important challenges in bioinformatics. Fragment assembly is the central algorithmic component behind the assembly of novel genomes, detection of gene transcripts (RNA-seq) (Grabherr et al., 2011), species discovery from metagenomes, structural variant calling (Iqbal et al., 2012). Continued improvement to sequencing technologies and increases to the quantity of data produced per experiment present a serious challenge to fragment assembly algorithms. For instance, while there exist many genome assemblers that can assemble bacterial sized genomes, the number of assemblers that can assemble a high-quality mammalian genome is limited, with most of them developed by large teams and requiring extensive resources (Gnerre C The Author 2016. Published by Oxford University Press. V

et al., 2011 Luo et al., 2012 Simpson et al., 2009). For even larger genomes, such as the 20 Gbp Picea glauca (white spruce), graph construction and compaction took 4.3 TB of memory, 38 h and 1380 CPU cores (Birol et al., 2013). In another instance, the whole genome assembly of 22 Gbp Pinus taeda (loblolly pine) required 800 GB of memory and three months of running time on a single machine (Zimin et al., 2014). Most short-read fragment assembly algorithms use the de Bruijn graph to represent the information from a set of reads. Given a set of reads R, every distinct k-mer in R forms a vertex of the graph, while an edge connects two k-mers if they overlap by k – 1 characters. The use of the de Bruijn graph in fragment assembly consists of a multi-step pipeline, however, the most data intensive steps are usually the first three: nodes enumeration, compaction and graph cleaning. In the first step (sometimes called k-mer counting), the set of distinct k-mers is extracted from the reads. In the second step, all unitigs (paths with all but the first vertex having in-degree 1 and all but the last vertex having out-degree 1) are compacted into a single i201

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

i202 vertex. In the third step, artifacts due to sequencing errors and polymorphism are removed from the graph. The second and third step are sometimes alternated to further compact the graph. After these initial steps, the size of the data is reduced gradually, e.g. for a human dataset with 45 coverage, To overcome the scalability challenges of fragment assembly of large sequencing datasets, there has been a focus on improving the resource utilization of de Bruijn graph construction. In particular, k-mer counting has seen orders of magnitude improvements in memory usage and speed. As a result, graph compaction is becoming the new bottleneck but, it has received little attention (Kundeti et al., 2010). Recently, we developed a compaction tool that uses low memory, but without an improvement in time (Chikhi et al., 2014). Other parallel approaches for compaction have been proposed, as part of genome assemblers. However, most are only implemented within the context of a specific assembler, and cannot be used as modules for the construction of other fragment assemblers or for other applications of de Bruijn graphs (e.g. metagenomics). In this paper, we present a fast and low memory algorithm for graph compaction. Our algorithm consists of three stages: careful distribution of input k-mers into buckets, parallel compaction of the buckets, and a parallel reunification step to glue together the compacted strings into unitigs. The algorithm builds upon the use of minimizers to partition the graph (Chikhi et al., 2014) however, the partitioning strategy is completely novel since the strategy of Chikhi et al. (2014) does not lend itself to parallelization. Due to the algorithm’s complexity, we formally prove its correctness. We then evaluate it on whole-genome human, pine and spruce sequencing data. The de Bruijn graph for a whole human genome dataset is compacted in roughly an hour and 3 GB of memory using 16 cores. For the >20 Gbp pine and spruce genomes, k-mer counting and graph compaction take only 2 days and 40 GB of memory, improving on previously published results by at least an order of magnitude.

2 Related work The parallelization of de Bruijn graph compaction has been previously explored. In (Jackson et al., 2010 Kundeti et al., 2010), the problem is reduced to the classic list ranking problem and solved using parallel techniques such as pointer jumping. Another recurrent MPI-based approach is to implement a distributed hash table, where the k-mers and the information about their neighborhoods are distributed amongst processes. Each processor then extends seed k-mers locally as far as possible to build sub-unitigs and then passes them off to other processors for further extension. Variants of this approach are used in (Georganas et al., 2014 Liu et al., 2011 Simpson et al., 2009). Other papers have proposed using a parallelized depth-first search (Zeng et al., 2013) or a small world asynchronous parallel model (Meng et al., 2014, 2012). Before a de Bruijn graph can be compacted, it has to be constructed. Parallel approaches currently represent the state-of-the-art in this area. Many original efforts were focused on edge-centric de Bruijn graphs, where edges are represented by ðk þ 1Þ-mers. They required the identification of both all distinct k-mers and ðk þ 1Þmers (Jackson and Aluru, 2008 Jackson et al., 2010 Kundeti et al., 2010 Lu et al., 2013 Zeng et al., 2013). More recent efforts have focused on the node-centric graph, which only requires the counting of k-mers (Deorowicz et al., 2014 Li et al., 2013 Lu et al., 2013 Marc¸ais and Kingsford, 2011 Melsted and Pritchard, 2011 Rizk et al., 2013 Simpson et al., 2009).

R.Chikhi et al. In genome assembly, the construction and compaction of a de Bruijn graph form only the initial stages. There are also alternate approaches that do not use the de Bruijn graph at all (e.g. greedy or string graph). Numerous parallel assemblers are available for use, including ABySS (Simpson et al., 2009), SOAPdenovo (Luo et al., 2012), Ray (Boisvert et al., 2010), PASQUAL (Liu et al., 2013), PASHA (Liu et al., 2011), SAND (Moretti et al., 2012), SWAPAssembler (Meng et al., 2014). Other methods for parallel assembly have been published but without publicly available software (Duan et al., 2014 Garg et al., 2013 Georganas et al., 2015 Jackson et al., 2010). There has also been work done in reducing the overall memory footprint de Bruijn graph assembly. This challenge is most pronounced for k-mer counters. However, when scaling to mammaliansized genomes, memory usage continues to be an issue in downstream steps such as compaction. Chikhi et al. (2014) used minimizers to compact the de Bruijn graph of a human whole-genome dataset in under 50 MB of memory, but the algorithm did not improve the running time. Wu et al. (2012) propose an approach based on dividing the assembly problem into mutually independent instances. Ye et al. (2012) exploit the notion of graph sparseness for reducing memory use. Kleftogiannis et al. (2013) perform a comparative analysis and propose several memory-reducing strategies. Chikhi and Rizk (2012) use Bloom filters to reduce memory usage. Movahedi et al. (2012) propose a divide-and-conquer approach for compacting a de Bruijn graph.

3 Definitions We assume, for the purposes of this paper, that all strings are over the alphabet R ¼ fA C G Tg. A string of length k is called a k-mer. For a string s, we define its k-spectrum, spk ðsÞ, as the multi-set of all k-mer substrings of s. For a set of strings S, we define its multi-set k-spectrum as spk ðSÞ ¼ [s2S spk ðsÞ. For two strings u and v, we write u 2 v to mean that u is a substring of v. We write u½i::j to denote the substring of u from the ith to the jth character, inclusive. We define sufk ðuÞ ¼ u½juj k þ 1 juj and prek ðuÞ ¼ u½1::k. For two strings u and v such that sufk ðuÞ ¼ prek ðuÞ, we define a glue operation as uk v ¼ u v½k þ 1::jvj. The binary relation u ! v between two strings denotes that sufk1 ðuÞ ¼ prek1 ðvÞ. For a set of k-mers K, the de Bruijn graph of K is a directed graph such that the nodes are exactly the k-mers in K and the edges are given by the ! relation. Note that our definition of the de Bruijn graph is node-centric, where the edges are implicit given the vertices therefore, we use the terms de Bruijn graph and a set of k-mers interchangeably. Suppose we are given a de Bruijn graph, represented by a set of k-mers K. Consider a path p ¼ ðx1 . . . xm Þ over m 1 vertices. We allow the path to be a cycle, i.e. it is possible that x1 ¼ xm . The endpoints of a path are x1 and xm if it is not a cycle. A single-vertex path has one endpoint. A cycle does not have endpoints. The internal vertices of a path are vertices that are not endpoints. p is said to be a unitig if either jpj ¼ 1 or for all 1 ‘ and a string x with at least k characters, we define lmmðxÞ as the ‘-minimizer of the prefix ðk 1Þ-mer, and rmmðxÞ as the ‘-minimizer of the suffix ðk 1Þ-mer. We refer to these as the left and right minimizers of x, respectively. Two strings (u, v) are m-compactable in S if they are compactable in S and if m ¼ rmmðuÞ ¼ lmmðvÞ. The m-compaction of a set S is obtained from S by applying the compaction operation as much as possible in any order to all pairs of strings that are m-compactable in S.

4 Algorithm overview In this section, we give a high-level description of our BCALM 2 algorithm (Algorithm 1), leaving important optimizations and implementation details to Section 6. Recall that the input is a set of k-mers K and the output are the strings corresponding to all the maximal unitigs of the de Bruijn graph of K. If time and memory are not an issue, then there is a trivial algorithm: repeatedly find compactable strings and compact them until no further compactions are possible. However, such an approach requires loading all the data into memory, which is not feasible for larger genomes. Instead, BCALM 2 proceeds in three stages. In the first stage, the k-mers are distributed into buckets, with some k-mers being thrown into two buckets. In the second stage, each bucket is compacted, separately. In the third stage, the k-mers that were thrown into two buckets are glued back together so that duplicates are removed. Figure 1 shows the execution of BCALM 2 on a small example.

Input: the set of k-mers K. 1: for all parallel x 2 K do 2: Write x to FðlmmðxÞÞ. 3: if lmmðxÞ 6¼ rmmðxÞ then 4: Write x to FðrmmðxÞÞ. 5: for all parallel i 2 f1 . . . 4‘ g do 6: Run CompactBucket(i) 7: ReuniteðÞ

In the second stage of the algorithm, we process each bucket file using the CompactBucket procedure (Algorithm 2). After the k-mer distribution of the first stage, the bucket file F(i) contains all the k-mers whose left or right minimizer is i. We can therefore load F(i) into memory and perform i-compaction on it. Since the size of the bucket is small, this compaction can be performed using a simple inmemory algorithm. The resulting strings are then written to disk, and will be processed during the third stage. At the end of the second stage, when all CompactBucket procedures are finished, we have performed all the necessary compactions on the data. At this stage of the algorithm, notice that the k-mers x 2 K with lmmðxÞ 6¼ rmmðxÞ exist in two copies. We call such k-mers doubled. We will prove in Section 5 that these k-mers are always at the ends (prefix or suffix) of the compacted strings, never internal, and they can be recognized by the fact that the minimizer at that end does not correspond to the bucket where it resides. We record these ends that have doubled k-mers by marking them ‘lonely’ (lines 4 and 5 of Algorithm 2), since they will need to be ‘reunited’ at the third stage of the algorithm. Strings that have no lonely ends are maximal unitigs, therefore they are output (line 8).

Algorithm 3. Reunite() Input: the set of strings R from the Reunite file. 1: UF Union find data structure whose elements are the distinct k-mer extremities in R. 2: for all parallel u 2 R do 3: if both ends of u are lonely then 4: UF:unionðsufk ðuÞ prek ðuÞÞ 5: for all parallel classes C of UF do 6: P all u 2 R that have a lonely extremity in C 7: while 9u 2 P that does not have a lonely prefix do 8: Remove u from P 9: Let s ¼ u 10: while 9 v 2 P such that sufk ðsÞ ¼ prek ðvÞ do 11: s Glueðs vÞ 12: Remove v from P 13: Output s

Algorithm 4. Glue(u, v) In the first stage (lines 1–6 of Algorithm 1), BCALM 2 distributes the k-mers of K to files Fð1Þ . . . Fð4‘ Þ. These are called bucket files. Each k-mer x 2 K goes into file FðlmmðxÞÞ, and if lmmðxÞ 6¼ rmmðxÞ, also in FðrmmðxÞÞ. The parameter ‘ controls the minimizer size (in our implementation, we set ‘ ¼ 8). Algorithm 2. CompactBucket(i) 1: Load F(i) into memory. 2: U i-compaction of F(i). 3: for all strings u 2 U do 4: Mark u’s prefix as “lonely” if i 6¼ lmmðuÞ. 5: Mark u’s suffix as “lonely” if i 6¼ rmmðuÞ. 6: if u’s prefix and suffix are not lonely then 7: Output u. 8: else 9: Place u in the Reunite file

Input: strings u and v, such that sufk ðuÞ ¼ prek ðvÞ. 1: Let w ¼ uk v. 2: Set lonely prefix bit of w to be the lonely prefix bit of u. 3: Set lonely suffix bit of w to be the lonely suffix bit of v. 4: return w

At the third stage of the algorithm, we process the strings output by CompactBucket with the Reunite procedure (Algorithm 3). At a high level, the purpose of Reunite is to process each string u that has a lonely end, and find a corresponding string v that has a matching lonely end with the same k-mer. When one is found, then u and v are glued together (Algorithm 4), thereby ‘reuniting’ the doubled k-mer that was split in the k-mer distribution stage. The new string inherits its end lonely marks from the glued strings, and the process is then repeated for the next string u that has a lonely end. After Reunite() completes, all duplicate k-mers will have been removed, and the strings in the output will correspond to the maximal unitigs.

Fig. 1. Execution of BCALM 2 on a small example, with k ¼ 4 and ‘ ¼ 2. On the top left, we show the input de Bruijn graph. The maximal unitigs correspond to the path from CCCT to TCTA (spelling CCCTCTA), and to the k-mers CCCC, CCCA, CTAC, CTAA. In this example, minimizers are defined using a lexicographic ordering of ‘-mers. In the top right, we show the contents of the bucket files. Only five of the bucket files are non-empty, corresponding to the minimizers CC, CT, AA, AC and CA. The doubled k-mers are italicized. Below that, we show the set of strings that each i-compaction generates. For example in the bucket CC, the k-mers CCCT and CCTC are compacted into CCCTC, however CCCC and CCCT are not compactable because CCCA is another out-neighbor of CCCC. The lonely ends are denoted by •. In the bottom half we show the execution steps of the Reunite algorithm. Nodes in bold are output

To perform these operations efficiently in time and memory, Reunite first partitions the strings of R so that any two strings that need to be reunited are guaranteed to be in the partition. Then, each partition can be processed independently. To achieve the partition, we use a union-find (UF) data structure of all k-mers extremities. Recall that a UF data structure is created by first assigning a set to each distinct element (here, an element is the k-mer extremity of a string). Then, the union operation replaces the sets of two elements by a single set corresponding to their union. Here, union is applied to both k-mer extremities of a string. After the UF is constructed, the set of strings to be reunited is partitioned such that k-mer extremities of sequences in a partition all belong to the same UF set.

besides at t½p k þ 1 p. Since there are no duplicate k-mers in T, this is a contradiction. Now, we show that S T. Let s 2 S and let t 2 T such that s 2 t. By applying an argument symmetrical to the one above, there exists a s0 2 S such that t 2 s0 . This means that s 2 s0 , and, in particular, s½1::k 2 s0 . Since k-mers can only appear once in S, we must have that s ¼ s0 and hence s ¼ t 2 T. h Next, we characterize the k and k þ 1 spectrums of U. Given a multi-set M, we denote by Set(M) as the set version of M, with all multiplicity information implicitly removed. When referring to a set, such as K, as a multi-set, we will mean that all the elements have multiplicity one. LEMMA 2. spk ðUÞ ¼ K

5 Proof of correctness Recall that K is the input to the algorithm and let U be the strings corresponding to the set of all maximal unitigs of K. We will assume for our proof that U does not contain any circular unitigs. We note that since BCALM 2 outputs strings, it cannot represent circular unitigs in its output. Circular unitigs present a corner case for both the analysis and the algorithm itself, and, for the sake of presentation brevity, we do not consider them here. We prove the correctness of BCALM 2 by showing that it outputs U. We first give a Lemma that will allow us to show that the output is U by arguing about its k and k þ 1 spectrums. LEMMA 1. Let S and T be two sets of strings of length at least k such that spkþ1 ðSÞ ¼ spkþ1 ðTÞ and spk ðSÞ ¼ spk ðTÞ and all these spectrums are without duplicates. Then, S ¼ T. PROOF. We will prove that S T. The same argument will be symmetrically applicable to prove T S, which will imply S ¼ T. First, we show that for all s 2 S, there exists a t 2 T such that s 2 t. Let s 2 S and let p ¼ maxfi : 9t 2 T s½1::i 2 tg, and let t be a string achieving the max. Note that p k þ 1 since every ðk þ 1Þmer of S is also in T. Suppose for the sake of contradiction that p 362KB Sizes 0 Downloads 5 Views

## Introduction

The de Bruijn graph is a data structure first brought to bioinformatics as a method to assemble genomes from the experimental data generated by sequencing by hybridization [1]. It later became the key algorithmic technique in genome assembly [2, 3] that resulted in dozens of software tools [4–12]. In addition, the de Bruijn graphs have been used for repeat classification [13], de novo protein sequencing [14], synteny block construction [15, 16], multiple sequence alignment [17], and other applications in genomics and proteomics.

The breakpoint graph is a data structure introduced to study the reversal distance [18], which has formed the basis for much algorithmic research on rearrangements over the last two decades [19].

Since the connections between the breakpoint graphs and the de Bruijn graphs was never explicitly described, researchers studying genome rearrangements often do not realize that breakpoint graphs are merely de Bruijn graphs in disguise. As a result, they often do not know how to move from the traditional breakpoint graphs on *synteny blocks* to the breakpoint graphs on genomes (with "single nucleotide" resolution), particularly in the case of double-stranded *genomes* with inverted repeats. Likewise, researchers working in genome assembly are often unaware about the connections between the de Bruijn graphs and the breakpoint graphs. As a result, the exchange of ideas between these two communities has been limited. For example, Iqbal et al. [20] recently introduced the notion of the *colored de Bruijn graphs* that resulted in a popular Cortex assembler. While the notion of the colored de Bruijn graphs is essentially identical to the notion of the breakpoint graph, authors of [20] are probably unaware about this connection since they provided no references to previous genome rearrangement studies. This is unfortunate since various results about the breakpoint graphs (e.g., the connection between rearrangements and alternating cycles) remained beyond the scope of this very useful study.

Recently, genome rearrangement studies moved from the level of synteny blocks to the level of single nucleotides [21]. Likewise, genome assembly experts recently moved towards the analysis of structural variations and comparative assembly of related species based on the analysis of the de Bruijn graphs [20]. We thus argue that the time has come to explain that the breakpoint graphs and the de Bruijn graphs are two identical data structures (if one ignores a cosmetic difference between them) as they both represent specific instances of a general notion of the *A-Bruijn graph* introduced in [13]. The A-Bruijn graphs are based on representing genomes as sets of labeled paths and further gluing identically labeled edges (breakpoint graphs) or vertices (de Bruijn graphs) in the resulting paths.

We argue that a unified framework covering both breakpoint and de Bruijn graphs is important to bridge the divide between researchers working with breakpoint graphs (that usually focus in rearrangements and ignore repeats) and researchers working with de Bruijn graphs (that usually focus on repeats and ignore rearrangements). In reality, there exists a complex interplay between rearrangements and repeats, e.g., LINE repeats and segmental duplications often trigger rearrangements [22–24]. However, this interplay is not explicitly revealed by the breakpoint graphs since they do not even encode repeats (repeats are intentionally masked at the synteny block construction step). For example, the interplay between LINEs and rearrangements cannot be derived from the breakpoint graph alone forcing Zhao and Bourque [23] to perform additional analysis. Our goal is to introduce the graphs that encode both rearrangements and repeats and immediately reveal this interplay. For example, encoding repeats present in the breakpoint regions (that may potentially trigger rearrangements) leads to gluing alternating cycles in the breakpoint graphs and requires development of new algorithms that integrate rearrangements and repeats. In such graphs, the classical *non-self-intersecting alternating cycles* formed by edges alternating between two colors (the workhorse of genome rearrangement studies) may turn into *self-intersecting* cycles formed by edges alternating between 3 colors, where the third color corresponds to repeated elements (see Figure 1). Nurk and Pevzner [25] recently used this framework to develop a new comparative genome analysis tool SPArcle and applied it to analyzing multiple bacterial strains resulting from the "controlled evolution" experiments [26]. SPArcle is based on SPAdes assembler and, in difference from Cortex, it uses ideas from the previous genome rearrangement studies (e.g., alternating cycles) to analyze the resulting A-Bruijn graphs.

**Genomes** *G*_{A}**=** *S*_{1}**,** *S*_{2}**,** *S*_{3} **and** *G*_{B}**=** *S*_{1}**, −** *S*_{2}**,** *S*_{3} **(represented as bicolored paths) differ from each other by a single reversal of segment** *S*_{2}. The breakpoint graph *BP*(*G* _{A}*, G* _{B}) and the alternating cycle constructed from genomes *G* _{A} and *G* _{B} (left: no repeats at the breakpoint regions right, a pair of repeats colored in green at the breakpoint regions).

Genome rearrangement studies usually start from constructing a set of *synteny blocks* shared by two genomes (see Figure 2). Each genome is defined as a sequence of synteny blocks separated by *breakpoint regions* and is represented as a path formed by alternating colored and black edges, where synteny blocks correspond to directed and labeled black edges and breakpoint regions correspond to undirected colored edges. Figure 3(a) presents paths corresponding to 11 synteny blocks shared by Human and Mouse × chromosomes. Each synteny block *S* _{i} is represented as an directed black edge ( S i t , S i h ) , where S i t and S i h refer to the endpoints of the synteny blocks representing its tail and head, respectively. Two consecutive synteny blocks are separated by a breakpoint region in the Human (Mouse) × chromosome that is modeled by a red (blue) edge connecting the corresponding endpoints of these synteny blocks. The (traditional) breakpoint graph of Human and Mouse × chromosomes is obtained by "gluing" identically labeled black edges in these two paths as shown in Figure 3.

**The 11 synteny blocks shared by Human and Mouse × chromosomes (adapted from Pevzner and Tesler** [33]**)**.

**Publisher’s note:** Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

### Supplementary Figure 1 A comparison of the Flye and HINGE assembly graphs on bacterial genomes from the BACTERIA dataset.

(Left) The Flye and Hinge assembly graphs of the KP9657 dataset. There is a single unique edge entering into (and exiting) the unresolved “yellow” repeat and connecting it to the rest of the graph. Thus, this repeat can be resolved if one excludes the possibility that it is shared between a chromosome and a plasmid. In contrast to HINGE, Flye does not rule out this possibility and classifies the yellow repeat as unresolved. (Right) The Flye and Hinge assembly graphs of the EC10864 dataset show a mosaic repeat of multiplicity four formed by yellow, blue, red and green edges (the two copies of each edge represent complementary strands). HINGE reports a complete assembly into a single chromosome.

### Supplementary Figure 2 The assembly graph of the YEAST-ONT dataset.

Edges that were classified as repetitive by Flye are shown in color, while unique edges are black. Flye assembled the YEAST-ONT dataset into a graph with 21 unique and 34 repeat edges and generated 21 contigs as unambiguous paths in the assembly graph. A path *v*_{1}*, …v*_{i}*, v*_{i+1} …*v*_{n} in the graph is called unambiguous if there exists a single incoming edge into each vertex of this path before *v*_{i+1} and a single outgoing edge from each vertex after *v*_{i}. Each unique contig is formed by a single unique edge and possibly multiple repeat edges, while repetitive contigs consist of the repetitive edges which were not covered by the unique contigs. The visualization was generated using the graphviz tool (http://graphviz.org).

### Supplementary Figure 3 The assembly graph of the WORM dataset.

Edges that were classified as repetitive by Flye are shown in color, while unique edges are black. Flye assembled the WORM dataset into a graph with 127 unique and 61 repeat edges and generated 127 contigs as unambiguous paths in the assembly graph. The visualization was generated using the graphviz tool (http://graphviz.org).

### Supplementary Figure 4 Dot plots showing the alignment of reads against the Flye assembly, the Miniasm assembly and the reference C. elegans genome.

(**a**) The reference genome contains a tandem repeat of length 1.9 kb (10 copies) on chromosome X with the repeated unit having length ≈190 nucleotides. In contrast, the Flye and Miniasm assemblies of this region suggest a tandem repeat of length 5.5 kb (27 copies) and 2.8 kb (13 copies), respectively. 15 reads that span over the tandem repeat support the Flye assembly (the mean length between the flanking unique sequence matches the repeat length reconstructed by Flye) and suggests that the Flye length estimate is more accurate. (**b**) The reference genome contains a tandem repeat of length 2 kb on chromosome 1. In contrast, the Flye and Miniasm assemblies of this region suggest a tandem repeat of length 10 kb and 5.6 kb, respectively. A single read that spans over the tandem repeat supports the Flye assembly. Since the mean read length in the WORM dataset is 11 kb, it is expected to have a single read spanning a given 10.0 kb region but many more reads spanning any 5.6 kb region (as implied by the Miniasm assembly) or 2.0 kb region (as implied by the reference genome). Six out of 23 reads cross the “left” border (two out of 18 reads cross the “right” border) of this tandem repeat by more than 5.6 kb, thus contradicting the length estimate given by Miniasm and suggesting that the Flye length estimate is more accurate. (**c**) The reference genome contains a tandem repeat of length 3 kb on chromosome X. In contrast, the Flye and Miniasm assemblies of this region suggest a tandem repeat of lengths 13.6 kb and 8 kb, respectively. A single read that spans over the tandem repeat reveals the repeat cluster to be of length 12.2k, which suggests that the Flye length estimate is more accurate. (**d**) The reference genome contains a tandem repeat of length 1.5 kb on chromosome 1. In contrast, the Flye and Miniasm assemblies of this region suggest tandem repeats of length 17 kb and 4.3 kb, respectively. One read that spans over the tandem repeat reveals the repeat cluster to be of length 18.0 kb, which suggests that the Flye length estimate is more accurate.