How is output of DNA assembler measured?

I developed my own DNA assembly pipeline. The input is a set of reads and the output is a set of contigs. Many papers evaluate their own algorithms and compare them against each other. There are basic metrics like:

  • N50, N70, …,
  • number of contigs,
  • genome coverage,
  • error rate.

I downloaded QUAST, where I can see metrics such as N50 and L50, which are easy to compute myself. But how can I measure the error rate of my output contigs? Is there a tool that can do it? I have searched many papers but cannot find how they do it. I would like to compare my pipeline with other assemblers in some general way. I found only hints that it should be done by aligning contigs to reference genomes, but I do not know how, or whether it is the right way.
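For readers unfamiliar with the contiguity metrics mentioned in the question, here is a minimal Python sketch of how N50/L50 (and, more generally, Nx/Lx) are usually defined. The function name `nx_lx` and the example lengths are made up for illustration; this is not QUAST's code.

```python
def nx_lx(contig_lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of contig lengths.

    Nx is the length of the contig at which the running sum of lengths,
    taken from longest to shortest, first reaches `fraction` of the
    total assembly size. Lx is the number of contigs needed to get there.
    """
    lengths = sorted(contig_lengths, reverse=True)
    threshold = fraction * sum(lengths)
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("contig_lengths must be non-empty")


# Illustrative contig lengths (total assembly size 300).
lengths = [80, 70, 50, 40, 30, 20, 10]
print(nx_lx(lengths))        # N50 and L50
print(nx_lx(lengths, 0.7))   # N70 and L70
```

Note that these values depend only on the contig lengths, so they say nothing about correctness; that is why an error-rate measure against a reference is needed as well.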


If you are interested in the topic of how to compare assemblers, have a look at the Assemblathon. That group has published two papers and is working on a third about comparing genome assembly algorithms.
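On the error-rate part of the question: when a reference genome is supplied, QUAST aligns the contigs to it and reports mismatches and indels per 100 kbp. The Python sketch below shows only the counting step, assuming the contigs have already been aligned by some external aligner and that each alignment is available as two gapped rows of equal length. The function name and the input layout are assumptions made for illustration, not QUAST's implementation.

```python
def error_rates(aligned_pairs):
    """Count mismatches and indels over gapped pairwise alignments.

    `aligned_pairs` is an iterable of (contig_row, reference_row) strings
    of equal length, with '-' marking a gap. Rates are reported per
    100 kbp of aligned reference bases.
    """
    mismatches = indels = ref_bases = 0
    for contig_row, ref_row in aligned_pairs:
        if len(contig_row) != len(ref_row):
            raise ValueError("alignment rows must have equal length")
        for c, r in zip(contig_row.upper(), ref_row.upper()):
            if r != "-":
                ref_bases += 1
            if c == "-" or r == "-":
                indels += 1            # each gap column counts as one indel base
            elif c != r and "N" not in (c, r):
                mismatches += 1        # ambiguous N bases are not counted
    if ref_bases == 0:
        raise ValueError("no aligned reference bases")
    scale = 100_000 / ref_bases
    return {
        "mismatches_per_100kbp": mismatches * scale,
        "indels_per_100kbp": indels * scale,
        "aligned_ref_bases": ref_bases,
    }


# Toy alignment: one substitution and one missing base in the contig.
print(error_rates([("ACGTTACG-A", "ACGATACGGA")]))
```

Structural errors (misassemblies such as relocations or inversions) are a separate category and cannot be derived from base-level counts like these.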


Comparison of DNA Assembly Reaction Types

NEBuilder HiFi DNA Assembly offers error-free assembly that can be used for a wide range of reaction types. Furthermore, there are no licensing fee requirements from NEB for NEBuilder HiFi DNA Assembly products. See how it compares to NEB Gibson Assembly® and In-Fusion® HD.

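| Assembly reaction type | NEBuilder HiFi DNA Assembly: assembly efficiency | NEBuilder HiFi DNA Assembly: covalently sealed?* | NEB Gibson Assembly: assembly efficiency | NEB Gibson Assembly: covalently sealed?* | In-Fusion HD: assembly efficiency | In-Fusion HD: covalently sealed?* |
|---|---|---|---|---|---|---|
| 2-fragment assembly: no mismatch | +++ | Yes | ++ | Yes | ++ | No |
| 2-fragment assembly: 3´- & 5´-end mismatch | +++ | Yes | ++ | Yes | X | No |
| 4-fragment assembly: 15-bp overlap & no mismatch | +++ | Yes | ++ | Yes | ++ | No |
| 4-fragment assembly: 25-bp overlap & no mismatch | +++ | Yes | ++ | Yes | ++ | No |
| Oligo assembly: 3´- and 5´-overhang | +++ | Yes | ++ | Yes | X | No |
| Oligo assembly: blunt end & no mismatch | +++ | Yes | ++ | Yes | X | No |
| Oligo assembly: ssOligo & vector | +++ | Yes | NP | Yes | X | No |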

* Assembled products are treated with T5 exonuclease followed by PCR. Only covalently sealed products resistant to T5 exonuclease digestion can serve as templates for PCR and yield PCR product.

+++ Performs best; recommended
++ Performs well but other product(s) perform better
+ Performs, but not recommended
X Does not perform
NP Experiment not performed


One or more of these products are covered by patents, trademarks and/or copyrights owned or controlled by New England Biolabs, Inc. For more information, please email us at [email protected]. The use of these products may require you to obtain additional third party intellectual property rights for certain applications.

IN-FUSION® is a registered trademark of ClonTech Laboratories, Inc.
GIBSON ASSEMBLY® is a registered trademark of Synthetic Genomics, Inc.


A Molecular View of Kinetochore Assembly and Function

Kinetochores are large protein assemblies that connect chromosomes to microtubules of the mitotic and meiotic spindles in order to distribute the replicated genome from a mother cell to its daughters. Kinetochores also control feedback mechanisms responsible for the correction of incorrect microtubule attachments, and for the coordination of chromosome attachment with cell cycle progression. Finally, kinetochores contribute to their own preservation, across generations, at the specific chromosomal loci devoted to host them, the centromeres. They achieve this in most species by exploiting an epigenetic, DNA-sequence-independent mechanism; notable exceptions are budding yeasts, where a specific sequence is associated with centromere function. In the last 15 years, extensive progress in the elucidation of the composition of the kinetochore and the identification of various physical and functional modules within its substructure has led to a much deeper molecular understanding of kinetochore organization and the origins of its functional output. Here, we provide a broad summary of this progress, focusing primarily on kinetochores of humans and budding yeast, while highlighting work from other models, and present important unresolved questions for future studies.

Keywords: CCAN; CENP-A; KMN; cell division; centromere; kinetochore; meiosis; mitosis.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

  • Kinetochore morphology in vertebrate cells. (A) Schematic showing the attachment of…
  • Schematic summary of the structural organization of budding yeast and human kinetochores. Related…
  • The CENP-A nucleosome and its specific recognition by CENP-C. (A) Comparison…
  • The NDC80 and MIS12 complexes of the KMN network. (A) The… 65 nm. Images in (B, D) courtesy of Dr. Pim Huis in ‘t Veld, Max Planck Institute of Molecular Physiology, Dortmund (Germany) [239]. (C) Model from cryo-EM studies of the Ndc80 Bonsai complex bound to the microtubule lattice. Only a single α-tubulin:β-tubulin dimer is shown, with two Ndc80 Bonsai complexes bound via the toe region. (D) Complexes of the NDC80C and MIS12C are 85 nm in length. (E) Structural organization of the MIS12 complex bound to the N-terminal region of CENP-C [241]. All structures shown are for the human complexes.
  • Orchestration of the spindle assembly checkpoint (SAC) by the KMN network. SAC components…
  • Linkages between the inner and outer kinetochore. Structures of a portion of the…
  • Images and structural model of budding yeast kinetochore particles. (A) Negative…
  • A model for the assembly unit of the human kinetochore. The kinetochore assembly…
  • Cell cycle-regulated replenishment of CENP-A nucleosomes. New CENP-A incorporation takes place after mitotic…

When one becomes two: Separating DNA for more accurate nanopore analysis

A new software tool developed by Earlham Institute researchers will help bioinformaticians improve the quality and accuracy of their biological data, and avoid mis-assemblies. The fast, lightweight, user-friendly tool visualises genome assemblies and gene alignments from the latest next generation sequencing technologies.

Called Alvis, the new visualisation tool examines mappings between DNA sequence data and reference genome databases. This allows bioinformaticians to more easily analyse their data generated from common genomics tasks and formats by producing efficient, ready-made vector images.

First author and post-doctoral scientist at the Earlham Institute Dr Samuel Martin in the Leggett Group, said: "Typically, alignment tools output plain text files containing lists of alignment data. This is great for computer parsing and for being incorporated into a pipeline, but it can be difficult to interpret by humans.

"Visualisation of alignment data can help us to understand the problem at hand. As a new technology, several new alignment formats have been implemented by new tools that are specific to nanopore sequencing technology.

"We found that existing visualisation tools were not able to interpret these formats. Alvis can be used with all common alignment formats, and is easily extensible for future ones."

A key feature of the new command line tool is its unique ability to automatically highlight chimeric sequences -- weak links in the DNA chain. This is where two sequences -- from different parts of a genome or different species -- are linked together by mistake to make one, affecting the data's accuracy.

Chimeric sequences can be problematic for bioinformaticians when identifying specific DNA. Chimera formation can happen physically to the DNA molecules during sequencing library preparation or during the sequencing process on some platforms, and chimeras can also be created by assembly tools when they try to piece together a genome.

During the development of the tool, the team compared genome assemblies with and without using Alvis chimera detection. The vector image produced shows an example output, where the intuitive tool tracks all reads it recognises as chimeras.

"Although chimeric sequences don't make up a large proportion of samples, they can have a significant effect, so we have to be careful that we have identified them during analysis," said Dr Martin.

"In the Alvis diagram example of chimera data, each rectangle across the page represents a read, and the coloured blocks inside them represent alignments. Most chimeras are easy to see because their alignments are different colours, meaning they map to different genomes. Others are more subtle because both alignments are to the same genome, but different regions."

The Alvis tool can pinpoint visualisation of only chimeric sequences for further inspection, and output numerical data describing the chimeras. This demonstrates that by applying the tool and then bioinformatically splitting the chimeras, the quality of the assemblies is significantly improved.
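The idea of a chimeric read described above (adjacent parts of one read aligning to different genomes, or to distant regions of the same genome) can be illustrated with a short Python sketch. This is not the Alvis algorithm: the `Alignment` record layout, the function name, and both thresholds are assumptions chosen for illustration, and strand is ignored.

```python
from collections import namedtuple

# One alignment of a read segment to a target sequence (hypothetical layout).
Alignment = namedtuple("Alignment", "q_start q_end target t_start t_end")


def is_chimeric(alignments, max_query_overlap=50, max_target_gap=10_000):
    """Flag a read whose adjacent query segments map to different places."""
    ordered = sorted(alignments, key=lambda a: a.q_start)
    for left, right in zip(ordered, ordered[1:]):
        # Skip pairs that cover largely the same part of the read.
        if left.q_end - right.q_start > max_query_overlap:
            continue
        if left.target != right.target:
            return True          # different genomes or contigs
        # Distance between the two target intervals (negative if they overlap).
        gap = max(left.t_start, right.t_start) - min(left.t_end, right.t_end)
        if gap > max_target_gap:
            return True          # same target, but distant regions
    return False


read = [
    Alignment(0, 1000, "genomeA", 100, 1100),
    Alignment(1000, 2000, "genomeB", 5, 1005),
]
print(is_chimeric(read))
```

Once the junction between the two segments is known, splitting the read there yields two non-chimeric sequences, which is the kind of "bioinformatic splitting" the text refers to.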

Alvis has been accessed over 600 times since being made available at the beginning of March this year. Dr Martin adds: "We hope that Alvis continues to be useful to other researchers working with, for example, nanopore sequencing, improving their understanding of their data by visualising alignments."

"Alignments are so fundamental to bioinformatics that it could be of use to anyone working with long read sequencing data, as well as alignments generated by sequencing data from short-read platforms. The diagrams that Alvis generates can be easily exported to directly use in publications, demonstrated in our study already."


The paper "Alvis: a tool for contig and read ALignment VISualisation and chimera detection" is published in BMC Bioinformatics.


PCNA: structure, functions and interactions

Proliferating cell nuclear antigen (PCNA) plays an essential role in nucleic acid metabolism as a component of the replication and repair machinery. This toroidal-shaped protein encircles DNA and can slide bidirectionally along the duplex. One of the well-established functions for PCNA is its role as the processivity factor for DNA polymerase delta and epsilon. PCNA tethers the polymerase catalytic unit to the DNA template for rapid and processive DNA synthesis. In the last several years it has become apparent that PCNA interacts with proteins involved in cell-cycle progression which are not a part of the DNA polymerase apparatus. Some of these interactions have a direct effect on DNA synthesis while the roles of several other interactions are not fully understood. This review summarizes the structural features of PCNA and describes the diverse functions played by the protein in DNA replication and repair as well as its possible role in chromatin assembly and gene transcription. The PCNA interactions with different cellular proteins and the importance of these interactions are also discussed.


‘Omics, Bioinformatics, Computational Biology

This section describes emerging technologies for understanding the behavior of cells, tissues, organs, and the whole organism at the molecular level using methods such as genomics, proteomics, systems biology, bioinformatics, as well as the computational tools needed to analyze and make sense of the data. These technologies have the potential to facilitate the development of a predictive toxicology based on models built with existing in vivo data (animal and human), as well as new and existing in vitro and in silico data.

-Omics

Technologies that measure some characteristic of a large family of cellular molecules, such as genes, proteins, or small metabolites, have been named by appending the suffix “-omics,” as in “genomics.” Omics refers to the collective technologies used to explore the roles, relationships, and actions of the various types of molecules that make up the cells of an organism.

These technologies include:

  • Genomics, “the study of genes and their function” (Human Genome Project (HGP), 2003)
  • Proteomics, the study of proteins
  • Metabonomics, the study of molecules involved in cellular metabolism
  • Transcriptomics, the study of the mRNA
  • Glycomics, the study of cellular carbohydrates
  • Lipomics, the study of cellular lipids

Omics technologies provide the tools needed to look at the differences in DNA, RNA, proteins, and other cellular molecules between species and among individuals of a species. These types of molecular profiles can vary with cell or tissue exposure to chemicals or drugs and thus have potential use in toxicological assessments. Omics experiments can often be conducted in high-throughput assays that produce tremendous amounts of data on the functional and/or structural alterations within the cell. “These new methods have already facilitated significant advances in our understanding of the molecular responses to cell and tissue damage, and of perturbations in functional cellular systems” (Aardema & MacGregor, 2002).

The -omics technologies will continue to contribute to our understanding of toxicity mechanisms. Regulators are interested in these new technologies but are still sorting out how to incorporate the new information and technologies in regulatory decision making. For example, the US Food and Drug Administration’s Pharmacogenomic Data Submissions guidance document encourages the voluntary submission of genomics data but notes that the field of pharmacogenomics is still in its early developmental stages.

Bioinformatics

Bioinformatics is “the science of managing and analyzing biological data using advanced computing techniques” (HGP, 2003). Bioinformatics tools include computational tools that mine information from large databases of biological data. These tools are most commonly used to analyze large sets of genomics data. However, bioinformatics tools are also being developed for other types of biological data, such as proteomics.

The US National Center for Biotechnology Information (NCBI) serves as an integrated source of genomics information and bioinformatics tools for researchers. An important bioinformatics tool available at NCBI for proteomics and genomics is the Basic Local Alignment Search Tool (BLAST), which compares gene or protein sequences against databases that contain many archived sequences, in order to find regions of local similarity. The statistical significance of the sequence matches is then calculated, and the results can be used to infer functional and evolutionary relationships.
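To make "regions of local similarity" concrete, the Python sketch below computes the best local alignment score between two sequences with the Smith-Waterman dynamic programme. BLAST itself is a faster heuristic that approximates this kind of search and adds statistical significance estimates; the function name and the scoring values here are arbitrary choices for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# The two sequences share the local region "ACGT" (4 matches).
print(local_alignment_score("TTTACGTAAA", "GGACGTCC"))
```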

Computational Biology

Bioinformatics and databases of biological information can be used to generate “maps” of cellular and physiological pathways and responses. This integrative approach is called computational biology. “Bioinformatics is used to abstract knowledge and principles from large-scale data, to present a complete representation of the cell and the organism, and to predict computationally systems of higher complexity, such as the interaction networks in cellular processes and the phenotypes of whole organisms” (Bayat, 2002).

Systems Biology is an integration of data from all levels of complexity (genomics, proteomics, metabolomics, and other molecular mechanisms) using “advanced computational methods to study how networks of interacting biological components determine the properties and activities of living systems” (HGP, 2003). The goal is to create overall computational models of the functioning of the cell, multicellular systems, and ultimately the organism. These in silico models will provide virtual test systems for evaluating the toxic responses of cells, tissues, and organisms.

Compounds will be tested in simulation studies before being applied to cells and tissues to obtain comparative results and validation of the system.

  • Basic lessons in pathway analysis are provided in the following articles and slide lectures: Viswanathan et al., 2008; Khatri et al., 2012; Vijay et al., 2012; MD Anderson Cancer Center; NIEHS.
  • Some of the tools and databases useful for pathway analysis are compiled in Wikiomics.
  • The Databases section of AltTox is another resource.

Useful Concepts and Terms

Toxicity Pathways: “Today, cell biologists working in many fields are enhancing knowledge of cellular-response networks and elucidating the manner in which environmental agents perturb pathways to cause changes in cell behaviors. The NRC report defined toxicity pathways as biologic pathways that, when sufficiently perturbed, can lead to adverse health outcomes. Despite this new terminology, toxicity pathways are actually normal cellular-response pathways that can be targeted by environmental agents. A parallel exists in the field of carcinogenesis, in which genes that code for proteins involved in cell growth are designated as oncogenes or tumor suppression genes” (Andersen et al., 2008).

Adverse Outcome Pathway (AOP): The term AOP was developed as a framework for translating the mechanistic information derived from molecular, biochemical, and computational studies into endpoints that can be used to support chemical risk assessments. “An AOP is a conceptual construct that portrays existing knowledge concerning the linkage between a direct molecular initiating event and an adverse outcome at a biological level of organization relevant to risk assessment” (Ankley et al., 2010).

Genomics: The first of the -omics technologies to be developed, genomics has resulted in massive amounts of DNA sequence data requiring great amounts of computer capacity. Genomics has progressed beyond sequencing of organisms (structural genomics) to identifying the function of the encoded genes (functional genomics).

The genome of each species is distinctive, but smaller genomic differences are also observed between each individual of a species. It was originally thought that obtaining the sequence of the human genome would immediately tell us the identity of the human genes. The genome has proved to be much more complex.

When a gene is expressed it results in the production of a messenger RNA and ultimately a particular protein. Gene expression is not fully understood, but involves regulatory sequences within the DNA and the binding of specific regulatory proteins to these sequences. The expression and regulation of the regulatory proteins is another level of control. Whether a particular gene is expressed in an organism can be influenced by various genetic and environmental factors.

The DNA sequences of a gene that code for a protein are called exons, and they are interspersed with DNA called introns, which do not code for proteins. The intron sequences, previously thought to be nonsense material, are now known to also contain important information. Although the sequencing of the human genome was completed in 2003 (HGP, 2003), the identification of all of the genes within the human DNA sequence is not complete. Locating the beginning and ends of genes within the DNA remains a challenge.

Gene annotation is “adding pertinent information such as gene coded for, amino acid sequence or other commentary to the database entry of raw sequence of DNA bases” (HGP, 2003). This involves describing different regions of the code, identifying which regions can be called genes, and identifying other features such as exons and introns, start and stop codons, and so on.

For more background information on DNA, genes, and genome sequencing, take a look at the online book What’s a Genome?.

Epigenetics: Epigenetics refers to mechanisms that persistently alter gene expression without actual changes to the gene/DNA sequence. DNA methylation is an example of an epigenetic mechanism. Scientists have shown that DNA methylation is an important component in a variety of chemical-induced toxicities, including carcinogenicity, and is a mechanism that should be assessed in the overall hazard assessment (Watson & Goodman, 2002; Moggs et al., 2004).

Proteomics: Proteins are the primary structural and functional molecules in the cell, and are made up of a linear arrangement of amino acids. The linear polypeptide chains are folded into secondary and tertiary structures to form the functional protein. Unlike the static nature of the cell’s genes, proteins are constantly changing to meet the needs of the cell.

Characterizing the identity, function, regulation, and interaction of all of the cellular proteins of an organism, the proteome, will be a major achievement. Studies of changes in the proteome of cells and tissues exposed to toxic materials, compared to normal cells, are being used to develop an understanding of the mechanisms of toxicity. As proteomics tools become more powerful and widely used, protein and proteome changes in response to exposures to toxic substances (fingerprints or response profiles) will be developed into databases that can be used to classify exposure responses at various levels of organization of the organism, thus providing a predictive in silico toxicology tool.

Metabolomics: Metabolomics refers to the comprehensive evaluation of the metabolic state of a cell, organ or organism, in order to identify biochemical changes that are characteristic of specific disease states or toxic insults. Typical metabolomics experiments involve the identification and quantitation of large numbers of endogenous molecules in a biological sample (e.g., urine or blood) using chemical techniques such as chromatography and mass spectrometry. The output from these techniques is compared to computerized libraries of mass spectrometry tracings to facilitate identification of the compounds that are present. Environmental stresses such as exposure to chemicals or drugs alter the metabolic pathways in cells, and metabolite profiling can be used to assess toxic responses/exposures.

Biomarkers: Broadly defined, biomarkers are “characteristics [typically a biomolecule(s)] that can be objectively measured and evaluated as an indicator of normal biologic or pathogenic processes or pharmacological responses to a therapeutic intervention” (Cummins, 2007). Animal models are still commonly used to look for biomarkers relevant to human drug development, toxicity responses, and disease processes. To develop useful human biomarkers for toxicity, cell and tissue models that can express known biomarkers of toxicity need to be developed and validated against clinical samples. One challenge with toxicity biomarkers is that humans cannot be purposefully exposed to toxic materials to obtain clinical samples.

Relation of DNA (genes) to Proteins: Each gene is a linear stretch of DNA nucleotides that codes for the assembly of amino acids into a polypeptide chain (protein). DNA is transcribed into messenger RNA (mRNA) (transcription) which is then translated by the ribosomes into the amino acid chains that will make up the protein (translation).
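The two steps just described can be sketched in a few lines of Python. As a simplification, the input is taken to be the coding (sense) strand, so transcription reduces to replacing T with U; in the cell the mRNA is synthesized from the complementary template strand. The function names are illustrative, and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; '*' marks stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Return the mRNA corresponding to a coding-strand DNA sequence."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA, read from its first base, up to the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")
print(mrna, translate(mrna))
```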

Mutations are changes in DNA bases (insertions, deletions, translocations) that may result in changes to the proteins that are synthesized, or even prevent their synthesis. Chemicals that are mutagens can cause permanent heritable changes in the DNA sequence.

Regulation of Gene Expression: Some proteins are constitutively expressed (present all of the time), but cells can regulate the expression of proteins that are not needed all of the time or in large amounts. This provides cells with control mechanisms for turning metabolic reactions on and off. Cells use a variety of mechanisms to regulate gene expression, and thus which proteins are produced. Proteins can be controlled or regulated at the level of their synthesis (regulation of gene transcription), gene translation, various post-translation mechanisms and feedback inhibition, or the recently discovered actions of RNAi and microRNA.

Short interfering RNAs (siRNAs) are short double-stranded RNAs (dsRNAs) that can regulate gene expression. In eukaryotic cells, the enzyme Dicer produces siRNAs from longer dsRNAs. An siRNA can bind to its complementary messenger RNA (mRNA) and inhibit translation and/or induce the cell to destroy the mRNA. The phenomenon is called RNA interference (RNAi), and can be used in the lab to inhibit any gene in any kind of cell (Dove, 2007). “RNA interference has re-energized the field of functional genomics by enabling genome-scale loss-of-function screens in cultured cells” (Echeverri & Perrimon, 2006).

MicroRNA (miRNA) is a recently discovered class of small non-coding RNAs. Cells use miRNA to regulate the amount of protein synthesized by a gene by the mechanisms of translational inhibition and mRNA destabilization (Bushati & Cohen, 2007). Over 250 miRNAs have been discovered.

Microarrays: Genomics and proteomics research has been advanced through the development of experimental techniques that increase throughput, such as microarrays. Microarrays consist of DNA or protein fragments placed as small spots onto a slide, which are then used as “miniaturized chemical reaction areas” (HGP, 2003). The studies typically involve looking for changes in gene or protein expression patterns by cells or tissues under different conditions. Microarrays provide a platform for evaluating the changes in many (usually thousands of) genes or proteins simultaneously.

High Throughput Screening (HTS) consists of assays developed to produce and analyze many individual data points or results in one experiment. Assays using DNA or other microarrays or multiwell plates of cells that are processed using robotic systems are examples of HTS assays. The US National Toxicology Program identified HTS as an essential tool for screening the thousands of chemicals currently in the US marketplace for potential human toxicity.

Toxicogenomics compares the genes expressed in organisms that have been exposed to a drug, chemical, or toxin to those of unexposed organisms (negative controls). The up or down regulation of certain genes or groups of genes may be linked to toxic responses occurring in the organism, and to particular organs or cell types in that organism. The goal of toxicogenomics is to identify patterns of gene expression related to specific chemicals or chemical classes so that these expression patterns can be used as endpoints for assessing toxicity. Thus far, toxicogenomics has been useful in refining animal experiments and identifying mechanisms of toxicity in lab animals where exposures can be controlled. There have also been experiments evaluating gene expression in cell cultures exposed to toxicants, which has been used in limited applications for prediction of in vivo toxicity.
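Up- and down-regulation is commonly expressed as a log2 fold change between exposed and control samples. The Python sketch below assumes one expression value per gene in each condition; the gene names, the pseudocount, and the threshold of 1.0 (a twofold change) are illustrative assumptions, and real analyses also need replicates and statistical testing.

```python
import math


def log2_fold_changes(exposed, control, pseudocount=1.0):
    """log2(exposed / control) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((exposed[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in exposed.keys() & control.keys()
    }


def classify(fold_changes, threshold=1.0):
    """Label each gene as 'up', 'down' or 'unchanged'."""
    labels = {}
    for gene, value in fold_changes.items():
        if value >= threshold:
            labels[gene] = "up"
        elif value <= -threshold:
            labels[gene] = "down"
        else:
            labels[gene] = "unchanged"
    return labels


# Made-up expression values for three hypothetical genes.
exposed = {"geneA": 7, "geneB": 1, "geneC": 3}
control = {"geneA": 1, "geneB": 7, "geneC": 3}
print(classify(log2_fold_changes(exposed, control)))
```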

Pharmacogenetics looks at the differences in response to a particular drug that are due to variations in the genetic makeup of individuals. For example, human genetic variation has been implicated in the variability of responses (effectiveness and/or toxicity) seen with some chemotherapeutic drugs (Crews, 2006; Hahn et al., 2006).

Author(s)/Contributor(s):
Sherry L. Ward, PhD, MBA
AltTox Contributing Editor

AltTox Editorial Board reviewer(s):
George Daston, PhD
Procter & Gamble


How is output of DNA assembler measured? - Biology

This will download and compile all tools necessary to run a benchmark.

  • Cgmemtime, to measure runtime and maximum memory consumption
  • Quast, to estimate assembly quality
  • Everything needed to run each de novo assembly tool for which a wrapper is defined in the folder NanoMark/wrappers

All tools will be downloaded to the folder NanoMark/tools.

Run the benchmark with the following command: NanoMark.py --benchmark <reads_file> <reference_file>

Benchmark results will be stored in folder: /intermediate. Each benchmark will have its own folder with a randomly generated name, e.g. benchmark_129eed95-377d-4e8e-a5bf-309631e3df3d. Inside will be a folder for each assembler and a folder for Quast results (containing Quast results for each assembler).

Summary Quast and cgmemtime results for each assembler are stored in a benchmark folder in a .tsv file benchmark_summary.tsv. The file contains the following values for each assembler:

  • Real time : real execution time
  • CPU time : CPU execution time
  • Maximum RSS : maximum memory consumption
  • contigs : number of contigs larger than 500 bp generated by an assembler
  • misassemblies : number of misassemblies generated by an assembler
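To illustrate what the contigs value measures, here is a minimal sketch that counts contigs in FASTA-formatted text and also computes N50, a metric that Quast reports as well. The function names are hypothetical and are not part of NanoMark or Quast. The 500 bp threshold is applied as "at least 500 bp", which is an assumption about how the cutoff is interpreted.

```python
def contig_lengths(fasta_text):
    """Return the length of every sequence in FASTA-formatted text."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def count_contigs(lengths, min_length=500):
    """Count contigs of at least min_length bases (assumed threshold semantics)."""
    return sum(1 for n in lengths if n >= min_length)


def n50(lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of the assembly size."""
    total = sum(lengths)
    running = 0
    for n in sorted(lengths, reverse=True):
        running += n
        if 2 * running >= total:
            return n
    return 0  # empty assembly
```

For contigs of 600, 300 and 1000 bases, count_contigs gives 2 and n50 gives 1000, because the longest contig alone already covers more than half of the 1900 assembled bases.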

The Quast-generated fields that appear in the summarized results are listed in the "qfields" list in the function "summarize_results" in NanoMark.py. New fields can be added by modifying that list.
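The idea of selecting report fields by name can be sketched as follows. This is not the code from NanoMark.py: the function select_quast_fields and the example field selection are hypothetical. It assumes a tab-separated Quast report with the metric name in the first column and the value for one assembly in the second.

```python
import csv
import io

# Hypothetical selection of Quast metric names to keep in a summary.
QFIELDS = ["# contigs", "N50", "# misassemblies"]


def select_quast_fields(report_text, fields=QFIELDS):
    """Return {metric name: value} for the requested rows of a
    tab-separated report (metric name first, value second)."""
    selected = {}
    for row in csv.reader(io.StringIO(report_text), delimiter="\t"):
        if len(row) >= 2 and row[0] in fields:
            selected[row[0]] = row[1]
    return selected


example_report = (
    "Assembly\texample_assembler\n"
    "# contigs\t12\n"
    "Total length\t4600000\n"
    "N50\t750000\n"
    "# misassemblies\t3\n"
)
summary = select_quast_fields(example_report)
```

Adding a metric to the summary then amounts to appending its name to the list, which mirrors the way the text describes extending "qfields".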

Including new assemblers in the benchmark

The benchmarking tool currently includes the following assembly tools:

  • Loman, Quick and Simpson assembly pipeline (http://www.nature.com/nmeth/journal/v12/n8/full/nmeth.3444.html)
  • PBcR
  • FALCON
  • SPAdes
  • ALLPATHS-LG

Additional assembly tools can be included by writing a wrapper script in Python. Each assembler that needs to be included in the benchmark must have a corresponding wrapper in folder: /wrappers. Wrapper script filenames must start with "wrapper_".

Each wrapper must define three variables:

  • Installation path (ASSEMBLER_PATH)
  • Assembler name written to benchmark summary .tsv file (ASSEMBLER_NAME)
  • Results file filename (must include relative path from assembler installation path) (ASSEMBLER_RESULTS)

Each wrapper must also define two functions:

  • download_and_install() : installs the assembler and makes it ready to run
  • run(reads_file, reference_file, machine_name, output_path, output_suffix='') : runs the assembler on the given reads and reference files; results are stored in the given output folder (output_path). The parameters machine_name and output_suffix are currently not used, but must be included in the function header for compatibility.

Included wrappers are good examples of wrapper implementation.
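As a rough illustration of the requirements above, a wrapper skeleton might look like the sketch below. The assembler "ExampleAsm", its paths, its command-line flags and the helper build_command are all hypothetical. Only the three variable names and the two function signatures come from the description above.

```python
import os
import subprocess

# The three required variables (all values here are hypothetical).
ASSEMBLER_PATH = os.path.join("tools", "exampleasm")      # installation path
ASSEMBLER_NAME = "ExampleAsm"                             # name written to the summary .tsv
ASSEMBLER_RESULTS = os.path.join("out", "contigs.fasta")  # relative to ASSEMBLER_PATH


def download_and_install():
    """Install the assembler and make it ready to run.

    A real wrapper would download and build the tool into ASSEMBLER_PATH;
    this placeholder does nothing.
    """
    return None


def build_command(reads_file, output_path):
    """Hypothetical helper: assemble the command line for the made-up tool."""
    binary = os.path.join(ASSEMBLER_PATH, "exampleasm")
    return [binary, "--reads", reads_file, "--out", output_path]


def run(reads_file, reference_file, machine_name, output_path, output_suffix=''):
    """Run the assembler; results go to output_path.

    machine_name and output_suffix are unused but kept for compatibility.
    """
    os.makedirs(output_path, exist_ok=True)
    subprocess.check_call(build_command(reads_file, output_path))
```

Saved under a filename starting with "wrapper_" (for example wrapper_exampleasm.py, a made-up name), such a file would follow the naming rule stated above. The wrappers shipped with NanoMark remain the authoritative examples.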

Installing an assembler (often requires sudo access):

Running the assembly process consists of specifying all reads files in the form:

More than one dataset can be described in the same command line, simply by listing them separated by spaces.
Reads_type can be one of: nanopore/pacbio/single/paired/mate. If reads_type is neither "paired" nor "mate", the last three parameters can be omitted.
If reads_type is "paired" or "mate", the other end of the pair must be in a separate file given by reads_path_b.

This work has been supported in part by the Croatian Science Foundation under the project UIP-11-2013-7353.


HiFi DNA Assembly Protocol

  1. Set up the following reaction on ice:
    Recommended Amount of Fragments Used for Assembly

    | | 2–3 Fragment Assembly* | 4–6 Fragment Assembly** | NEBuilder Positive Control† |
    |---|---|---|---|
    | Recommended DNA Molar Ratio | vector:insert = 1:2 | vector:insert = 1:1 | |
    | Total Amount of Fragments | 0.03–0.2 pmols* (X µl) | 0.2–0.5 pmols** (X µl) | 10 µl |
    | NEBuilder HiFi DNA Assembly Master Mix | 10 µl | 10 µl | 10 µl |
    | Deionized H2O | 10-X µl | 10-X µl | 0 |
    | Total Volume | 20 µl†† | 20 µl†† | 20 µl |

    * Optimized cloning efficiency is 50–100 ng of vector with 2-fold excess of each insert. Use 5-fold molar excess of any insert(s) less than 200 bp. Total volume of unpurified PCR fragments in the assembly reaction should not exceed 20%. To achieve optimal assembly efficiency, design 15–20 bp overlap regions between each fragment.
    ** To achieve optimal assembly efficiency, design 20–30 bp overlap regions between each fragment with equimolarity of all fragments (suggested: 0.05 pmol each).
    † Control reagents are provided for 5 experiments.
    †† If greater numbers of fragments are assembled, increase the volume of the reaction, and use additional NEBuilder HiFi DNA Assembly Master Mix.

Note: Extended incubation up to 60 minutes may help to improve assembly efficiency in some cases (for further details see FAQ section).
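The fragment amounts above are given in pmol, while DNA is usually quantified in ng. A minimal sketch of the conversion follows, using the common approximation of 650 daltons per base pair of double-stranded DNA. The function names are hypothetical, and the result is an estimate rather than a substitute for the manufacturer's own calculator.

```python
DALTONS_PER_BP = 650.0  # approximate average mass of one base pair of dsDNA


def ng_to_pmol(mass_ng, length_bp):
    """Convert a mass of dsDNA in ng to pmol for a fragment of length_bp."""
    return mass_ng * 1000.0 / (length_bp * DALTONS_PER_BP)


def pmol_to_ng(amount_pmol, length_bp):
    """Convert pmol of a dsDNA fragment of length_bp back to ng."""
    return amount_pmol * length_bp * DALTONS_PER_BP / 1000.0


def insert_mass_ng(vector_ng, vector_bp, insert_bp, molar_excess=2.0):
    """Mass of insert (ng) that gives the requested molar excess over the vector."""
    vector_pmol = ng_to_pmol(vector_ng, vector_bp)
    return pmol_to_ng(vector_pmol * molar_excess, insert_bp)
```

For example, 100 ng of a 5000 bp vector is about 0.031 pmol, and a 2-fold molar excess of a 1000 bp insert over 50 ng of that vector works out to about 20 ng of insert.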


Environmental and Related Biotechnologies

6.01.3.6 Metagenomics Techniques

Metagenomics is the study of the entire complement of genetic material (usually total DNA, but sometimes also including cDNA prepared by reverse transcription of extracted community RNA) recovered directly from environmental samples or complex consortia (e.g., within a waste-treatment system or bioreactor). It can be a challenge to obtain DNA that is truly representative of an entire community from some natural environments such as chemically contaminated soils. Some bacteria and spores are much harder to lyse to release their nucleic acids than are others. It is also important that co-extracted contaminants (e.g., humic substances) that might interfere with downstream techniques, such as PCR, be removed during the DNA extraction and purification process. Hundreds of papers in the literature discuss such problems, and commercial DNA and RNA extraction kits are available that work most of the time; however, every soil or environmental matrix is its own special case, and extraction and purification procedures often need to be adjusted to obtain clean nucleic acids for further study.

Metagenomics allows studies of organisms that are not easily cultured in a laboratory. This includes the vast majority of microorganisms, because most have never been cultured. The technique is also particularly useful for studying organisms in their natural environment, even in the presence of thousands of cohorts [20]. Crawford et al. [21] used PCR with degenerate primers to amplify, from pentachlorophenol (PCP)-contaminated soil, a variety of genes encoding the enzyme PCP-4-monooxygenase (PcpB), the initial enzyme in the catabolic pathway for degradation of PCP in numerous sphingomonads. Their work supported the idea that pcpB/PcpB can be considered a model system for studying the recent evolution of catabolic pathways among bacteria that degrade xenobiotic molecules introduced into the environment during the recent past [21]. For more details on the use of metagenomics in bioremediation, the reader is referred to Chapter 6.06, Microbial Degradation of Polychlorinated Biphenyls.