What is the difference between second and third generation sequencing?

I am writing the section about the history of DNA sequencing in my introduction chapter, and after reading quite a few research papers I am still confused about the different generations. Here are some questions to make sure that I understand them correctly.

  1. Is second generation sequencing the same as next generation sequencing?
  2. Is Sanger sequencing the first generation?
  3. As of today, is there any commercial 3rd generation sequencing technology in use (or is it still in development)?

If there is any paper reference that could explain all of the above then it would be really great.

I'm assuming you mean DNA sequencing (excluding things like RNA-seq).

Is Sanger sequencing the first generation?

From Metzker 2010:

The automated Sanger method is considered as a "first-generation" technology

Some of the technology that was in development when this review was written is no longer in development, but this is still an excellent review and a great place to start.

Is second generation sequencing the same as next generation sequencing?

Also from Metzker 2010:

and newer methods are referred to as next-generation sequencing (NGS).

The "newer methods" in the review are:

  • Emulsion PCR based like Roche/454
  • Solid phase amplification like Illumina/Solexa
  • Single molecule like Helicos BioSciences and Pacific Biosciences

As of today, is there any commercial 3rd generation sequencing technology in use (or is it still in development)?

The development of single molecule methods (third item in the above list) has historically taken longer than the other two, so sometimes they are referred to as third generation. However, note how Metzker above considers them all "next-gen".

A recent paper from PacBio itself draws the distinction based on whether sequencing halts between read steps:

Second-generation sequencing (SGS): Sequencing of an ensemble of DNA molecules with wash-and-scan techniques.

Third-generation sequencing (TGS): Sequencing single DNA molecules without the need to halt between read steps (whether enzymatic or otherwise).

Ultimately, I think the "generations" are more useful to make the "Sanger vs. Not Sanger" distinction. Sanger sequencing is very different than next-gen in terms of what kind of experiment you can do and what sort of data you obtain.

Past that, the number of a generation is more of a marketing term. For instance, at the bottom of the PacBio paper:

Conflict of Interest statement. All authors are employees of Pacific Biosciences and own stock in the company.

In practice, PacBio competes with second generation platforms. It is in their interest to say their technology is on "another level" compared to what's available on the market, but in practice, this is debatable.

Anyhow, to answer your question:

  • There is a technology in use and commercially available which has been referred to as third gen (PacBio)
  • There was a technology in development that falls under most definitions of third gen (Helicos), which went bankrupt in 2012
  • There is a technology in development (Oxford Nanopore MinION) that is single-molecule based and was introduced after PacBio, so if PacBio is third generation, so is MinION (although the MinION is apparently very different from PacBio)

Lentiviral Guide

To increase the safety of lentivirus, the components necessary for virus production are split across multiple plasmids (3 for 2nd-generation systems, 4 for 3rd-generation systems). The components of both systems are as follows:

  • Lentiviral transfer plasmid encoding your insert of interest. The transgene sequence is flanked by long terminal repeat (LTR) sequences, which facilitate integration of the transfer plasmid sequences into the host genome. Typically it is the sequences between and including the LTRs that are integrated into the host genome upon viral transduction. Many lentiviral transfer plasmids are based on the HIV-1 virus. For safety reasons, transfer plasmids are all replication incompetent and may contain an additional deletion in the 3'LTR, rendering the virus “self-inactivating” (SIN) after integration.
  • Packaging plasmid(s)
  • Envelope plasmid

Addgene’s packaging and envelope plasmids are generalized and appropriate for varied cell types and systems. When planning your experiment, the important component to consider and optimize is the transfer plasmid. 2nd-generation lentiviral plasmids utilize the viral LTR promoter for gene expression, whereas 3rd-generation transfer plasmids utilize a hybrid LTR promoter (more information on this below). Additional or specialized promoters may also be included within a transfer plasmid: for example, the U6 promoter is included in the pSico plasmid to drive shRNA expression. Other features that can be included in transfer plasmids are Tet- or Cre-based regulation and fluorescent fusions or reporters.


Next-generation sequencing (NGS) technologies have dramatically changed genomic research. NGS instruments, the so-called second-generation sequencers, generate large volumes of data compared with conventional Sanger sequencers. Before 2010, although the cost of reading a whole genome was rapidly decreasing, the use of NGS technologies was still limited to large genome sequencing centers because of technical and logistical difficulties associated with the operation of the instruments and requirements for computer hardware and data analysis. The advent of benchtop sequencers has accelerated sequencing efforts in small centers and laboratories. For example, the 454 GS Junior (GS Jr), released by Roche in early 2010 as the first benchtop sequencer, uses the same emulsion PCR technology [1] as the Roche GS FLX. The Life Technologies Ion PGM (Ion PGM) benchtop sequencer, which was launched at the beginning of 2011, utilizes semiconductor technology [2]. The Illumina MiSeq (MiSeq) benchtop sequencer became available at the end of 2011 and employs the same sequencing-by-synthesis technology [3, 4] as the Illumina GAII and HiSeq sequencers. With the annual emergence of new NGS instruments, experimental procedures such as library preparation and analysis methods require continual improvement.

Second-generation sequencers generate massive amounts of short reads, which differ in throughput and length from reads produced by Sanger sequencers. To assemble massive amounts of short reads, a new type of algorithm using de Bruijn graphs has flourished, as illustrated by a series of genome assemblers including ABySS [5], ALLPATHS-LG [6], Velvet [7, 8], and SOAPdenovo [9]. Although these algorithms [5–9] have been developed to produce high-quality finished-grade genomes, it remains a challenge to assemble long contigs spanning an entire genome. One of the important factors in successfully obtaining finished genomes is resolving repetitive regions scattered across the genome. It is problematic to reconstruct long repetitive regions by assembling reads shorter than the repetitive regions. Paired ends and mate pairs have been used to tackle this problem. Mate pairs improved scaffold length, but the results using mate-pair assembly have usually been far from finished grade [10, 11].
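To make the de Bruijn graph idea concrete, here is a minimal sketch in Python. It is not how ABySS, Velvet, or the other assemblers named above are implemented; the function names, the toy reads, and the choice of k are assumptions made purely for illustration. Each k-mer in a read becomes an edge between its leading and trailing (k-1)-mers, and a contig is extended only while the path has exactly one possible continuation.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def extend_contig(graph, start):
    """Walk from `start` while the next node is unambiguous (exactly one successor)."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop forever on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig

# Hypothetical overlapping reads from a toy sequence without repeats.
unique_reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
print(extend_contig(de_bruijn_graph(unique_reads, k=4), "ATG"))

# A toy read containing the repeat "GCT": the walk stops where the graph branches.
repeat_reads = ["AAGCTGCTC"]
print(extend_contig(de_bruijn_graph(repeat_reads, k=3), "AA"))
```

In the first case the three reads merge into a single 12-base contig. In the second, the repeated "GCT" gives the node "CT" two possible successors, so the walk stops early. This is the small-scale version of why repeats longer than the reads break contigs.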

To address this issue, reads longer than repetitive regions may offer a solution to the assembly problem. The recently launched third-generation Pacific Biosciences RS sequencer (PacBio) system [12] generates long reads with a mean length of 4.5 kbp and with randomly distributed sequencing errors. This evolutionary technology demands a new algorithm to process sequence reads because of the different nature of its reads, whose nucleotide-level accuracy is only 85% [12]. Therefore, several algorithms first correct sequencing errors in reads and then assemble the error-corrected reads [13–15]. PacBio has the advantage of generating long reads but at a throughput lower than that of the second-generation sequencers. One of the disadvantages of PacBio is that the initial installation is more expensive than that of benchtop second-generation sequencers (Additional file 1: Table S1). Combining second- and third-generation sequencing data may be an option [13, 16]; however, these hybrid methods offer limited efficiency because they require more labor and consumables costs for additional library preparation.
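The reason error correction can work at all on reads that are only 85% accurate is that randomly distributed errors rarely coincide, so a consensus over several reads covering the same base is far more reliable than any single read. The sketch below puts rough numbers on that intuition. It is a simplification, not the method used by the correction tools cited above: it assumes errors are independent, treats every error as a substitution (real reads must first be aligned, since they also contain insertions and deletions), and pessimistically assumes that all erroneous reads agree on the same wrong base. The function name is hypothetical.

```python
from math import comb

def majority_correct_probability(depth, accuracy=0.85):
    """Probability that more than half of `depth` independent reads show the true base."""
    needed = depth // 2 + 1
    return sum(
        comb(depth, k) * accuracy**k * (1 - accuracy) ** (depth - k)
        for k in range(needed, depth + 1)
    )

# Consensus accuracy at a few (odd) coverage depths, using the 85% figure from the text.
for depth in (1, 5, 15, 31):
    print(depth, f"{majority_correct_probability(depth):.8f}")
```

Under these assumptions the consensus accuracy climbs quickly with depth, which is consistent with the strategy of correcting reads first and assembling afterwards.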

Given that various sequencing instruments and software are available for genome sequencing and are evolving, selecting the best one or the best combination is difficult. Performance comparisons of NGS instruments, including that of a third-generation sequencer, have been previously published [17–21]; however, considering the rapid improvement of NGS technologies, frequent comparisons are valuable for selecting the platform providing the best results. Therefore, we performed an updated comparison study of second- and third-generation sequencers using the bacterial genome of Vibrio parahaemolyticus, consisting of two chromosomes. Because of the presence of two chromosomes with higher copy numbers of rRNA operons than found in other bacteria, it was difficult to finish the genome sequence [21]. In this study, we demonstrated the reconstruction of the V. parahaemolyticus genome using current sequencers.

Potential uses of NGS in clinical practice

Clinical genetics

There are numerous opportunities to use NGS in clinical practice to improve patient care, including:

NGS captures a broader spectrum of mutations than Sanger sequencing

The spectrum of DNA variation in a human genome comprises small base changes (substitutions), insertions and deletions of DNA, large genomic deletions of exons or whole genes and rearrangements such as inversions and translocations. Traditional Sanger sequencing is restricted to the discovery of substitutions and small insertions and deletions. For the remaining mutations dedicated assays are frequently performed, such as fluorescence in situ hybridisation (FISH) for conventional karyotyping, or comparative genomic hybridisation (CGH) microarrays to detect submicroscopic chromosomal copy number changes such as microdeletions. However, these data can also be derived from NGS sequencing data directly, obviating the need for dedicated assays while harvesting the full spectrum of genomic variation in a single experiment. The only limitations reside in regions which sequence poorly or map erroneously due to extreme guanine/cytosine (GC) content or repeat architecture, for example, the repeat expansions underlying Fragile X syndrome, or Huntington's disease.

Genomes can be interrogated without bias

Capillary sequencing depends on prior knowledge of the gene or locus under investigation. NGS, however, is completely unselective and can be used to interrogate full genomes or exomes to discover entirely novel mutations and disease-causing genes. In paediatrics, this could be exploited to unravel the genetic basis of unexplained syndromes. For example, a nationwide project, Deciphering Developmental Disorders, 1 running at the Wellcome Trust Sanger Institute in collaboration with NHS clinical genetics services aims to unravel the genetic basis of unexplained developmental delay by sequencing affected children and their parents to uncover deleterious de novo variants. Allying these molecular data with detailed clinical phenotypic information has been successful in identifying novel genes mutated in affected children with similar clinical features.

The increased sensitivity of NGS allows detection of mosaic mutations

Mosaic mutations are acquired as a postfertilisation event and consequently they present at variable frequency within the cells and tissues of an individual. Capillary sequencing may miss these variants as they frequently present with a subtlety which falls below the sensitivity of the technology. NGS sequencing provides a far more sensitive read-out and can therefore be used to identify variants which reside in just a few per cent of the cells, including mosaic variation. In addition, the sensitivity of NGS sequencing can be increased further, simply by increasing sequencing depth. This has seen NGS employed for very sensitive investigations such as interrogating foetal DNA from maternal blood 2 or tracking the levels of tumour cells from the circulation of cancer patients. 3
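The claim that sensitivity rises with sequencing depth can be illustrated with a simple binomial calculation. The sketch below is a toy model only: it assumes each read samples the locus independently, ignores sequencing error, and uses a hypothetical calling rule (at least `min_reads` variant-supporting reads) that is not taken from the text. The 2% variant read fraction is likewise just an example value.

```python
from math import comb

def detection_probability(depth, variant_fraction, min_reads=3):
    """Probability that at least `min_reads` of `depth` reads carry the variant."""
    p_too_few = sum(
        comb(depth, k) * variant_fraction**k * (1 - variant_fraction) ** (depth - k)
        for k in range(min_reads)
    )
    return 1 - p_too_few

# A variant present in 2% of reads, examined at increasing depth.
for depth in (30, 100, 500, 1000):
    print(depth, f"{detection_probability(depth, 0.02):.4f}")
```

At low depth such a variant is usually missed, while at several hundred reads it is detected almost every time under this model, which is the arithmetic behind "simply increasing sequencing depth".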

The Sanger Sequencing Method

The Sanger sequencing method relies on dideoxynucleotides (ddNTPs), analogues of deoxynucleoside triphosphates (dNTPs) that lack a 3′ hydroxyl group and have a hydrogen atom instead. When a ddNTP is incorporated into the growing DNA strand, it terminates replication because no further base can be attached to it. To perform Sanger sequencing, you add your primers to a solution containing the DNA to be sequenced, then divide the solution into four PCR reactions. Each reaction contains a dNTP mix together with one of the four ddNTPs (ddATP, ddTTP, ddGTP, or ddCTP). At the end of the PCR, each of your four reactions will yield products of various lengths because replication is randomly terminated wherever the ddNTP is incorporated. By running the samples on a gel with four lanes, you can piece together the sequence, since every fragment was copied from the same original material. Here is an example:

Your sequence is ATGCTCAG.

Your four reactions give you:
Reaction with ddATP: A, ATGCTCA, ATGCTCAG
Reaction with ddGTP: ATG, ATGCTCAG
Reaction with ddTTP: AT, ATGCT, ATGCTCAG
Reaction with ddCTP: ATGC, ATGCTC, ATGCTCAG

Once run on a gel, the reactions would look something like this (image by Olwen Reina):

Each band represents a fragment of a different length. For example, the band right under the “A” represents the sequence “ATGCTCA”.
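The logic of the four reactions and the gel read-out can be sketched in a few lines of Python. This is an idealised illustration, not a simulation of real chemistry: it assumes every possible termination product is present, it lists only the terminated fragments (the listing above also shows the unterminated full-length product), and the function names are made up for this example.

```python
def termination_fragments(sequence):
    """For each ddNTP reaction, list the prefixes of `sequence` that end in that base."""
    reactions = {base: [] for base in "ACGT"}
    for end in range(1, len(sequence) + 1):
        fragment = sequence[:end]
        reactions[fragment[-1]].append(fragment)
    return reactions

def read_gel(reactions):
    """Recover the sequence by reading bands from shortest to longest across all lanes."""
    bands = sorted(
        (len(fragment), base)
        for base, fragments in reactions.items()
        for fragment in fragments
    )
    return "".join(base for _, base in bands)

reactions = termination_fragments("ATGCTCAG")
for base, fragments in reactions.items():
    print(f"dd{base}TP lane: {', '.join(fragments)}")
print("Read from gel:", read_gel(reactions))
```

Because every fragment length occurs in exactly one lane, sorting the bands by length and noting which lane each sits in spells out the original sequence, which is what a person does when reading the gel from bottom to top.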

Let’s imagine a party game. The game is a guessing game. Here is how it is played:

You are thinking of a number and the group has to guess it. The tricky part is that the number is 200 digits long. You are reading the digits of the number in your head without making a sound. Every so often a person interrupts you, and you tell them the single digit you were just thinking of and where it is in the sequence of 200. Each time you are interrupted, you have to start again. You leave after a few hours and the group has to figure out the 200-digit number. They have to piece together the information you gave them, for example that the 25th digit was 5, the 40th digit was 0, and so on. Using the information from their interruptions, they can reconstruct the number you were thinking of.

While this sounds like the lamest game in the world, it works very well for sequencing!

Unfortunately, Sanger sequencing is slow, expensive, and (previously) relied on radioactive materials. This pushed scientists to develop new and better forms of genome sequencing.

Next-Generation Sequencing Challenges

Over the past 10 years, next-generation sequencing (NGS) has grown by leaps and bounds. Outputs have gone up, and costs have come down—both by orders of magnitude. The NIH graph showing this progress is so overused that its main utility now is to help bored conference attendees fill in their “buzzword bingo” cards.

With well over 10,000 instruments installed around the world, we face a paradox: the current generation and the next generation are one and the same. “Next,” in the context of sequencing, has almost completely lost its meaning. We might as well accept that “next-generation sequencing” is now just “sequencing.”

The major platform companies have spent the past couple of years focusing on improving ease-of-use. Illumina’s newer desktop systems, such as the NextSeq, MiSeq, and MiniSeq systems, all operate via the use of reagent cartridges, reducing the number of manipulations and “hands on” time.

The Ion Torrent platforms from Thermo Fisher Scientific have historically been more difficult to use than the Illumina platforms. However, Thermo’s most recent system, the Ion S5, was specifically engineered to simplify the entire workflow, from library prep through data generation.

After hearing about sequencing’s many improvements—greater output, lower costs, and better ease of use—the casual observer may imagine that all of the hard work has been done and that all the barriers to progress have been removed. But the hard work has just started, and many challenges remain.

One of the first areas where problems can creep in is often the most overlooked—sample quality. Although platforms are often tested and compared using highly curated samples (such as the reference material from the Genome in a Bottle Consortium), real-world samples often present much more of a challenge.

For human sequencing, one of the most popular sample types is FFPE (formalin-fixed paraffin-embedded). FFPE is popular for a variety of reasons, not the least of which is the sheer abundance of FFPE samples. According to some estimates, over a billion FFPE samples are archived around the world. This number will continue to grow now that the storage of clinical samples in FFPE blocks has become an industry-wide standard practice.

Besides being widely available, FFPE samples often contain incredibly useful phenotypic information. For example, FFPE samples are often associated with medical treatment and clinical outcome data.

The problem with FFPE samples is that both the process of fixation and the storage conditions can cause extensive DNA damage. “In evaluating over 1,000 samples on BioCule’s QC platform, we’ve seen tremendous variability in the amount and types of damage in sample DNA, such as inter- and intrastrand crosslinks, accumulation of single-stranded DNA, and single-strand DNA breaks,” says Hans G. Thormar, Ph.D., co-founder and CEO of BioCule.

The variable amounts and types of damage, if ignored, can negatively affect the final results. “The impact on downstream applications such as sequencing can be profound: from simple library failures to libraries that produce spurious data, leading to misinterpretation of the results,” continues Dr. Thormar. Therefore, it is critical to properly assess the quality of each sample at the beginning of the sequencing project.

In next-generation sequencing workflows, samples of low or variable quality can corrupt downstream processes such as library preparation and ultimately confound analysis. Samples should be assessed for crosslinks, breaks, the accumulation of single-stranded DNA, and other forms of damage.

Library Prep

Although the major sequencing platform companies have spent years bringing down the cost of generating raw sequence, the same has not been true for library prep. Library prep for human whole-genome sequencing, at about $50 per sample, is still a relatively minor part of the total cost. But for other applications, such as sequencing bacterial genomes or low-depth RNA sequencing (RNA-seq), it can account for the majority of the cost.

Several groups are working on multiplexed homebrew solutions to bring the effective costs down, but there haven’t been many developments on the commercial front. One bright spot is in the development of single-cell sequencing solutions, such as the Chromium™ system from 10X Genomics, which uses a bead-based system for processing hundreds to thousands of samples in parallel.

“We see single-cell RNA-seq as the right way to do gene expression analysis,” insists Serge Saxonov, Ph.D., co-founder and CEO of 10X Genomics. “Over the next several years, much of the world will transition to single-cell resolution for RNA experiments, and we are excited for our platform to lead the way there.” For large projects, such as those required for single-cell RNA-seq, highly multiplexed solutions will be critical in keeping per-sample costs reasonably low.

Short Reads vs. Long Reads

Illumina’s dominance of the sequencing market has meant that the vast majority of the data that has been generated so far is based on short reads. Having a large number of short reads is a good fit for a number of applications, such as detecting single-nucleotide polymorphisms in genomic DNA and counting RNA transcripts. However, short reads alone are insufficient in a number of applications, such as reading through highly repetitive regions of the genome and determining long-range structures.

Long-read platforms, such as the RSII and Sequel from Pacific Biosciences and the MinION from Oxford Nanopore Technologies, are routinely able to generate reads in the 15–20 kilobase (kb) range, with individual reads of over 100 kb having been reported. Such platforms have earned the respect of scientists such as Charles Gasser, Ph.D., professor of molecular and cellular biology at the University of California, Davis.

“I am impressed with the success people have had with using the long-read methods for de novo genome assembly, especially in hybrid assemblies when combined with short-read higher fidelity data,” comments Dr. Gasser. “This combination of technologies makes it possible for a single investigator with a very small group and a minimal budget to produce a useable assembly from a new organism’s genome.”

To get the most out of these long-read platforms, however, it is necessary to use new methods for the preparation of DNA samples. Standard molecular biology methods haven’t been optimized for isolating ultra-long DNA fragments, so special care must be taken when preparing long-read libraries.

For example, vendors have created special “high molecular weight” kits for the isolation of DNA fragments >100 kb, and targeted DNA protocols have been modified to selectively enrich for large fragments of DNA. These new methods and techniques need to be mastered to ensure maximum long-read yield.

As an alternative to true long reads, some are turning to a specialized form of short reads called linked-reads, such as those from 10X Genomics. Linked-reads are generated by adding the same unique barcode to every short read generated from a single long DNA fragment, which is generally >100 kb. The barcodes are used to link together the individual short reads during the analysis process. This provides long-range genomic information, enabling the construction of large haplotype blocks and elucidation of complex structural information.
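The linking step can be pictured as a simple grouping operation. The sketch below is only a schematic of the idea, not the 10X Genomics pipeline: the barcodes, the mapped positions, and the function name are all invented for illustration, and it assumes each read has already been aligned to a single position on one chromosome.

```python
from collections import defaultdict

def span_by_barcode(reads):
    """Group (barcode, mapped_position) pairs and report the span each barcode covers."""
    positions = defaultdict(list)
    for barcode, position in reads:
        positions[barcode].append(position)
    return {bc: (min(pos), max(pos)) for bc, pos in positions.items()}

# Hypothetical short reads: each is only ~150 bases long, but reads sharing a
# barcode come from the same long fragment.
reads = [
    ("BC01", 1_000), ("BC02", 500_000), ("BC01", 45_300),
    ("BC01", 98_700), ("BC02", 561_200), ("BC02", 610_450),
]
for barcode, (start, end) in span_by_barcode(reads).items():
    print(barcode, start, end, f"span ~{(end - start) / 1000:.1f} kb")
```

Although no single read is long, the reads carrying one barcode together mark out a region on the order of 100 kb, which is the long-range information that haplotype phasing and structural analysis rely on.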

“Short-read sequencing, while immensely powerful because of high accuracy and throughput, can only access a fraction of genomic content,” advises Dr. Saxonov. “This is because genomes are substantially repetitive and much of the information in the genome is encoded at long scales.”

Some sequencing applications, such as the detection of single nucleotide polymorphisms, can be managed with short-read technology. Other applications, such as the detection of structural variants, may demand long-read technology, and some applications, such as the assembly of a new organism’s genome, may require a combined approach, with short reads providing accuracy and high throughput, where possible, and long reads coping with highly repetitive genomic regions.

Data Analysis

Another challenge facing researchers is the sheer amount of data being generated. The BAM file (a semicompressed alignment file) for a single 30X human whole-genome sample is about 90 GB. A relatively modest project of 100 samples would generate 9 TB of BAM files.

With a single Illumina HiSeq X instrument capable of generating over 130 TB of data per year, storage can quickly become a concern. For example, the Broad Institute is generating sequencing data at the rate of one 30X genome every 12 minutes—nearly 4,000 TB worth of BAM files every year.
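The storage figures quoted above follow from straightforward arithmetic, reproduced here as a quick check. The only inputs are the numbers in the text (90 GB per 30X BAM file, 100 samples, one genome every 12 minutes); the sketch assumes decimal units (1 TB = 1,000 GB) and a 365-day year of continuous operation, and the names are illustrative.

```python
GB_PER_BAM = 90                      # one 30X human whole-genome BAM, from the text
MINUTES_PER_YEAR = 365 * 24 * 60     # assumes uninterrupted operation

def storage_tb(samples, gb_per_sample=GB_PER_BAM):
    """Total BAM storage in terabytes, using 1 TB = 1,000 GB."""
    return samples * gb_per_sample / 1000

genomes_per_year = MINUTES_PER_YEAR / 12   # one genome every 12 minutes

print(storage_tb(100))               # 9.0 TB for a 100-sample project
print(genomes_per_year)              # 43800.0 genomes per year
print(storage_tb(genomes_per_year))  # 3942.0 TB, i.e. "nearly 4,000 TB"
```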

BAM files may be converted into VCF (variant call format) files, which contain information only on those bases that differ from the reference sequence. Although the VCF files are much smaller and easier to work with, it is still necessary to retain the raw sequence files if the researcher is to reprocess the data in the future.

As the cost of sequencing has come down, some have come to the conclusion that resequencing samples for which there is abundant material is easier and possibly even cheaper. And when it comes to analyzing this large amount of data, researchers are spoiled for choice. In fact, with well over 3,000 sequencing analysis tools listed at OMICtools (a directory operated by omicX), researchers can easily be overwhelmed when trying to find the best option.

Clinical Interpretation and Reimbursement

Finally, for clinical samples, there remains the challenge of delivering a consistent, reliable interpretation of the sequencing variants, especially as it pertains to patient care. A typical exome sample will have between 10,000 and 20,000 variants, whereas a whole-genome sample will generally have more than 3 million. To make things more manageable, the variants are often filtered based on their likelihood to cause disease.

To help guide clinicians, the American College of Medical Genetics and Genomics, the Association for Molecular Pathology, and College of American Pathologists have created a system for classifying variants. Categories include pathogenic, likely pathogenic, uncertain significance (which currently makes up the vast majority in exome and whole-genome samples), likely benign, and benign.

Such schemes, however, have their limitations. Even when a common classification scheme is used on identical datasets, different groups may come up with different interpretations. In a pilot study under the new system, the participating clinical laboratories agreed on their classifications only about 34% of the time.

In cases where there is disagreement or additional analysis is needed to interpret the results, the problem of reimbursement becomes the roadblock. Reimbursement of NGS-based tests can be a major challenge, but reimbursement for interpretation is nearly impossible.

“There’s no way for laboratories to bill for interpretation,” argues Jennifer Friedman, M.D., clinical investigator at Rady Children’s Institute for Genomic Medicine. “It’s a very valuable service that could be available, but nobody is really in that space.

“There’s no way to bill for it—insurance companies won’t pay for it. Despite increasing focus on precision medicine, whether interpretation is by the clinician or by the lab, this most important aspect is not recognized or valued by the healthcare payers.”

Until this changes, the analysis of these patient samples essentially has to be treated as a research project, an option generally available only in a research hospital setting, and only for a limited number of patients.

Looking Forward

As much advancement as there has been over the past several years, many challenges remain across the entire NGS workflow, from sample prep through data analysis. And as new advancements are made in the underlying technologies, new challenges will continue to emerge. Rising to these challenges will be critical to ensuring the wide adoption of these genomic technologies and to maximizing their impact on human health.

The Long and Short of Structural Variants

Although next-generation sequencing has contributed to rapid progress in our ability to detect single-base genetic variation, another entire category of variants has been left out of the picture due to the nature of the short-read sequences produced by these platforms. These variants are too small to detect with cytogenetic methods, but too large to reliably discover with short-read sequencing. This is no trivial matter: each human genome contains about 20,000 structural variants, and many have been shown to cause disease.

Single-molecule, real-time (SMRT) sequencing technology is solving the challenge of identifying these structural variants with high sensitivity, in part due to the fundamentally long reads it produces. SMRT sequencing produces reads that are many kilobases long (compared to 200 or 300 bases for short-read sequencers), so they can fully resolve most structural variants, such as insertions, deletions, duplications, inversions, and repeat expansions.

Many studies are now using long-read SMRT-sequence data for structural variant discovery. In a project presented last year at the American Society of Human Genetics, the NA12878 human sample was sequenced to 10-fold coverage on Pacific Biosciences’ Sequel System, and structural variants were called with the Baylor College of Medicine’s PBHoney tool.

This approach found nearly 90% of structural variants in the genome, based on a comparison to a Genome in a Bottle truth set. Furthermore, long-read coverage identified thousands of novel variants not found in short-read datasets, most of which were confirmed by de novo assembly.

As efforts turn to analysis of structural variants in large cohorts, it is important to strike a balance between sensitivity and cost. Low-fold SMRT-sequencing coverage has the potential to be an effective and affordable solution for structural variant discovery in human genomes, and the benefits apply to other complex genomes as well.

Since the Human Genome Project, the cost of sequencing genomes has decreased more than a thousand-fold.

Since the Human Genome Project the development of newer and better DNA sequencing technologies has led to the cost of sequencing genomes decreasing more than a thousand-fold. No sooner had next-generation sequencing reached the market than a third generation of sequencing was being developed.

SMRT enables scientists to effectively ‘eavesdrop’ on DNA polymerase.

One of these new technologies was developed by Pacific Biosciences and is called Single-Molecule Sequencing in Real Time (SMRT). This system involves a single-stranded molecule of DNA which attaches to a DNA polymerase enzyme. The DNA is sequenced as the DNA polymerase adds complementary fluorescently-labelled bases to the DNA strand. As each labelled base is added, the fluorescent colour of the base is recorded before the fluorescent label is cut off. The next base in the DNA chain can then be added and recorded.

SMRT is very efficient which means that fewer expensive chemicals have to be used. It is also incredibly sensitive, enabling scientists to effectively ‘eavesdrop’ on DNA polymerase and observe it making a strand of DNA.

The Single-Molecule Sequencing in Real Time (SMRT) developed by Pacific Biosciences.
Image credit: Genome Research Limited

SMRT can generate very long reads of sequence of 10-15 kilobases long.

SMRT can generate very long reads of sequence (10-15 kilobases long) from single molecules of DNA, very quickly. Producing long reads is very important because it is easier to assemble genomes from longer fragments of DNA. It also means that for small genomes the complete sequence can be obtained without the need for the expensive and time-consuming gap closing methods that other technologies require.

With third generation sequencing scientists can now begin to re-sequence genomes to achieve a higher level of accuracy.

With the introduction of such sensitive and cheap sequencing methods scientists can now begin to re-sequence genomes that have already been sequenced to achieve a higher level of accuracy. For example, using SMRT, Escherichia coli has now been sequenced to an accuracy of 99.9999 per cent!

Sequencing the human genome in this way won’t be possible for a while, but when it is, scientists predict that it will be possible to sequence an entire human genome in about an hour.

A mere decade on from the Human Genome Project and DNA is now being sequenced far quicker and more efficiently.

A graph showing how the speed of DNA sequencing technologies has increased since the early techniques in the 1980s.
Image credit: Genome Research Limited

Comparison of Next-Generation Sequencing Systems

With the fast development and wide application of next-generation sequencing (NGS) technologies, genomic sequence information is within reach to help decode life's mysteries, make better crops, detect pathogens, and improve quality of life. NGS systems are typically represented by SOLiD/Ion Torrent PGM from Life Sciences, Genome Analyzer/HiSeq 2000/MiSeq from Illumina, and GS FLX Titanium/GS Junior from Roche. Beijing Genomics Institute (BGI), which possesses the world’s biggest sequencing capacity, has multiple NGS systems including 137 HiSeq 2000, 27 SOLiD, one Ion Torrent PGM, one MiSeq, and one 454 sequencer. We have accumulated extensive experience in sample handling, sequencing, and bioinformatics analysis. In this paper, the technologies of these systems are reviewed, and first-hand data from extensive experience are summarized and analyzed to discuss the advantages and specifics associated with each sequencing system. Finally, applications of NGS are summarized.

1. Introduction

Deoxyribonucleic acid (DNA) was demonstrated to be the genetic material by Oswald Theodore Avery in 1944. Its double-helical structure, composed of four bases, was determined by James D. Watson and Francis Crick in 1953, leading to the central dogma of molecular biology. In most cases, genomic DNA defines the species and the individual, which makes the DNA sequence fundamental to research on the structures and functions of cells and to decoding life’s mysteries [1]. DNA sequencing technologies can help biologists and health care providers in a broad range of applications such as molecular cloning, breeding, finding pathogenic genes, and comparative and evolutionary studies. Ideally, DNA sequencing technologies should be fast, accurate, easy to operate, and cheap. In the past thirty years, DNA sequencing technologies and applications have undergone tremendous development and have acted as the engine of the genome era, which is characterized by a vast amount of genome data and, consequently, a broad range of research areas and applications. It is necessary to look back on the history of sequencing technology development to review the NGS systems (454, GA/HiSeq, and SOLiD), to compare their advantages and disadvantages, to discuss the various applications, and to evaluate the recently introduced PGM (personal genome machines) and third-generation sequencing technologies and applications. All of these aspects are described in this paper. Most data and conclusions are from independent users at BGI (Beijing Genomics Institute) who have extensive first-hand experience with these typical NGS systems.

Before discussing the NGS systems, we would like to review the history of DNA sequencing briefly. In 1977, Frederick Sanger developed a DNA sequencing technology based on the chain-termination method (also known as Sanger sequencing), and Walter Gilbert developed another sequencing technology based on chemical modification of DNA and subsequent cleavage at specific bases. Because of its high efficiency and low radioactivity, Sanger sequencing was adopted as the primary technology in the “first generation” of laboratory and commercial sequencing applications [2]. At that time, DNA sequencing was laborious and radioactive materials were required. After years of improvement, Applied Biosystems introduced the first automatic sequencing machine (namely AB370) in 1987, adopting capillary electrophoresis, which made sequencing faster and more accurate. AB370 could detect 96 bases at a time and 500 K bases a day, and the read length could reach 600 bases. The current model, AB3730xl, can output 2.88 M bases per day, and its read length has been able to reach 900 bases since 1995. Having emerged in 1998, the automatic sequencing instruments and associated software using capillary sequencing machines and Sanger sequencing technology became the main tools for the completion of the human genome project in 2001 [3]. This project greatly stimulated the development of powerful novel sequencing instruments to increase speed and accuracy while simultaneously reducing cost and manpower. In addition, the X Prize also accelerated the development of next-generation sequencing (NGS) [4]. The NGS technologies differ from the Sanger method in their massively parallel analysis, high throughput, and reduced cost. Although NGS makes genome sequences readily available, the subsequent data analysis and biological interpretation are still the bottleneck in understanding genomes.

Following the human genome project, the 454 system was launched by 454 in 2005, and Solexa released the Genome Analyzer the next year, followed by SOLiD (Sequencing by Oligo Ligation Detection) from Agencourt. These are the three most typical massively parallel sequencing systems in next-generation sequencing (NGS), and they share good performance in throughput, accuracy, and cost compared with Sanger sequencing (shown in Table 1(a)). These founder companies were then purchased by other companies: in 2006 Agencourt was purchased by Applied Biosystems, and in 2007, 454 was purchased by Roche, while Solexa was purchased by Illumina. After years of evolution, these three systems exhibit better performance and their own advantages in terms of read length, accuracy, applications, consumables, manpower requirements, informatics infrastructure, and so forth. The comparison of these three systems is discussed in the later part of this paper (also see Tables 1(a), 1(b), and 1(c)).

2. Roche 454 System

Roche 454 was the first commercially successful next-generation system. This sequencer uses pyrosequencing technology [5]. Instead of using dideoxynucleotides to terminate chain elongation, pyrosequencing relies on the detection of pyrophosphate released during nucleotide incorporation. The library DNAs with 454-specific adaptors are denatured into single strands and captured by amplification beads, followed by emulsion PCR [6]. Then, on a picotiter plate, one of the dNTPs (dATP, dGTP, dCTP, dTTP) is added; if it complements the next base or bases of the template strand, it is incorporated by DNA polymerase and releases pyrophosphate (PPi) in an amount equal to the amount of incorporated nucleotide. ATP sulfurylase converts the PPi into ATP in the presence of adenosine 5′ phosphosulfate (APS), and this ATP drives the luciferase-mediated conversion of luciferin into oxyluciferin, which generates visible light [7]. At the same time, the unincorporated nucleotides are degraded by apyrase [8]. Then another dNTP is added into the reaction system and the pyrosequencing reaction is repeated.
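The flow-by-flow logic of pyrosequencing can be illustrated with a minimal Python sketch. It assumes an idealised, noise-free reaction in which the light signal is exactly proportional to the number of bases incorporated in a flow. The function name `flowgram`, the flow order "TACG", and the toy sequence are assumptions made for illustration and are not taken from the Roche software.

```python
def flowgram(read_seq, flow_order="TACG", n_flows=8):
    """Simulate an idealised pyrosequencing run.

    read_seq   : the strand being synthesised (hypothetical toy input)
    flow_order : order in which single dNTPs are flowed (an assumption)
    Returns a list of (nucleotide, signal) pairs, where signal is the
    number of bases incorporated during that flow.
    """
    pos = 0
    flows = []
    for i in range(n_flows):
        nt = flow_order[i % len(flow_order)]
        run = 0
        # Every consecutive matching base is incorporated in the same flow,
        # so a homopolymer gives one proportionally larger signal.
        while pos < len(read_seq) and read_seq[pos] == nt:
            run += 1
            pos += 1
        flows.append((nt, run))
    return flows


if __name__ == "__main__":
    for nt, signal in flowgram("TTAGC"):
        print(nt, signal)
```

In this sketch a run of identical bases shows up as a single larger signal rather than as separate events, which is why long homopolymers are hard to count precisely on real, noisy instruments.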

The read length of Roche 454 was initially 100–150 bp in 2005, with 200,000+ reads and an output of 20 Mb per run [9, 10]. In 2008 the 454 GS FLX Titanium system was launched as an upgrade; its read length could reach 700 bp with an accuracy of 99.9% after filtering, and it could output 0.7 G of data per run within 24 hours. In late 2009 Roche added the GS Junior, a bench-top system, to the 454 sequencing line, which simplified library preparation and data processing, and output was also upgraded to 14 G per run [11, 12]. The most outstanding advantage of Roche is its speed: it takes only 10 hours from sequencing start to completion. The read length is also a distinguishing characteristic compared with other NGS systems (described in the later part of this paper). But the high cost of reagents remains a challenge for Roche 454. It is about $

per base (counting reagent use only). One of its shortcomings is a relatively high error rate in homopolymers (poly-bases) longer than 6 bp. But its library construction can be automated, and the emulsion PCR can be semiautomated, which could reduce the manpower required to a great extent. Other informatics infrastructure and sequencing advantages are listed and compared with the HiSeq 2000 and SOLiD systems in Tables 1(a), 1(b), and 1(c).

2.1. 454 GS FLX Titanium Software

GS RunProcessor is the main part of the GS FLX Titanium system. The software is in charge of image background normalization, signal location correction, cross-talk correction, signal conversion, and sequencing data generation. After each run, GS RunProcessor produces a series of files including SFF (standard flowgram format) files. SFF files contain the basecalled sequences and corresponding quality scores for all individual, high-quality reads (filtered reads), and they can be viewed directly on the screen of the GS FLX Titanium system. Using the GS De Novo Assembler, GS Reference Mapper, and GS Amplicon Variant Analyzer provided with the GS FLX Titanium system, SFF files can be used in many kinds of analysis and converted into fastq format for further data analysis.
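The quality scores mentioned above are conventionally expressed on the Phred scale, where Q = -10 log10(P) and P is the probability that a base call is wrong. The following Python sketch shows that relationship and how a fastq quality string can be decoded. The function names are hypothetical, and the ASCII offset of 33 is an assumption (it is the common Sanger-style encoding, but some older pipelines used 64).

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def accuracy_to_phred(accuracy):
    """Convert a per-base accuracy (e.g. 0.999) to a Phred score."""
    return -10 * math.log10(1 - accuracy)


def decode_fastq_quality(qual, offset=33):
    """Decode a fastq quality string into integer Phred scores.

    offset=33 is an assumption; check the encoding of your own files.
    """
    return [ord(ch) - offset for ch in qual]


if __name__ == "__main__":
    print(round(accuracy_to_phred(0.999)))   # accuracy of 99.9% on the Phred scale
    print(decode_fastq_quality("I5!"))
```

Under this convention, an accuracy of 99.9% corresponds to a Phred score of about 30, and 99.99% to about 40.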

3. AB SOLiD System

SOLiD (Sequencing by Oligo Ligation Detection) was purchased by Applied Biosystems in 2006. The sequencer adopts two-base sequencing based on ligation. On a SOLiD flowcell, the libraries are sequenced by ligation of 8-base probes, each of which contains a ligation site (the first base), a cleavage site (the fifth base), and one of 4 different fluorescent dyes (linked to the last base) [10]. The fluorescent signal is recorded when a probe is complementary to the template strand and is removed by cleavage of the probe’s last 3 bases. The sequence of the fragment can be deduced after 5 rounds of sequencing using ladder primer sets.

The read length of SOLiD was initially 35 bp and the output was 3 G of data per run. Owing to the two-base sequencing method, SOLiD could reach a high accuracy of 99.85% after filtering. At the end of 2007, ABI released the first SOLiD system. In late 2010, the SOLiD 5500xl sequencing system was released. From SOLiD to SOLiD 5500xl, five upgrades were released by ABI in just three years. The SOLiD 5500xl achieved improved read length, accuracy, and data output of 85 bp, 99.99%, and 30 G per run, respectively. A complete run could be finished within 7 days. The sequencing cost is about

per base estimated from reagent use only by BGI users. But the short read length and the restriction to resequencing applications are still its major shortcomings [13]. Applications of SOLiD include whole genome resequencing, targeted resequencing, transcriptome research (including gene expression profiling, small RNA analysis, and whole transcriptome analysis), and epigenome studies (like ChIP-Seq and methylation). Like other NGS systems, SOLiD’s computational infrastructure is expensive and not trivial to use; it requires an air-conditioned data center, a computing cluster, personnel skilled in computing, a distributed memory cluster, fast networks, and a batch queue system. The operating system used by most researchers is GNU/Linux. Each SOLiD sequencer run takes 7 days and generates around 4 TB of raw data. More data will be generated after bioinformatics analysis. This information is listed and compared with other NGS systems in Tables 1(a), 1(b), and 1(c). Automation can be used in library preparation, for example, Tecan system which integrated a Covaris A and Roche 454 REM e system [14].

3.1. SOLiD Software

After sequencing with SOLiD, the original color-coded sequence is accumulated. According to the double-base coding matrix, the original color sequence can be decoded into the base sequence if the base at any one position in the sequence is known. Because each color corresponds to four possible base pairs, the decoding of each base directly influences the decoding of the base that follows it. This means that one wrong color call causes a chain of decoding mistakes. BioScope is the SOLiD data analysis package, which provides a validated, single framework for resequencing, ChIP-Seq, and whole transcriptome analysis. It depends on a reference for the follow-up data analysis. First, the software converts the base sequence of the reference into a color-coded sequence. Second, the color-coded reference sequence is compared with the original color-coded reads to obtain mapping information, using the newly developed mapping algorithm MaxMapper.
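The chain of decoding mistakes can be seen in a small Python sketch of two-base (color-space) coding. It assumes the commonly described four-color scheme in which each color equals the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). The function names and the toy read are hypothetical, and this is not the BioScope implementation.

```python
BASES = "ACGT"  # index of each base is its 2-bit code (assumed scheme)


def encode(seq):
    """Return (first_base, colors) for a base sequence."""
    colors = [BASES.index(a) ^ BASES.index(b) for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode(first_base, colors):
    """Decode a color sequence, given one known base (here the first)."""
    out = [first_base]
    for color in colors:
        # Each base is derived from the previously decoded base,
        # so one wrong color corrupts every base after it.
        out.append(BASES[BASES.index(out[-1]) ^ color])
    return "".join(out)


if __name__ == "__main__":
    first, colors = encode("ATGGC")
    print(colors, decode(first, colors))
    bad = list(colors)
    bad[1] = 2                      # simulate a single miscalled color
    print(bad, decode(first, bad))  # all bases after the error change
```

This dependence is one reason the software described here aligns reads to the reference in color space first, instead of decoding every read into bases up front.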

4. Illumina GA/HiSeq System

In 2006, Solexa released the Genome Analyzer (GA), and in 2007 the company was purchased by Illumina. The sequencer adopts the technology of sequencing by synthesis (SBS). The library with fixed adaptors is denatured to single strands and grafted to the flowcell, followed by bridge amplification to form clusters which contain clonal DNA fragments. Before sequencing, the clusters are converted into single strands with the help of a linearization enzyme [10]. Then four kinds of reversible-terminator nucleotides (analogues of dATP, dGTP, dCTP, and dTTP), each carrying a different cleavable fluorescent dye and a removable blocking group, complement the template one base at a time, and the signal is captured by a charge-coupled device (CCD).

At first, the Solexa GA output was 1 G/run. Through improvements in polymerase, buffer, flowcell, and software, in 2009 the output of GA increased to 20 G/run in August (75PE), 30 G/run in October (100PE), and 50 G/run in December (Truseq V3, 150PE), and the latest GAIIx series can attain 85 G/run. In early 2010, Illumina launched HiSeq 2000, which adopts the same sequencing strategy as GA, and BGI was among the first in the world to adopt the HiSeq system. Its output was initially 200 G per run and has since improved to 600 G per run, which can be finished in 8 days. In the foreseeable future, it could reach 1 T/run, at which point the cost of a personal genome could drop below $1 K. The error rate of 100PE could be below 2% on average after filtering (BGI’s data). Compared with 454 and SOLiD, HiSeq 2000 is the cheapest in sequencing, at $0.02/million bases (reagent cost only, counted by BGI). With multiplexing incorporated in P5/P7 primers and adapters, it could handle thousands of samples simultaneously. HiSeq 2000 needs HiSeq control software (HCS) for program control, real-time analyzer software (RTA) for on-instrument base-calling, and CASAVA for secondary analysis. There is a 3 TB hard disk in HiSeq 2000. With the aid of Truseq v3 reagents and associated software, HiSeq 2000 has improved considerably in high-GC sequencing. MiSeq, a bench-top sequencer launched in 2011 which shares most technologies with HiSeq, is especially convenient for amplicon and bacterial sample sequencing. It can sequence 150PE and generate 1.5 G/run in about 10 hrs including sample and library preparation time. Library preparation and concentration measurement can both be automated with compatible systems like Agilent Bravo, Hamilton Banadu, Tecan, and Apricot Designs.

4.1. HiSeq Software

HiSeq control system (HCS) and real-time analyzer (RTA) are adopted by HiSeq 2000. These two software packages calculate the number and position of clusters based on their first 20 bases, so the first 20 bases of each run determine that run’s output and quality. HiSeq 2000 uses two lasers and four filters to detect the four types of nucleotide (A, T, G, and C). The emission spectra of these four kinds of nucleotides have cross-talk, so the images of the four nucleotides are not independent and the distribution of bases affects the quality of sequencing. The standard sequencing output files of the HiSeq 2000 consist of *bcl files, which contain the base calls and quality scores in each cycle. These are then converted into *_qseq.txt files by BCL Converter. The ELAND program of CASAVA (offline software provided by Illumina) is used to match a large number of reads against a genome.

In conclusion, of the three NGS systems described above, the Illumina HiSeq 2000 features the biggest output and lowest reagent cost, the SOLiD system has the highest accuracy [11], and the Roche 454 system has the longest read length. Details of the three sequencing systems are listed in Tables 1(a), 1(b), and 1(c).

5. Compact PGM Sequencers

Ion Personal Genome Machine (PGM) and MiSeq were launched by Ion Torrent and Illumina. They are both small in size and feature fast turnover rates but limited data throughput. They are targeted to clinical applications and small labs.

5.1. Ion PGM from Ion Torrent

Ion PGM was released by Ion Torrent at the end of 2010. PGM uses semiconductor sequencing technology. When a nucleotide is incorporated into the DNA molecule by the polymerase, a proton is released. By detecting the change in pH, PGM recognizes whether the nucleotide has been added or not. The chip is flooded with one nucleotide after another: if it is not the correct nucleotide, no voltage is detected; if 2 nucleotides are added, a doubled voltage is detected [15]. PGM is the first commercial sequencing machine that does not require fluorescence and camera scanning, resulting in higher speed, lower cost, and smaller instrument size. Currently, it enables 200 bp reads in 2 hours, and the sample preparation time is less than 6 hours for 8 samples in parallel.
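The relationship between voltage and the number of incorporated nucleotides can be sketched as a simple base caller in Python. This is a minimal illustration that assumes signals have already been normalised so that one incorporation gives a value near 1.0. The function name, the flow order "TACG", and the example signals are hypothetical and do not reflect Ion Torrent’s actual software.

```python
def call_from_flows(signals, flow_order="TACG"):
    """Turn normalised per-flow signals into a base sequence.

    signals    : one value per nucleotide flow; about 0 means no
                 incorporation, about 1 one base, about 2 two bases
    flow_order : order in which nucleotides flood the chip (an assumption)
    """
    bases = []
    for i, signal in enumerate(signals):
        nt = flow_order[i % len(flow_order)]
        n = max(0, round(signal))  # nearest whole number of incorporations
        bases.append(nt * n)
    return "".join(bases)


if __name__ == "__main__":
    print(call_from_flows([1.02, 0.05, 1.96, 0.10, 0.00, 1.10]))
```

Because each extra base adds a proportionally smaller relative change to the signal, rounding becomes less reliable for long runs of the same nucleotide in a scheme like this.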

An exemplary application of the Ion Torrent PGM sequencer is the identification of microbial pathogens. In May and June of 2011, there was an outbreak of exceptionally virulent Shiga-toxin- (Stx) producing Escherichia coli O104:H4 centered in Germany [16, 17], in which more than 3000 people were infected. Whole genome sequencing on the Ion Torrent PGM sequencer and HiSeq 2000 helped scientists identify the type of E. coli, which provided direct clues about its antibiotic resistance. The strain appeared to be a hybrid of two E. coli strains (enteroaggregative E. coli and enterohemorrhagic E. coli), which may help explain why it has been particularly pathogenic. The sequencing result of E. coli TY2482 [18] shows the potential of PGM as a fast, though limited-throughput, sequencer when there is an outbreak of a new disease.

In order to study the sequencing quality, mapping rate, and GC depth distribution of Ion Torrent and compare them with HiSeq 2000, a Rhodobacter sample with high GC content (66%) and a 4.2 Mb genome was sequenced on these two sequencers (Table 2). In another experiment, E. coli K12 DH10B (NC_010473.1) with GC 50.78% was sequenced by Ion Torrent for analysis of quality value, read length, position accuracies, and GC distribution (Figure 1).
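GC content, as used above, is the fraction of G and C bases among the called bases of a sequence. A minimal Python sketch follows; the function names, the window size, and the choice to ignore non-ACGT characters such as N are assumptions made for illustration.

```python
def gc_content(seq):
    """Fraction of G/C among A, C, G, T bases; other symbols are ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def windowed_gc(seq, window=100, step=100):
    """GC content in consecutive windows, e.g. for a GC-versus-depth plot."""
    return [
        gc_content(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    print(gc_content("GGCCAT"))
    print(windowed_gc("GGGGAAAA", window=4, step=4))
```

Pairing each window’s GC value with its mean sequencing depth is one simple way to build the kind of GC depth distribution that is compared between sequencers.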

Figure 1: Ion Torrent sequencing quality. E. coli K12 DH10B (NC_010473.1) with GC 50.78% was used for this experiment. (a) is 314–200 bp from Ion Torrent. The left figure is quality value: the pink range represents the minimum and maximum quality values at each position, the green area represents the top and bottom quarter (1/4) of reads by quality, and the red line represents the average quality value at each position. The right figure is read length analysis: the colored histogram represents the real read length, and the black line represents the mapped length; because 3′ soft clipping is allowed, the mapped length differs from the real read length. (b) is accuracy analysis. At each position, accuracy type including mismatch, insertion, and deletion is shown on the left y-axis. The average accuracy is shown on the right y-axis.

5.1.1. Sequencing Quality

The quality of Ion Torrent is more stable, while the quality of HiSeq 2000 decreases noticeably after 50 cycles, which may be caused by the decay of the fluorescent signal with increasing read length (shown in Figure 1).

5.1.2. Mapping

The insert size of the Rhodobacter library was 350 bp, and 0.5 Gb of data was obtained from HiSeq. The sequencing depth was over 100x, and the contig and scaffold N50 were 39530 bp and 194344 bp, respectively. Based on the assembly result, we used 33 Mb of data obtained from Ion Torrent with a 314 chip to analyze the map rate. The alignment comparison is shown in Table 2.
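Two of the figures quoted above, sequencing depth and N50, follow from simple definitions. The Python sketch below uses the usual conventions: mean depth is total sequenced bases divided by genome size, and N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The function names and the toy contig lengths are hypothetical.

```python
def mean_depth(total_bases, genome_size):
    """Average sequencing depth (fold coverage)."""
    return total_bases / genome_size


def n50(lengths):
    """N50 of a list of contig or scaffold lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


if __name__ == "__main__":
    # 0.5 Gb of HiSeq data on a 4.2 Mb genome, as stated in the text
    print(round(mean_depth(0.5e9, 4.2e6)))
    print(n50([2, 3, 4, 5, 6]))  # toy contig lengths
```

With 0.5 Gb of data and a 4.2 Mb genome, the depth works out to roughly 119x, consistent with the statement that the depth was over 100x.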

The map rate of Ion Torrent is higher than that of HiSeq 2000, but the two are not directly comparable because different alignment methods are used for the different sequencers. Apart from the significant differences in mismatch rate, insertion rate, and deletion rate, HiSeq 2000 and Ion Torrent are also not directly comparable because of their different sequencing principles. For example, polynucleotide sites cannot be identified easily with Ion Torrent. Nevertheless, Ion Torrent shows stable quality along sequencing reads and good performance on mismatch accuracy, but a bias in the detection of indels. Different types of accuracy are analyzed and shown in Figure 1.

5.1.3. GC Depth Distribution

According to Figure 1, the GC depth distribution is better in Ion Torrent. In Ion Torrent, the sequencing depth is similar across GC contents from 63% to 73%. In HiSeq 2000, however, the average sequencing depth is 4x when the GC content is 60%, while it is 3x at 70% GC content.

Ion Torrent has already released Ion 314 and 316 and planned to launch Ion 318 chips in late 2011. The chips are different in the number of wells resulting in higher production within the same sequencing time. The Ion 318 chip enables the production of >1 Gb data in 2 hours. Read length is expected to increase to >400 bp in 2012.

5.2. MiSeq from Illumina

MiSeq, which still uses SBS technology, was launched by Illumina. It integrates the functions of cluster generation, SBS, and data analysis in a single instrument and can go from sample to answer (analyzed data) within a single day (in as few as 8 hours). Nextera, TruSeq, and Illumina’s reversible terminator-based sequencing by synthesis chemistry are used in this instrument. High data integrity and a broad range of applications, including amplicon sequencing, clone checking, ChIP-Seq, and small genome sequencing, are the outstanding features of MiSeq. It is also flexible, performing anything from single 36 bp reads (120 MB output) up to 2 × 150 paired-end reads (1–1.5 GB output). Due to its significant improvement in read length, the resulting data perform better in contig assembly compared with HiSeq (data not shown). The related sequencing result of MiSeq is shown in Table 3. We also compared PGM with MiSeq in Table 4.

5.3. Complete Genomics

Complete Genomics has its own sequencer based on the Polonator G.007, which is a ligation-based sequencer. The owner of the Polonator G.007, Dover, collaborated with the Church Laboratory of Harvard Medical School, the same team behind the SOLiD system, and introduced this cheap open system. The Polonator combines a high-performance instrument at a very low price with freely downloadable, open-source software and protocols. The Polonator G.007 uses ligation detection sequencing, which decodes the base with a single-base probe in nonanucleotides (nonamers), not by dual-base coding [19]. The fluorophore-tagged degenerate nonamers are selectively ligated onto a series of anchor primers with the help of T4 DNA ligase; the four components of the nonamer mixture are each labeled with one of four fluorophores, which correspond to the base type at the query position. During ligation, T4 DNA ligase is particularly sensitive to mismatches on the 3′ side of the gap, which helps improve sequencing accuracy. After imaging, the Polonator chemically strips the annealed primer-fluorescent probe complex from the array; the anchor primer is replaced, and a new mixture of fluorescently tagged nonamers is introduced to sequence the adjacent base [20]. There are two updates compared with the Polonator G.007: DNA nanoball (DNB) arrays and combinatorial probe-anchor ligation (cPAL). Compared with DNA clusters or microspheres, DNA nanoball arrays achieve a higher density of DNA on the surface of a silicon chip. Because the seven 5-base segments are discontinuous, the hybridization-ligation-detection cycle has a higher fault tolerance than SOLiD. Complete Genomics claims to have 99.999% accuracy at 40x depth and can analyze SNPs, indels, and CNVs at a price of $5,500–$9,500. But Illumina reported better performance for HiSeq 2000 using only 30x data (Illumina Genome Network). Recently some researchers compared CG’s human genome sequencing data with the Illumina system [21], and there are notable differences in detecting SNVs, indels, and system-specific detections in variants.

5.4. The Third Generation Sequencer

Alongside the increasing usage and ongoing modification of next-generation sequencing, third-generation sequencing is emerging with new insights into sequencing. Third-generation sequencing has two main characteristics. First, PCR is not needed before sequencing, which shortens DNA preparation time. Second, the signal is captured in real time, which means that the signal, whether it is fluorescent (PacBio) or an electric current (Nanopore), is monitored during the enzymatic reaction that adds nucleotides to the complementary strand.

Single-molecule real-time (SMRT) sequencing is the third-generation sequencing method developed by Pacific Biosciences (Menlo Park, CA, USA), which makes use of a modified enzyme and direct observation of the enzymatic reaction in real time. A SMRT cell consists of millions of zero-mode waveguides (ZMWs), each embedded with only one set of enzyme and DNA template that can be detected during the whole process. During the reaction, the enzyme incorporates the nucleotide into the complementary strand and cleaves off the fluorescent dye previously linked to the nucleotide. The camera inside the machine then captures the signal in a movie format in real time [19]. This provides not only the fluorescent signal but also the signal difference over time, which may be useful for the prediction of structural variance in the sequence and is especially useful in epigenetic studies such as DNA methylation [22].

Compared with second-generation sequencers, PacBio RS (the first sequencer launched by PacBio) has several advantages. First, the sample preparation is very fast; it takes 4 to 6 hours instead of days. It also does not need a PCR step during preparation, which reduces the bias and error caused by PCR. Second, the turnover rate is quite fast; runs are finished within a day. Third, the average read length is 1300 bp, which is longer than that of any second-generation sequencing technology. Although the throughput of the PacBio RS is lower than that of second-generation sequencers, this technology is quite useful for clinical laboratories, especially for microbiology research. A paper has been published using PacBio RS on the Haitian cholera outbreak [19].

We have run a de novo assembly of a DNA fosmid sample from oyster with PacBio RS in standard sequencing mode (using LPR chemistry and SMRTcells instead of the new version FCR chemistry and SMRTcells). A SMRTbell template with a mean insert size of 7500 kb was made and run in one SMRT cell, and a 120-minute movie was taken. After the post-QC filter, 22,373,400 bp in 6754 reads (average 2,566 bp) were sequenced with an average Read Score of 0.819. The coverage is 324x with a mean read score of 0.861 and high accuracy (99.95). The result is exhibited in Figure 2.

Sequencing of a fosmid DNA using Pacific Biosciences sequencer. With coverage, the accuracy could be above 97%. The figure was constructed by BGI’s own data.

Nanopore sequencing is another third-generation sequencing method. A nanopore is a tiny biopore with a nanoscale diameter [23]; such pores are found in protein channels embedded in lipid bilayers, where they facilitate ion exchange. Because of this biological role, any particle moving through the pore can disrupt the voltage across the channel. The core concept of nanopore sequencing involves threading a single-stranded DNA through an α-haemolysin (αHL) pore. αHL, a 33 kD protein isolated from Staphylococcus aureus [20], undergoes self-assembly to form a heptameric transmembrane channel [23]. It can tolerate an extraordinary voltage of up to 100 mV with a current of 100 pA [20]. This unique property supports its role as the building block of the nanopore. In nanopore sequencing, an ionic flow is applied continuously, and current disruption is detected by standard electrophysiological techniques. Readout relies on the size differences between the deoxyribonucleoside monophosphates (dNMPs). Thus, each dNMP shows a characteristic current modulation that allows it to be discriminated. The ionic current resumes after the trapped nucleotide has passed entirely through the pore.

Nanopore sequencing possesses a number of advantages over existing commercialized next-generation sequencing technologies. Firstly, it can potentially reach long read lengths of >5 kbp at a speed of 1 bp/ns [19]. Secondly, detection of bases is free of fluorescent tags. Thirdly, except for the use of exonuclease for holding the ssDNA and cleaving nucleotides [24], the involvement of enzymes is largely avoided in nanopore sequencing [22]. This implies that nanopore sequencing is less sensitive to temperature throughout the sequencing reaction and that a reliable outcome can be maintained. Fourthly, instead of sequencing DNA during polymerization, single DNA strands are sequenced through the nanopore by means of DNA strand depolymerization. Hence, hands-on time for sample preparation, such as cloning and amplification steps, can be shortened significantly.

6. Discussion of NGS Applications

Fast progress in DNA sequencing technology has brought a substantial reduction in costs and a substantial increase in throughput and accuracy. With more and more organisms being sequenced, a flood of genetic data is inundating the world every day. Progress in genomics has been moving steadily forward due to a revolution in sequencing technology. Additionally, other types of large-scale studies in exomics, metagenomics, epigenomics, and transcriptomics have all become reality. Not only do these studies provide knowledge for basic research, but they also afford immediate application benefits. Scientists across many fields are utilizing these data to develop better-thriving crops, higher crop yields, and better livestock, as well as improved diagnostics, prognostics, and therapies for cancer and other complex diseases.

BGI is on the cutting edge of translating genomics research into molecular breeding and disease association studies, with the belief that agriculture, medicine, drug development, and clinical treatment will eventually enter a new stage based on a more detailed understanding of the genetic components of all organisms. BGI is primarily focused on three projects. (1) The Million Species/Varieties Genomes Project aims to sequence a million economically and scientifically important plants, animals, and model organisms, including different breeds, varieties, and strains. This project is best represented by our sequencing of the genomes of the Giant panda, potato, macaca, and others, along with multiple resequencing projects. (2) The Million Human Genomes Project focuses on large-scale population and association studies that use whole-genome or whole-exome sequencing strategies. (3) The Million Eco-System Genomes Project has the objective of sequencing the metagenome and cultured microbiome of several different environments, including microenvironments within the human body [25]. Together they are called the 3 M project.

In the following part, each of these applications, including de novo sequencing, mate-pair, whole genome or target-region resequencing, small RNA, transcriptome, RNA-seq, epigenomics, and metagenomics, is briefly summarized.

In DNA de novo sequencing, the library with insert size below 800 bp is defined as DNA short fragment library, and it is usually applied in de novo and resequencing research. Skovgaard et al. [26] have applied a combination method of WGS (whole-genome sequencing) and genome copy number analysis to identify the mutations which could suppress the growth deficiency imposed by excessive initiations from the E. coli origin of replication, oriC.

Mate-pair library sequencing is significantly beneficial for de novo sequencing, because the method can reduce gap regions and extend scaffold length. Reinhardt et al. [27] developed a novel method for de novo genome assembly by analyzing sequencing data from high-throughput short read sequencing technology. They assembled genomes into large scaffolds at a fraction of the traditional cost and without using a reference sequence. The assembly of one sample yielded an N50 scaffold size of 531,821 bp with >75% of the predicted genome covered by scaffolds over 100,000 bp.

Whole genome resequencing determines the complete DNA sequence of an organism’s genome, including all chromosomal DNA, at a single time and aligns it with the reference sequence. Mills et al. [28] constructed a map of unbalanced SVs (genomic structural variants) based on whole genome DNA sequencing data from 185 human genomes with the SOLiD platform; the map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications [28]. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysis of their origin and functional impact [28].

Whole genome resequencing is an effective way to study functional genes, but the high cost and massive data volume are the main problems for most researchers. Target region sequencing is one solution. Microarray capture is a popular method of target region sequencing, which uses hybridization to arrays containing synthetic oligonucleotides matching the target DNA sequence. Gnirke et al. [29] developed a capture method that uses RNA “baits” to capture target DNA fragments from the “pond” and then uses the Illumina platform to read out the sequence. About 90% of uniquely aligning bases fell on or near bait sequence; up to 50% lay on exons proper [29].

Fehniger et al. used two platforms, Illumina GA and ABI SOLiD, to define the miRNA transcriptomes of resting and cytokine-activated primary murine NK (natural killer) cells [30]. From small RNA libraries of NK cells, a dedicated bioinformatics pipeline identified 302 known and 21 novel mature miRNAs. These miRNAs are expressed over a broad range and exhibit isomiR complexity, and a subset is differentially expressed following cytokine activation. These observations guided the identification of miRNAs with the Illumina GA and SOLiD instruments [30].

The transcriptome is the set of all RNA molecules, including mRNA, rRNA, tRNA, and other noncoding RNA, produced in one cell or a population of cells. In recent years, next-generation sequencing has been used to study the transcriptome, where DNA microarrays were used in the past. Adamidi et al. [31] designed an efficient sequencing strategy for the S. mediterranea transcriptome. The catalog of assembled transcripts and identified peptides in that study dramatically expands and refines planarian gene annotation, as demonstrated by validation of several previously unknown transcripts with stem cell-dependent expression patterns.

RNA-seq is a newer RNA sequencing method for studying mRNA expression. Its sample preparation is similar to that of transcriptome sequencing, except for the enzyme used. To estimate technical variance, Marioni et al. [32] analyzed kidney RNA samples on both the Illumina platform and Affymetrix arrays. The Illumina platform supported additional analyses, such as detection of low-expressed genes, alternative splice variants, and novel transcripts. Bradford et al. [33] compared RNA-seq library data from the SOLiD platform with Affymetrix Exon 1.0ST arrays and found a high degree of correspondence between the two platforms in terms of exon-level fold changes and detection. The greatest detection correspondence was seen when the background error rate in RNA-seq was extremely low. The difference between RNA-seq and transcriptome sequencing is less obvious on SOLiD than on Illumina.

There are two main epigenetic applications: chromatin immunoprecipitation and methylation analysis. Chromatin immunoprecipitation (ChIP) is an immunoprecipitation technique used to study the interaction between protein and DNA in a cell; histone modifications can be mapped to specific locations in the genome. Based on next-generation sequencing technology, Johnson et al. [34] developed a large-scale chromatin immunoprecipitation assay to identify binding motifs, especially noncanonical NRSF-binding motifs. The data display sharp resolution of binding position (±50 bp), which is important for inferring new candidate interactions, together with high sensitivity and specificity (ROC (receiver operator characteristic) area ≥0.96) and statistical confidence (P < 10⁻⁴).

Another important epigenetic application is DNA methylation analysis. In vertebrates, DNA methylation typically occurs at CpG sites, where methylation converts cytosine to 5-methylcytosine. Chung presented whole-methylome sequencing on the SOLiD platform to study the difference between two bisulfite conversion methods (in solution versus in gel) [35].

World-class genome projects include the 1000 Genomes Project, the human ENCODE project, and the Human Microbiome Project (HMP), to name a few. BGI takes an active role in these and many other ongoing projects, such as the 1000 Animal and Plant Genome project, the MetaHIT project, the Yanhuang project, LUCAMP (Diabetes-associated Genes and Variations Study), ICGC (international cancer genome project), the Ancient human genome, the 1000 Mendelian Disorders Project, the Genome 10K Project, and so forth [25]. These internationally collaborative genome projects have greatly advanced genomics research and its applications in healthcare and other fields.

Managing multiple projects, including large and complex ones with up to tens of thousands of samples, requires a sophisticated project management system that handles information processing from sample labeling and storage through library construction, multiplexing, sequencing, and informatics analysis. Research-oriented bioinformatics analysis and follow-up experiments are not included. Although automation has greatly reduced human intervention in experiments, all procedures still carried out manually have to be managed. BGI has developed the BMS system and a Cloud service for efficient information exchange and project management. Behavior management mainly follows the Japanese 5S on-site model. Additionally, BGI has passed ISO9001 and the CSPro (authorized by Illumina) QC system and is currently undergoing CLIA (Clinical Laboratory Improvement Amendments) and ASHI (American Society for Histocompatibility and Immunogenetics) tests. A quick, standard, and open feedback system guarantees an efficient troubleshooting pathway and high performance. One example is an instrument design failure of the Truseq v3 flowcell, which resulted in bubbles (defined as the “bottom-middle-swatch” phenomenon by Illumina) and random N calls in reads. This potentially compromises sequencing quality, GC composition, and throughput. It affects not only the small area where the bubble sits, resulting in N calls, but also the focus of nearby areas, including the whole swatch and the adjacent swatch. Filtering parameters have to be determined to ensure quality raw data for bioinformatics processing. Led by the NGS tech group, joint meetings were called to analyze and troubleshoot this problem, to discuss strategies that minimize its effect on cost and project time, to establish communication channels, and to summarize compensation statistically, in order to provide the best project management strategy at the time. Some reagent QC examples are summarized in Liu et al. [36].

BGI is establishing its cloud services. Combined with a choice of advanced NGS technologies, a plug-and-run informatics service is handy and affordable. A series of software tools is available, including BLAST, SOAP, and SOAPsnp for sequence alignment, and pipelines for RNA-seq data. SNP calling programs such as Hecate and Gaea are also about to be released. Big-data studies from the whole spectrum of life and biomedical sciences can now be shared and published in a new journal, GigaScience, cofounded by BGI and BioMed Central. It has a novel publication format: each piece of data links to a standard manuscript publication with an extensive database that hosts all associated data, data analysis tools, and cloud-computing resources. The scope covers not just omic-type data and the fields of high-throughput biology currently served by large public repositories but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology, and other new types of large-scale sharable data.


  1. G. M. Church and W. Gilbert, “Genomic sequencing,” Proceedings of the National Academy of Sciences of the United States of America, vol. 81, no. 7, pp. 1991–1995, 1984.
  2. F. S. Collins, M. Morgan, and A. Patrinos, “The Human Genome Project: lessons from large-scale biology,” Science, vol. 300, no. 5617, pp. 286–290, 2003.
  3. J. Berka, Y. J. Chen, J. H. Leamon et al., “Bead emulsion nucleic acid amplification,” U.S. Patent Application, 2005.
  4. T. Foehlich et al., “High-throughput nucleic acid analysis,” U.S. Patent, 2010.
  5. E. R. Mardis, “The impact of next-generation sequencing technology on genetics,” Trends in Genetics, vol. 24, no. 3, pp. 133–141, 2008.
  6. S. M. Huse, J. A. Huber, H. G. Morrison, M. L. Sogin, and D. M. Welch, “Accuracy and quality of massively parallel DNA pyrosequencing,” Genome Biology, vol. 8, no. 7, article R143, 2007.
  7. “The new GS junior sequencer,”
  8. “SOLiD system accuracy,”
  9. B. A. Flusberg, D. R. Webster, J. H. Lee et al., “Direct detection of DNA methylation during single-molecule, real-time sequencing,” Nature Methods, vol. 7, no. 6, pp. 461–465, 2010.
  10. A. Mellmann, D. Harmsen, C. A. Cummings et al., “Prospective genomic characterization of the German enterohemorrhagic Escherichia coli O104:H4 outbreak by rapid next generation sequencing technology,” PLoS ONE, vol. 6, no. 7, Article ID e22751, 2011.
  11. H. Rohde, J. Qin, Y. Cui et al., “Open-source genomic analysis of Shiga-toxin-producing E. coli O104:H4,” New England Journal of Medicine, vol. 365, no. 8, pp. 718–724, 2011.
  12. C. S. Chin, J. Sorenson, J. B. Harris et al., “The origin of the Haitian cholera outbreak strain,” New England Journal of Medicine, vol. 364, no. 1, pp. 33–42, 2011.
  13. W. Timp, U. M. Mirsaidov, D. Wang, J. Comer, A. Aksimentiev, and G. Timp, “Nanopore sequencing: electrical measurements of the code of life,” IEEE Transactions on Nanotechnology, vol. 9, no. 3, pp. 281–294, 2010.
  14. D. W. Deamer and M. Akeson, “Nanopores and nucleic acids: prospects for ultrarapid sequencing,” Trends in Biotechnology, vol. 18, no. 4, pp. 147–151, 2000.
  15. “Performance comparison of whole-genome sequencing systems,” Nature Biotechnology, vol. 30, pp. 78–82, 2012.
  16. D. Branton, D. W. Deamer, A. Marziali et al., “The potential and challenges of nanopore sequencing,” Nature Biotechnology, vol. 26, no. 10, pp. 1146–1153, 2008.
  17. L. Song, M. R. Hobaugh, C. Shustak, S. Cheley, H. Bayley, and J. E. Gouaux, “Structure of staphylococcal α-hemolysin, a heptameric transmembrane pore,” Science, vol. 274, no. 5294, pp. 1859–1866, 1996.
  18. J. Clarke, H. C. Wu, L. Jayasinghe, A. Patel, S. Reid, and H. Bayley, “Continuous base identification for single-molecule nanopore DNA sequencing,” Nature Nanotechnology, vol. 4, no. 4, pp. 265–270, 2009.
  19. Website of BGI,
  20. O. Skovgaard, M. Bak, A. Løbner-Olesen et al., “Genome-wide detection of chromosomal rearrangements, indels, and mutations in circular chromosomes by short read sequencing,” Genome Research, vol. 21, no. 8, pp. 1388–1393, 2011.
  21. J. A. Reinhardt, D. A. Baltrus, M. T. Nishimura, W. R. Jeck, C. D. Jones, and J. L. Dangl, “De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae,” Genome Research, vol. 19, no. 2, pp. 294–305, 2009.
  22. R. E. Mills, K. Walter, C. Stewart et al., “Mapping copy number variation by population-scale genome sequencing,” Nature, vol. 470, no. 7332, pp. 59–65, 2011.
  23. A. Gnirke, A. Melnikov, J. Maguire et al., “Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing,” Nature Biotechnology, vol. 27, no. 2, pp. 182–189, 2009.
  24. T. A. Fehniger, T. Wylie, E. Germino et al., “Next-generation sequencing identifies the natural killer cell microRNA transcriptome,” Genome Research, vol. 20, no. 11, pp. 1590–1604, 2010.
  25. C. Adamidi, Y. Wang, D. Gruen et al., “De novo assembly and validation of planaria transcriptome by massive parallel sequencing and shotgun proteomics,” Genome Research, vol. 21, no. 7, pp. 1193–1200, 2011.
  26. J. C. Marioni, C. E. Mason, S. M. Mane, M. Stephens, and Y. Gilad, “RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays,” Genome Research, vol. 18, no. 9, pp. 1509–1517, 2008.
  27. J. R. Bradford, Y. Hey, T. Yates, Y. Li, S. D. Pepper, and C. J. Miller, “A comparison of massively parallel nucleotide sequencing with oligonucleotide microarrays for global transcription profiling,” BMC Genomics, vol. 11, no. 1, article 282, 2010.
  28. D. S. Johnson, A. Mortazavi, R. M. Myers, and B. Wold, “Genome-wide mapping of in vivo protein-DNA interactions,” Science, vol. 316, no. 5830, pp. 1497–1502, 2007.
  29. H. Gu, Z. D. Smith, C. Bock, P. Boyle, A. Gnirke, and A. Meissner, “Preparation of reduced representation bisulfite sequencing libraries for genome-scale DNA methylation profiling,” Nature Protocols, vol. 6, no. 4, pp. 468–481, 2011.
  30. L. Liu, N. Hu, B. Wang et al., “A brief utilization report on the Illumina HiSeq 2000 sequencer,” Mycology, vol. 2, no. 3, pp. 169–191, 2011.


Copyright © 2012 Lin Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

What is Next Generation Sequencing?

In short, it is exactly the same thing as massively parallel sequencing (MPS). Consider this definition:

"Next-generation sequencing refers to non-Sanger-based high-throughput DNA sequencing technologies. Millions or billions of DNA strands can be sequenced in parallel, yielding substantially more throughput and minimizing the need for the fragment-cloning methods that are often used in Sanger sequencing of genomes."

The two terms are used interchangeably. Historically, next-generation sequencing (NGS) was the catchall term for high-throughput DNA sequencing. Over the past few years, however, the term MPS has gained broader adoption among forensic scientists.

Ion semiconductor sequencing (Ion Torrent sequencing)

Unlike Illumina and 454, Ion Torrent and Ion Proton sequencing do not use optical signals. Instead, they exploit the fact that adding a dNTP to a growing DNA strand releases an H+ ion.

The sample of DNA to sequence is cut up into fragments. Adapters are used to capture single molecules of template onto microbeads by primer hybridization. Beads are incorporated into a carefully controlled emulsion, in which each bubble constitutes a microreactor containing DNA template, primer and reagents for PCR. Following amplification, each bead is coated with clonally amplified molecules.

Each bead flows across a chip and is deposited into a well. The chip is flooded with one of the four dNTPs. Whenever a nucleotide is incorporated into a single strand of DNA, a hydrogen ion is released. The Ion Torrent system reads this chemical change directly on the chip by following the pH variations of the solution in the well. An ion-sensitive layer beneath the well measures the change in pH and converts it to a voltage. The recorded voltage change indicates that the nucleotide was incorporated, and the base is called. The voltage is proportional to the amount of H+ released, so a run of identical bases gives a proportionally larger signal. The process is repeated every cycle with a different nucleotide washing over the chip.
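The flow-by-flow logic described above can be sketched in a few lines of Python. This is a simplified illustration, not the vendor's base caller: it assumes the signals are already normalized so that 1.0 corresponds to one incorporated nucleotide, and the repeating "TACG" flow order, the function name, and the toy signal values are all hypothetical.

```python
def call_bases(flow_order, signals):
    """Turn one normalized signal per nucleotide flow into a base sequence.

    Because the signal is proportional to the H+ released, a value near 2.0
    is read as two identical bases in a row, near 0.0 as no incorporation.
    """
    if len(flow_order) != len(signals):
        raise ValueError("need exactly one signal per flow")
    called = []
    for base, signal in zip(flow_order, signals):
        # Round to the nearest whole number of incorporations, never below 0.
        count = max(0, int(signal + 0.5))
        called.append(base * count)
    return "".join(called)


# Toy example: two cycles of an assumed T, A, C, G flow order.
flows = "TACG" * 2
signals = [1.02, 0.05, 2.10, 0.00, 0.90, 0.98, 0.10, 3.05]
print(call_bases(flows, signals))
```

Rounding becomes less reliable as the signal grows, which is one reason long homopolymer runs are harder to call on flow-based platforms.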

  • Interrogating greater than 100 genes at a time cost effectively
  • Finding novel variants by expanding the number of targets sequenced in a single run.
  • Sequencing samples that have low input amounts of starting material, using, for example, Ion AmpliSeq library preparation, which requires as little as 10 ng of input DNA
  • Sequencing microbial genomes for pathogen subtyping to enable research of critical outbreak situations

The Sanger Method of DNA Sequencing (Note: at 0:30 the video contains an error. The primer anneals to the 3' end of the template DNA, not the 5' end as shown, because polymerase adds nucleotides to the 3' end of a growing DNA strand.)
