Assembly of metagenomic data

I'm trying to assemble metagenomic data from termite guts. The sequences come from SOLiD and are not paired, so the reads are extremely short (25 bp).

I have tried multiple assemblers (CLC, Velvet, MetaVelvet, Meta-IDBA), but all produce few contigs (15 contigs, 1000 bp average). The contig output is scarce considering the amount of raw data (5 Gbp).

Has anybody had any success with a task such as this?

Thanks.


This depends on your coverage and on the number and relative proportion of species in the mixture. It seems unlikely to produce results unless the protocol biases the library (rRNA universal primers, for instance). I think that with 25 bp sequences, even 30x coverage would not give fully assembled sequences. Typically, I believe 25 bp reads are only used to resequence close variants against similar or reference genomes.
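To make the coverage argument concrete, here is a rough back-of-the-envelope sketch in Python. The genome size and the abundance values are illustrative assumptions, not measurements from the termite gut data.

```python
def per_genome_coverage(total_bases, genome_size, relative_abundance):
    """Expected mean coverage of one genome in a mixed sample.

    relative_abundance is the fraction (0..1) of sequenced DNA that comes
    from this genome. All inputs here are assumed values for illustration.
    """
    return total_bases * relative_abundance / genome_size


total_bases = 5_000_000_000   # 5 Gbp of raw reads, as stated in the question
genome_size = 4_000_000       # hypothetical 4 Mbp bacterial genome

for abundance in (0.10, 0.01, 0.001):
    cov = per_genome_coverage(total_bases, genome_size, abundance)
    print(f"abundance {abundance:.1%}: ~{cov:.1f}x mean coverage")
```

Under these assumptions, only the more abundant members reach high coverage, and even then coverage does not compensate for the short read length discussed above.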


Metagenomics

Metagenomics is the study of genetic material recovered directly from environmental samples. The broad field may also be referred to as environmental genomics, ecogenomics or community genomics.

While traditional microbiology and microbial genome sequencing and genomics rely upon cultivated clonal cultures, early environmental gene sequencing cloned specific genes (often the 16S rRNA gene) to produce a profile of diversity in a natural sample. Such work revealed that the vast majority of microbial biodiversity had been missed by cultivation-based methods. [2]

Because of its ability to reveal the previously hidden diversity of microscopic life, metagenomics offers a powerful lens for viewing the microbial world that has the potential to revolutionize understanding of the entire living world. [3] As the price of DNA sequencing continues to fall, metagenomics now allows microbial ecology to be investigated at a much greater scale and detail than before. Recent studies use either "shotgun" or PCR directed sequencing to get largely unbiased samples of all genes from all the members of the sampled communities. [4]


Methods

Here, we describe MetAMOS [10], an open-source, modular assembly pipeline built upon AMOS and tailored specifically for metagenomic next-generation sequencing data. MetAMOS is the first step toward a fully automated assembly and analysis pipeline, from mated reads (Illumina and 454) to scaffolds and ORFs. Currently, MetAMOS has support for four assemblers (SOAPdenovo [11], Newbler, CABOG and Minimus [12]), three annotation methods (BLAST, PhymmBL and MetaPhyler), two metagenomic gene prediction tools (MetaGeneMark and Glimmer-MG) and one unitig scaffolder engineered specifically for metagenomic data (Bambus 2). We also provide a novel graph-based algorithm to propagate annotations rapidly to all contigs in an assembly using, for example, only the largest contigs or contigs with high-confidence classification. MetAMOS has three principal outputs: subdirectories containing FASTA sequence of the contigs/scaffolds/variant motifs belonging to a specified taxonomic level, a collection of all unclassified/potentially novel contigs contained in the assembly, and an HTML report with detailed assembly statistics and summary charts.


Introduction

The increased scientific and practical interest in the microbial world that surrounds us, as well as the emergence of new molecular-biological and bioinformatic approaches for the analysis of the diversity and genetic potential of microbial communities from diverse environments gave rise to what is now known as metagenomics.

While partial sequencing of the DNA of the microbiota (the community of microorganisms) is sufficient to assess the diversity of the sampled community, uncovering its genetic potential requires analyzing extended genomic regions or, even better, fully restored genomes from the microbiome (the combined genome of the microbiota). Such extended regions can be obtained by assembling the short DNA fragments produced by modern sequencing technologies.

Assembling a genome is a difficult task, owing both to the complexity of the genomes themselves and to the many particularities of the sequencing technologies used. In the case of metagenomic data, this task is further complicated by (1) the large volume of data produced, (2) the quality of the sequence, (3) the unequal representation of members of the microbial community, (4) the presence of closely related microorganisms with similar genomes, (5) the presence of several strains of the same microorganism, and (6) an insufficient amount of data for minor community members (Kunin et al., 2008; Sczyrba et al., 2017).

To overcome (or at least partially solve) these problems, a variety of approaches and analytical pipelines have been created and applied, with new ones continuously being developed (Ayling et al., 2020).

Here, we describe the challenges of metagenomic assembly (MA) and the role MA has played in the revolutionary discoveries of recent years, which have expanded our knowledge of the microbial world of our planet.


Results

Subsampling

In the present study, the data released by Nicholls et al. 15 (ultra-deep sequencing of two different mock communities using GridION and PromethION platforms) were used to study the suitability of nanopore sequencing for characterizing low-complexity microbial communities. The mock communities were composed of the same ten microorganisms, but in different proportions (Table 1). To reduce the computational resources needed for the first screening of the selected assemblers, the GridION and PromethION datasets were subsampled to obtain an output comparable with recent genomic or metagenomic studies based on MinION (approximately 3 Gbp and 6 Gbp) 9,34,35,36,37,38,39 . In general, mean read length remained the same in the subsampled datasets as in the original sequencing data 15 . However, read quality was higher in the subsampled datasets. This suggested a bias towards higher qualities at the start of the run, since subsampling was carried out by selecting the top reads of the original files (Table 2). In fact, the bottom reads, which are acquired later in the sequencing run, displayed the same quality as the whole dataset.
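The "top reads" subsampling described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name, the plain four-line FASTQ assumption and the tiny in-memory example are all hypothetical.

```python
import io


def take_top_reads(fastq_handle, target_bases):
    """Yield FASTQ records from the start of a file until target_bases is reached.

    Assumes plain four-line FASTQ records (header, sequence, '+', qualities).
    The read that crosses the target is still included.
    """
    total = 0
    while total < target_bases:
        record = [fastq_handle.readline() for _ in range(4)]
        if not record[0]:  # end of file reached before the target
            break
        total += len(record[1].strip())
        yield record


# Tiny in-memory example: three 8 bp reads and a 10 bp target.
demo = io.StringIO(
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nACGTACGT\n+\nIIIIIIII\n"
    "@r3\nACGTACGT\n+\nIIIIIIII\n"
)
kept = list(take_top_reads(demo, target_bases=10))
print(f"kept {len(kept)} reads")
```

Because this keeps only the earliest reads in the file, it inherits the start-of-run quality bias noted above; drawing reads at random would be one way to avoid it.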

Metagenome assembly

From the selected tools, we were able to correctly install and run nine out of the ten long-read assemblers, and two out of the three short-read assemblers (Supplementary Table S1). In total, 74 assemblies were generated, 40 for the Even mock community and 34 for the Log community. Six assemblies could not be completed because miniasm and Pomoxis failed to run with the 6 Gbp Log datasets, whereas Unicycler failed to run with the 3 Gbp Log datasets. The total size of each draft assembly and the fraction of metagenome recovered from the reference genomes were evaluated for the Even datasets in order to obtain a first view of the general tool performance.

Overall, long-read assemblers resulted in a total assembly size closer to the theoretical size, and also recovered a larger metagenome fraction, with some exceptions (Fig. 1). Nevertheless, large differences were detected for both metrics among the assemblers. All the assemblers were far from recovering the totality of the metagenome, both in the 3 Gbp and the 6 Gbp datasets (Fig. 1A). It must be noted that metaQUAST and minimap2 results were consistent for the long-read assemblers, but not for the short-read assemblers, where the minimap2 metric was significantly higher (Fig. 1B). MetaFlye (both versions) yielded the best assemblies in terms of total metagenome size and metagenome recovery except for the minimap2 metric, followed by Pomoxis, Canu and Raven (previously known as Ra). Interestingly, assembly pipelines based on the miniasm algorithm (Pomoxis, Unicycler, and miniasm itself) presented huge variations in their performance. Unicycler and miniasm performed relatively well for the 3 Gbp dataset, but when using 6 Gbp, the final assembly did not improve significantly in the case of miniasm, and the general performance was highly reduced for Unicycler. This is in contrast to Pomoxis, which produced the second most complete assemblies with both dataset sizes. Although based on miniasm, it is worth highlighting that Unicycler’s pipeline is designed for single isolate assembly, so reduced performance was expected for metagenomic studies. Finally, Redbean (previously known as wtdbg2) and Shasta resulted in poor assembly performance in comparison to the other long-read tools.

Fig. 1. Evaluation of metagenome assembly size corresponding to each tested tool for the subsampled Even datasets. (A) Total assembled size of draft assemblies with respect to the total size of the reference metagenome; (B) fraction of the reference metagenome covered by the draft assembly, calculated by two different methods: metaQUAST (top panel) and minimap2 + BBTools (bottom panel).

MetaQUAST was used for further evaluating the degree of completeness of each individual draft genome (Fig. 2). As expected, yeast genomes were generally less recovered than bacterial ones, due to their lower abundance (2%) and higher size, explaining the low metagenome fraction generally recovered by all the assemblers (Fig. 1). In fact, the maximum average recovery fraction for the bacterial genomes was 99.92% (Supplementary Fig. S1). Minia and Megahit were not able to recover any single genome with high completeness (> 95% of genome coverage) in any dataset. For the 3 Gbp dataset, metaFlye (both versions) and Unicycler recovered the eight bacterial genomes with a high completeness level (> 98.6%), while Pomoxis achieved lower recovery fractions for two genomes (96.9 to 97.4%). Raven and Canu resulted in reduced recovery percentages, but still retrieved all the prokaryotic genomes with a mean covered fraction greater than 85% and 87%, respectively. Redbean and Shasta achieved particularly low fractions of genome recovery.

Fig. 2. Fraction of the genome covered by the draft assemblies obtained using each tool, and for each individual microorganism (subsampled Even datasets). Miniasm assemblies are not shown, since it was not possible to evaluate them with metaQUAST.

For the 6 Gbp dataset, Unicycler performance decreased substantially as noted in Fig. 1, while Canu, Pomoxis, Raven and metaFlye achieved similar or better results. In general, metaFlye displayed the best performance on both dataset sizes in terms of genome recovery, closely followed by Pomoxis. This trend was also observed when analyzing the proportion of yeast genomes recovered by each tool. In this context, it is important to highlight that metaFlye’s ability to recover eukaryotic genomes was reduced when using metaFlye v2.7. This is due to the lower number of misassemblies retrieved by this metaFlye version, indicating that the reduced fraction of genome recovery is compensated with more reliable assemblies (Supplementary Fig. S2).

These results were confirmed when analyzing the Log mock community (Supplementary Fig. S3). Canu, metaFlye, Raven and Pomoxis were able to recover Listeria monocytogenes and Pseudomonas aeruginosa genomes (89.1% and 8.9% of total genomic DNA in the Log mock community, respectively) with a level of completeness higher than 99%. These assemblers also recovered a significant fraction of Bacillus subtilis (0.89% of total genomic DNA in the Log mock community). In fact, Raven was able to reconstruct > 99% of its genome using the 6 Gbp datasets, whereas metaFlye recovered 98%. In this case, both tools outperformed Canu. Nevertheless, Raven did not recover a significant fraction of Saccharomyces cerevisiae, whereas Canu and metaFlye did (> 8%). Pomoxis worked correctly when using the 3 Gbp datasets, but failed to run with both 6 Gbp files. The other tools based on the miniasm algorithm also failed to run the 3 Gbp (Unicycler) and/or 6 Gbp datasets (miniasm). In all cases, the error was related to memory usage and access (segmentation violation), and could not be solved. Nevertheless, using a computer with more RAM would help to easily overcome this problem. Shasta, RedBean, Minia and Megahit performed poorly in comparison to the other tools (Supplementary Fig. S3). It has to be noted that Shasta and RedBean were not originally designed to work with metagenomic data, which could result in problems handling uneven coverage.

Regarding the time consumed by each tool, Shasta was the fastest assembler (Fig. 3A). This tool was able to assemble the 6 Gbp datasets in only 285 s, approximately. RedBean and miniasm were the second and third fastest tools, followed by Raven (1.5–1.9 times faster than metaFlye v2.7). MetaFlye was 1.4–1.7 times faster than Pomoxis, and 3.8–5.5 times faster than Canu, which proved to be the slowest tool. These trends were also found in the Log mock community (Supplementary Fig. S4), where Canu spent up to 22 h reconstructing a draft metagenome assembly from the 6 Gbp datasets. In this case, Raven was faster than metaFlye v2.7 for the 3 Gbp datasets, but not for the 6 Gbp ones.

Fig. 3. General assembly performance of each tool for the subsampled Even datasets. (A) Run time; (B) N50; (C) number of contigs; (D) L50.

General metagenome statistics (N50, L50, and number of contigs) were evaluated using QUAST (Fig. 3; Supplementary Table S3). It has to be stressed that comparisons based on these metrics are difficult to interpret due to the large variation in general performance among the different assemblers. For instance, Shasta resulted in the highest N50 and the lowest L50 values for the 6 Gbp dataset, but this tool was able to cover less than 35% of the metagenome. In fact, the total assembly size for Shasta was approximately 18–21 Mbp, in comparison to the 49–53 Mbp assembled by metaFlye.
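For readers unfamiliar with these metrics, the following short Python sketch computes N50 and L50 from a list of contig lengths. It is an independent illustration rather than QUAST's implementation, and the contig lengths are made-up numbers chosen to show why a small assembly can still score a high N50.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running total, taken from
    the longest contig downwards, first reaches half of the assembly size.
    L50 is the number of contigs needed to reach that point.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, count
    return 0, 0  # empty input


# Hypothetical assemblies: one larger but fragmented, one small but contiguous.
larger_fragmented = [400, 300, 200, 100, 100, 100, 100, 100]
small_contiguous = [500, 50]

print(sum(larger_fragmented), n50_l50(larger_fragmented))
print(sum(small_contiguous), n50_l50(small_contiguous))
```

The smaller assembly obtains the better N50 and L50 despite containing far less sequence, which is why these metrics need to be read alongside total assembly size and genome fraction.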

As expected, short-read assemblers did not perform well with nanopore data, resulting in thousands (Minia), or even hundreds of thousands of contigs (Megahit). Interestingly, long-read assemblers resulted in more fragmented draft genomes when using the 6 Gbp datasets. Except for Shasta, the other long-read assemblers also reduced their N50 and increased their L50 and number of contigs when using 6 Gbp. Goldstein et al. 9 demonstrated that Canu assemblies improved with higher coverage when assembling bacterial isolates. This fact suggests that the loss of contiguity detected may be a direct consequence of a higher recovery rate of yeast genomes, which might be more fragmented. Indeed, assembly statistics of the Canu draft assemblies remained almost the same for the bacterial species when using 3 or 6 Gbp (Supplementary Table S4). Finally, metaFlye and Raven resulted in a more contiguous assembly with higher N50 and lower L50 in comparison to the other best performing tools (Canu and Pomoxis), for both 3 and 6 Gbp datasets (Fig. 3; Supplementary Table S3). Remarkably, metaFlye v2.7 yielded slightly better results than metaFlye v2.4 (Fig. 3B–D), and required less time (Fig. 3A).

ONT hardware, protocols and software are in constant development, leading to large improvements in short periods of time. Recently, an optimized DNA extraction and purification methodology has made it possible to reach an average yield of 15.9 Gbp per flowcell 46 . For that reason, we decided to run the most promising assemblers directly on GridION’s original data (Even mock community 14 Gbp). RedBean was included because of its computational efficiency, which is a key factor for the analysis of deeply sequenced microbiomes. Results were similar to those obtained for the 3 and 6 Gbp (Fig. 4). Canu recovered the highest proportion of bacterial genomes, closely followed by metaFlye. Raven, once again, displayed problems when reconstructing the whole Escherichia coli and Salmonella enterica genomes, an issue also detected for RedBean in a more notable way. MetaFlye and Raven achieved a better recovery ratio than Canu for the yeast genomes. Overall, metaFlye genomes were more complete but less contiguous than the Raven draft assemblies, which presented a lower number of contigs for all the species with the exception of E. coli and S. enterica (Fig. 4B). This trend was also observed for the Log datasets (Supplementary Fig. S4). Remarkably, Raven was able to assemble two bacterial genomes in only one contig (Lactobacillus fermentum and P. aeruginosa), and retrieved four additional genomes in only 2–3 contigs. Finally, it was not possible to run Pomoxis on this dataset because of the unsolvable error previously described.

Fig. 4. Even GridION (14 Gbp) assembly evaluation for the best performing tools. (A) Fraction of the genome covered by draft assemblies; (B) number of contigs for each microorganism.

Assembly accuracy

Sequencing errors are the biggest drawback of third generation sequencing platforms. These errors can reach the final assemblies, resulting in lower quality draft genomes. In order to evaluate how the different assemblers handle the specific error profile of ONT platforms, we analysed the total number of SNPs and indels present in each draft metagenome. As described in Methods, two different and complementary strategies were used to quantify these types of errors: (1) minimap2 + bcftools, and (2) MUMmer (Fig. 5). Both strategies relied on the alignment of the draft assemblies to the reference metagenome, composed of a mix of all the complete genomes of each strain present in the mock community.

Fig. 5. Assembly accuracy for the draft assemblies (subsampled Even datasets). (A) Percentage of similarity calculated as the total number of matches normalized by the metagenome size; (B) percentage of indels calculated as the total number of indels normalized by the metagenome size. In both cases, two different strategies were used: (top panel) alignment with minimap2 and evaluation with bcftools + ‘indels_and_snps.py’ in-house script; (bottom panel) alignment with MUMmer and evaluation with ‘count_SNPS_indels.pl’ script from Goldstein et al. 9 .

Results were not fully consistent between the two methodologies, especially for the indels estimation, but they still showed similar trends. All the long-read assemblers retrieved draft metagenomes with an average similarity higher than 98.9%, with the exception of miniasm, which resulted in an approximate accuracy of only 96%. This low accuracy could explain the inability of metaQUAST to evaluate miniasm assemblies. It has to be noted that the other pipelines based on miniasm, Pomoxis and Unicycler, incorporated several rounds of polishing via Racon 45 , which substantially reduced the number of SNPs and indels in the final draft assembly (see below).

Canu displayed a higher percentage of similarity for both methodologies and datasets, followed by Unicycler for the 3 Gbp dataset, and Shasta for the 6 Gbp one. Pomoxis, metaFlye, and Raven presented similarities over 99.5%. In the case of the indel profile, Unicycler and metaFlye v2.7 clearly outperformed Canu. Raven and Pomoxis also achieved a better indel ratio than Canu, except for the 6 Gbp dataset and the bcftools metric. Redbean, miniasm, and Shasta results were inconsistent between the two methodologies tested (Fig. 5).

Biosynthetic gene cluster prediction

Gene prediction is highly affected by genome contiguity, completeness and accuracy. BGCs are especially influenced by these factors, since they are usually found in repetitive regions which are often poorly assembled. AntiSMASH was used to assess the number of clusters found in the draft assemblies retrieved by each tool in comparison to the reference metagenome with the aim of evaluating BGC prediction on nanopore-based metagenomic assemblies (Fig. 6). As expected, none of the tools recovered the entire BGC profile, since metagenomes were not completely reconstructed (Fig. 1). Using the entire GridION dataset (14 Gbp) did not improve the number of BGCs recovered (Supplementary Table S5). Overall, when considering the total number of BGCs predicted and the similarity of the obtained profile compared to the reference profile, Raven displayed the best performance for both 3 Gbp datasets, whereas metaFlye v2.7 displayed the best performance for the 6 Gbp datasets. Pomoxis also achieved good predictions, outperforming Canu. All the predicted profiles presented an enrichment in lasso peptides (ribosomally-synthesized short peptides), which were not present in the reference profile. To further study this phenomenon, lasso peptides predicted by the different tools were searched using BLAST against the BGCs predicted in the reference metagenome. No hits were found, suggesting that these results might be prediction artifacts mainly caused by indels, which are probably introducing frameshift errors, and artificially increasing the number of short peptides being predicted (i.e. lasso peptides). In fact, metaFlye v2.7, which had a significantly lower indel ratio, retrieved fewer lasso peptides than metaFlye 2.4 (Fig. 5). We also corrected Pomoxis assemblies with Medaka, leading to a lower indel ratio (see the following section). Lasso peptides were not detected in Pomoxis + Medaka assemblies, highlighting the importance of indel correction for functional prediction (Supplementary Fig. S5).

Fig. 6. Number of biosynthetic gene clusters (BGCs) predicted by antiSMASH for each draft assembly in the Even GridION datasets. (A) BGCs predicted using the 3 Gbp dataset; (B) BGCs predicted using the 6 Gbp dataset.

Polishing evaluation

Polishing is the process of correcting assemblies in order to generate improved consensus sequences. Input for polishing nanopore-based assemblies can be raw ONT reads (e.g. Racon or Medaka) 45 , raw electric signal (e.g. Nanopolish) (https://github.com/jts/nanopolish), or even high quality short reads (e.g. Racon) 45 . The state-of-the-art polishing workflow for nanopore sequencing consists of correcting the draft assemblies through several rounds of Racon (typically 2–4), followed by a single Medaka step.
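The workflow just described (several Racon rounds, then one Medaka step) can be laid out as a command plan. The sketch below only builds the argument lists and does not run anything. The file names, thread count and output directory are hypothetical, and the command-line options shown should be checked against the installed versions of minimap2, Racon and Medaka before use.

```python
def polishing_plan(draft, reads, racon_rounds=4, threads=8):
    """Build (not run) the commands for N Racon rounds plus one Medaka step.

    Each entry is (argv, stdout_path). stdout_path is the file that the
    command's standard output would be redirected to, or None when the tool
    writes to its own output directory. All file names are placeholders.
    """
    plan = []
    current = draft
    for i in range(1, racon_rounds + 1):
        paf = f"round{i}.paf"
        polished = f"racon{i}.fasta"
        # Map the ONT reads back to the current draft (PAF on stdout) ...
        plan.append((["minimap2", "-x", "map-ont", "-t", str(threads),
                      current, reads], paf))
        # ... then let Racon build a corrected consensus from those overlaps.
        plan.append((["racon", "-t", str(threads), reads, paf, current],
                     polished))
        current = polished
    # A single Medaka step on the last Racon output.
    plan.append((["medaka_consensus", "-i", reads, "-d", current,
                  "-o", "medaka_out", "-t", str(threads)], None))
    return plan


plan = polishing_plan("draft.fasta", "reads.fastq")
for argv, stdout_path in plan:
    redirect = f" > {stdout_path}" if stdout_path else ""
    print(" ".join(argv) + redirect)
```

Keeping the plan as data makes it easy to change the number of Racon rounds per assembler, which matters because the best number of rounds differed between tools in the results below.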

Some of the tested tools automatically incorporated Racon (Raven, Pomoxis and Unicycler) in their pipelines, whereas the others included different algorithms for correcting the reads before (Canu) or after (metaFlye and RedBean) the assembly process. Thus, we wanted to assess how various steps of polishing could affect the SNP and indel ratio of the different assemblers. Results were highly heterogeneous (Fig. 7; Supplementary Table S6). Pomoxis and Raven drastically improved their accuracy after several rounds of polishing with the original ONT reads (Supplementary Table S6). In fact, accuracy with no polishing steps was close to 96%, as reported for miniasm (Fig. 5). Higher similarity percentages were observed after one round of Racon (1R) for Raven, and four rounds of Racon + one round of Medaka (4R + m) for Pomoxis. Redbean and metaFlye (which were run again without using their built-in polishers) also improved their accuracy after 1R or 4R + m, respectively. Canu presented a lower percentage of SNPs when no polishing steps were added to the pipeline (Supplementary Table S6). Nevertheless, all the tools drastically improved their indel ratio after 4R + m. The percentage of improvement varied between 41% (Canu) and 91% (Raven and Pomoxis) (Fig. 7A). It has to be highlighted that the lowest number of SNPs and indels was achieved by Canu, which is the only tool that carries out error correction before assembling the reads.

Fig. 7. Polishing evaluation. (A) Percentage of improvement within the whole metagenome, taking as a reference the number of errors prior to polishing; (B) highest similarity percentage achieved by each tool; (C) best indel ratio achieved by each tool. Note that a different number of polishing rounds may be needed for achieving the highest similarity and the lowest indel ratio depending on the tool.
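As a small worked illustration of the "percentage of improvement" in panel A, the sketch below expresses the reduction in errors relative to the count before polishing. The error counts are invented for the example and are not taken from Supplementary Table S6.

```python
def percent_improvement(errors_before, errors_after):
    """Relative reduction in errors, as a percentage of the pre-polishing count.

    A negative value means that polishing increased the number of errors.
    """
    if errors_before <= 0:
        raise ValueError("errors_before must be positive")
    return 100.0 * (errors_before - errors_after) / errors_before


# Hypothetical indel counts before and after polishing.
print(percent_improvement(10_000, 900))
print(percent_improvement(100, 120))
```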

The error profiles were evaluated again to further assess whether polishing draft assemblies with high quality short reads led to improved assemblies. Albeit yielding heterogeneous results, all the tools achieved better indel ratios after four rounds of Racon correction with Illumina reads (Supplementary Table S6). In this case, all the assemblers improved their accuracy (% of similarity) after one (Canu and metaFlye) or four (Pomoxis, Raven and RedBean) Racon rounds. When comparing the highest scores obtained with Illumina-based correction to the highest scores achieved after ONT-based polishing (Fig. 7), the percentage of similarity was higher for metaFlye and Canu assemblies corrected with Illumina reads, and lower for Pomoxis, Raven and RedBean, where ONT polishing outperformed Illumina’s. A similar trend was observed for the indel ratio. This time, Illumina correction clearly enhanced the indel correction for metaFlye and Canu. In fact, Canu + Illumina correction retrieved the lowest indel ratio. Pomoxis, Raven and RedBean achieved a better indel correction with ONT reads.


Assembly pipelines

Above, we described metaSPAdes as a pipeline for metagenomic assembly that incorporates SPAdes. A number of other software pipelines are available that combine read pre-processing, metagenomic assemblers and post-assembly analysis. Perhaps the most comprehensive example is MetAMOS [55], which, at the time of writing, supports almost 20 genomic and metagenomic assemblers, along with a wide range of pre-processing, filtering, validation and annotation tools. Users can create workflows containing combinations of the tools that are suited to their data sets.

InteMAP [56] integrates output from two dBG assemblers (ABySS, IDBA-UD) and one OLC assembler (Celera) by separately merging low and high coverage contigs from pairs of assemblers. The authors of EnsembleAssembler also argue that merging the output from dBG and OLC assemblies can produce improved results [57]. MetaCRAM [25] is focussed on efficient storage via compression of metagenomic data sets. It taxonomically classifies reads and then assembles unclassified reads using IDBA-UD. Both the aligned reads and the unaligned read assemblies are then compressed for storage.


Target Audience

Graduates, postgraduates, staff bioinformaticians and PIs working with or about to embark on analysis of marker genes, metagenomic, and metatranscriptomic data from microbiome-focused experiments.

Prerequisites: Basic familiarity with Linux environment and statistical analysis is required. Must be able to complete and understand the following simple Linux tutorial before attending:

You will also require your own laptop computer. Minimum requirements: 1024x768 screen resolution, 1.5GHz CPU, 2GB RAM, 10GB free disk space, recent versions of Windows, Mac OS X or Linux (Most computers purchased in the past 3-4 years likely meet these requirements). If you do not have access to your own computer, you may loan one from the CBW. Send us an email for more information.


Materials and methods

Assembly validation

MUMmer [77] version 3.23 was used to align assembled contigs/scaffolds to the reference genomes (--maxmatch -l 20). When scaffolds were available, contigs were extracted by splitting the scaffolds at three or more consecutive Ns. For the scaffold analysis in the HMP tongue dorsum sample, scaffolds were left intact. Only contigs/scaffolds over 150 bp were used for validation (unassembled reads did not count towards the total). Alignments were then filtered using 'delta-filter -i 97 -q' to only retain the best hits to the reference for each contig/scaffold. All statistics were calculated on the final set of filtered alignments using a custom validation script. A contig with an alignment to a single reference genome across its entire length (allowing for a ± 15 bp mismatch at the ends of the alignment) was considered a good contig. A contig with an alignment covering > 80% of the contig length but < 100% was considered a slight mis-assembly and still considered valid. A contig with a single alignment covering less than 80% of the contig length, multiple alignments to a single reference genome, or multiple alignments to multiple reference genomes was considered a mis-assembly (and, in the case of alignments to multiple reference genomes, chimeric). For the HMP tongue dorsum dataset, contigs were allowed to have multiple alignments to a single reference and to align at lower identity (-i 90) due to the expected differences between the selected reference genome set and the actual genomes in the sample. None of the heavily mis-assembled contigs or chimeric contigs were used to calculate reference coverage statistics. The assembly validation scripts are available for download at [78]. For the scaffold analysis we only counted detectable chimera events as errors.
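The scaffold splitting and contig classification rules above can be sketched in Python as follows. This is not the authors' validation script (which is available at [78]). The input format, the "unaligned" label, the reading of the ± 15 bp allowance as up to 15 unaligned bases at each end, and the treatment of exactly 80% as a mis-assembly are all assumptions made for illustration.

```python
import re


def split_scaffold(sequence):
    """Split a scaffold into contigs at runs of three or more Ns."""
    return [part for part in re.split(r"[Nn]{3,}", sequence) if part]


def classify_contig(contig_length, alignments, end_tolerance=15):
    """Classify a contig from its filtered alignments.

    alignments is a list of (reference_name, aligned_bases_on_contig)
    tuples, a hypothetical format chosen for this sketch.
    """
    if not alignments:
        return "unaligned"  # category not defined in the text above
    references = {ref for ref, _ in alignments}
    if len(references) > 1:
        return "chimeric mis-assembly"
    if len(alignments) > 1:
        return "mis-assembly"
    aligned = alignments[0][1]
    # Assumed reading of the tolerance: up to 15 unaligned bases per end.
    if aligned >= contig_length - 2 * end_tolerance:
        return "good"
    if aligned > 0.8 * contig_length:
        return "slight mis-assembly"
    return "mis-assembly"


print(split_scaffold("ACGTNNNACGTNNACGT"))
print(classify_contig(1000, [("genomeA", 990)]))
print(classify_contig(1000, [("genomeA", 900)]))
print(classify_contig(1000, [("genomeA", 500), ("genomeB", 400)]))
```

The relaxed rule used for the HMP tongue dorsum dataset (multiple alignments to one reference allowed) would need a separate branch and is left out here.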

Annotation validation

The mock dataset annotations were generated using FCP. Each assembler was run within MetAMOS as described below and the assembled contigs (along with unassembled sequences) annotated using FCP. To establish a truth, sequences were mapped using Bowtie to the known references. The command 'bowtie -p 10 -f -l 25 -e 140 --best -k 1 -S' was used to pick the best genome for each sequence. Unmapped sequences were also recorded. The annotation results were compared to this truth at each taxonomic level using custom scripts. Finally, the MetAMOS propagation step was run using the class-level annotation and the results compared to the pre-propagation results.

Default MetAMOS parameters

The Preprocess step includes one external software program, FastQC, in addition to a custom filtration script for read pre-processing. By default, read pre-processing is disabled. If enabled (via the -t parameter), all reads containing low quality bases and Ns are aggressively discarded, which can result in 5 to 10% (or more) of the reads being discarded. If fastq files are available and FastQC is enabled (by default it is disabled), the following command is executed: fastqc -t <cpus> fastq_input_files.

The Assemble step currently supports eight assemblers. The default parameters/recipes for each assembler are available in the configuration files from the MetAMOS code repository and are listed below.

The map reads step relies on the short read mapper bowtie to align the reads to the assembled contigs. The default bowtie command is: 'bowtie -l 25 -e 100 --best --strata -m 10 -k 1'. Alternatively, the user can select via the '-w' parameter to trim the reads to 25 bp and align with the following parameters: 'bowtie -v 1 -M 2'.

Currently MetAMOS supports three metagenomic gene finders: MetaGeneMark, FragGeneScan, and Glimmer-MG. MetaGeneMark and FragGeneScan are run with default parameters.

We rely on a utility script that runs Repeatoire to identify repetitive contigs and create multi-alignments of ORF families. Repeatoire parameters are, by default, set to: '--minreplen = 200 --z = 17'.

By default, MetaPhyler is enabled to quickly estimate abundance in the supplied metagenomic sample. The included version, MetaPhylerV1.13, relies on blastp. The blastp parameters used are: '-m 8 -b 10 -v 10'. No other parameters are required for running MetaPhyler.

If annotate is enabled, FCP is used to annotate/classify contigs and predicted ORFs. The default parameters are used. In addition to FCP, we also support phmmer (-E 1.0e-10), PhyloSift ('all -threaded'), and PhymmBL (default program parameters).

Bambus 2 is the metagenomic scaffolder included within MetAMOS and is also executed with default parameters (coverage cut-off is automatically calculated from the assembly graph).
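
The custom filtration script used by the Preprocess step is not reproduced in the text. The sketch below only illustrates the idea of discarding reads that contain Ns or low quality bases. It assumes that either condition alone is enough to drop a read, that qualities are Phred+33 encoded, and that "low quality" means below a hypothetical threshold of 20; none of these values come from MetAMOS.

```python
def keep_read(sequence, quality, min_phred=20, phred_offset=33):
    """Return True if the read has no N and no base below min_phred.

    min_phred and phred_offset are illustrative assumptions.
    """
    if "N" in sequence.upper():
        return False
    return all(ord(ch) - phred_offset >= min_phred for ch in quality)


def filter_fastq(lines, **kwargs):
    """Yield (header, sequence, quality) for each passing 4-line FASTQ record."""
    records = [line.rstrip("\n") for line in lines]
    # Step through complete 4-line records; a trailing partial record is ignored.
    for i in range(0, len(records) - 3, 4):
        header, seq, _plus, qual = records[i:i + 4]
        if keep_read(seq, qual, **kwargs):
            yield header, seq, qual
```

Because a single poor base removes the whole read, a filter of this kind is aggressive, which is consistent with the 5 to 10% (or more) loss mentioned above.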

Software packages and corresponding parameters used in our experiments

Program versions

All parameters used were default unless otherwise specified. The parameters below are the defaults within the MetAMOS pipeline for each tool. Modifications to default program parameters were the result of either a) recommendations from the program's author/user guide, b) published parameter settings on similar datasets, or c) empirical studies. SOAPdenovo version 1.05 was run with the parameters '-D -d -R -M 3'. Velvet version 1.1.05 was run with 'k = 51'. Meta-IDBA version 0.19 was run with the parameters '--mink 21 --maxk <user specified> --cover 1 --connect'. MetaVelvet version 1.1.01 was run with default parameters. Bowtie version 0.12.7 was run with '-l 25 -e 100 --best --strata -m 10 -k 1'. MetaGeneMark version 2.7d was run with default parameters. FragGeneScan version 1.16 was run with default parameters. FCP (nb-classify, epsilon-NB.py) version 1.0 was run with default parameters.
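
To make the Bowtie setting above concrete, here is a small sketch that turns the quoted option string into an argument list that could be handed to a process launcher. The option string is taken from the text; the index prefix, the reads file name, the optional thread count, and the helper name `bowtie_argv` are hypothetical, and the sketch only builds the list without running anything.

```python
import shlex

# Option string as quoted in the text for Bowtie 0.12.7.
DEFAULT_BOWTIE_OPTIONS = "-l 25 -e 100 --best --strata -m 10 -k 1"


def bowtie_argv(index_prefix, reads_path,
                options=DEFAULT_BOWTIE_OPTIONS, threads=None):
    """Build an argv list: bowtie [options] <index> <reads>.

    index_prefix and reads_path are placeholders supplied by the caller.
    """
    argv = ["bowtie"] + shlex.split(options)
    if threads is not None:
        # -p sets the number of alignment threads.
        argv += ["-p", str(threads)]
    argv += [index_prefix, reads_path]
    return argv
```

Keeping the options as a single recorded string and splitting it at call time means the value reported in the methods and the value actually executed cannot drift apart.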

HMP mock experiment

For all experiments, the default MetAMOS parameters were used. For all assemblers, a k-mer of 51 was specified. For Bambus 2, a redundancy threshold of 10 was used.

HMP tongue dorsum experiment

For all experiments, the default MetAMOS parameters were used (except for specifying alternative assemblers with -a soap for SOAPdenovo and -a metaidba for Meta-IDBA). For all assemblers, a k-mer of 51 was specified. For Bambus 2, a redundancy threshold of 10 was used. The motif was aligned using web-based blastn against the nr database to identify top-scoring genes.

MetaHIT experiment/sexual dimorphism

We randomly selected three males and three females from the MetaHIT project, matched for age (59 years), country (Denmark), and enterotype (ET1) [7]. We also chose the samples to have approximately equal body mass index (26.19 for males versus 24.12 for females). The chosen samples were MH0041, MH0045, and MH0055 for males and MH0002, MH0024, and MH0082 for females. MetAMOS was run on all three samples of each sex using the longer paired libraries for each sample (ERR011181, ERR011189, ERR011209 for males and ERR011091, ERR011149, ERR011264 for females).

To test for concordance between pre- and post-assembly annotations, we selected the order level classifications and compared the percentage classified at each order in the pre- and post-assembly male and female samples independently. We used R (version 2.11.1) and the command cor.test(preAsm, postAsm) to estimate the concordance between pre- and post-assembly assignments. To test for significance of the difference between samples we used the Fisher exact test on the order, family, and genus compositions of the male and female samples with the R command fisher.test(x).
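
For readers without R at hand, the correlation estimate that cor.test reports by default (Pearson's r) can be sketched in a few lines. This is only an illustration of the statistic: it does not reproduce the p-value or confidence interval that cor.test also returns, and the function name `pearson_r` and the example vectors are made up for this sketch.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical percentages classified at three orders, before and after assembly.
pre_asm = [40.0, 35.0, 25.0]
post_asm = [42.0, 33.0, 25.0]
concordance = pearson_r(pre_asm, post_asm)
```

A value close to 1 indicates that the per-order percentages before and after assembly rise and fall together.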

We ran two versions of MetaPhyler: the original, based on BLAST, and the new version, based on MUMmer. The new MetaPhyler is significantly faster: it ran in 12 CPU hours, compared to 25 CPU hours for the post-assembly analysis (which used the original MetaPhyler).

For all experiments, the default MetAMOS parameters were used and a k-mer of 51 was specified. For Bambus 2, a redundancy threshold of 10 was used.

Datasets used in our experiments

HMP mock samples were part of the Human Microbiome Project Metagenomes Mock Pilot (BioProject ID: 48475) and available for download at [79].

HMP tongue dorsum sample was downloaded from the SRA [SRA:SRS077736].

MetaHIT human gut metagenome samples from three Danish males (aged 59 years):
* MH0041 [80] (run accession ERR011181) [81, 82]
* MH0045 [83] (run accession ERR011189) [84, 85]
* MH0055 [86] (run accession ERR011129) [87, 88]

MetaHIT human gut metagenome samples from three Danish females (aged 59 years):
* MH0002 [89] (run accession ERR011091) [90, 91]
* MH0024 [92] (run accession ERR011149) [93, 94]
* MH0082 [95] (run accession ERR011264) [96, 97]


Author information

H Bjørn Nielsen, Mathieu Almeida, Julian Parkhill and Keith Turner: These authors contributed equally to this work.

Affiliations

Center for Biological Sequence Analysis, Technical University of Denmark, Kongens Lyngby, Denmark

H Bjørn Nielsen, Agnieszka Sierakowska Juncker, Simon Rasmussen, Damian R Plichta, Laurent Gautier, Anders G Pedersen, Ida Bonde, Marcelo B Quintanilha dos Santos, Piotr Dworzynski, Ole Lund, David W Ussery, Thomas Sicheritz-Ponten & Søren Brunak

Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark

H Bjørn Nielsen, Agnieszka Sierakowska Juncker, Ida Bonde, Nikolaj Blom, Thomas Sicheritz-Ponten & Søren Brunak

INRA, Institut National de la Recherche Agronomique, UMR 14121 MICALIS, Jouy en Josas, France

Mathieu Almeida, Emmanuelle Le Chatelier, Jean-Michel Batto, Fouad Boumezbeur, Joël Doré, Sean Kennedy, Pierre Léonard, Florence Levenez, Bouziane Moumen, Nicolas Pons, Edi Prifti, Pierre Renault, S Dusko Ehrlich, Alexandre Jamet, Antonella Cultrone, Christine Delorme, Emmanuelle Maguin, Eric Guedon, Gaetana Vandemeulebrouck, Ghalia Khaci, Maarten van de Guchte, Nicolas Sanchez, Rozenn Dervyn, Séverine Layec & Yohanan Winogradski

INRA, Institut National de la Recherche Agronomique, US 1367 Metagenopolis, Jouy en Josas, France

Mathieu Almeida, Emmanuelle Le Chatelier, Jean-Michel Batto, Fouad Boumezbeur, Joël Doré, Sean Kennedy, Pierre Léonard, Florence Levenez, Bouziane Moumen, Nicolas Pons, Edi Prifti, S Dusko Ehrlich, Benoit Quinquis, Florence Haimet, Hervé Blottière & Nathalie Galleron

Department of Computer Science, Center for Bioinformatics and Computational Biology, University of Maryland, USA

Mathieu Almeida

BGI Hong Kong Research Institute, Hong Kong, China

Junhua Li & Junjie Qin

BGI-Shenzhen, Shenzhen, China

Junhua Li, Manimozhiyan Arumugam, Karsten Kristiansen, Junjie Qin & Jun Wang

School of Bioscience and Biotechnology, South China University of Technology, Guangzhou, China

European Molecular Biology Laboratory, Heidelberg, Germany

Shinichi Sunagawa, Manimozhiyan Arumugam, Jens Roat Kultima, Julien Tap, Takuji Yamada & Peer Bork

Commissariat à l'Énergie Atomique et aux Énergies Alternatives, Institut de Génomique, Évry, France

Eric Pelletier, Denis Le Paslier, François Artiguenave, Jean Weissenbach & Thomas Bruls

Centre National de la Recherche Scientifique, Évry, France

Eric Pelletier & Denis Le Paslier

Université d'Évry Val d'Essonne, Évry, France

Eric Pelletier & Denis Le Paslier

The Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Copenhagen, Denmark

Trine Nielsen, Manimozhiyan Arumugam, Kristoffer S Burgdorf, Torben Hansen, Oluf Pedersen & Jun Wang

Digestive System Research Unit, University Hospital Vall d'Hebron, Ciberehd, Barcelona, Spain

Chaysavanh Manichanh, Natalia Borruel, Francesc Casellas, Francisco Guarner, Antonio Torrejon, Encarna Varela & Maria Antolin

Faculty of Health Sciences, University of Southern Denmark, Odense, Denmark

Torben Hansen

Department of Structural Biology, VIB, Brussels, Belgium

Falk Hildebrand & Falony Gwen

Department of Bioscience Engineering, Vrije Universiteit, Brussels, Belgium

Falk Hildebrand & Jeroen Raes

Division for Epidemiology and Microbial Genomics, National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark

Department of Biology, University of Copenhagen, Copenhagen, Denmark

Karsten Kristiansen & Jun Wang

Hagedorn Research Institute, Gentofte, Denmark

Oluf Pedersen

Institute of Biomedical Science, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark

Oluf Pedersen & Niels Grarup

Faculty of Health, Aarhus University, Aarhus, Denmark

Oluf Pedersen

Department of Microbiology and Immunology, Rega Institute, KU Leuven, Belgium

VIB Center for the Biology of Disease, Leuven, Belgium

Department of Biology, Section of Microbiology, University of Copenhagen, Copenhagen, Denmark

Søren Sørensen

Laboratory of Microbiology, Wageningen University, Wageningen, The Netherlands

Sebastian Tims, Willem M de Vos, Jørgensen Torben, Michiel Kleerebezem & Zoetendal Erwin G

Department of Biological Information, Tokyo Institute of Technology, Yokohama, Japan

Takuji Yamada

Max Delbrück Centre for Molecular Medicine, Berlin, Germany

Princess Al Jawhara Center of Excellence in the Research of Hereditary Disorders, King Abdulaziz University, Jeddah, Saudi Arabia

King's College London, Centre for Host-Microbiome Interactions, Dental Institute Central Office, Guy's Hospital, United Kingdom

S Dusko Ehrlich

Institut Mérieux, Lyon, France

Alexandre Mérieux, Christian Brechot & Christine M'Rini

Danone Research, Palaiseau, France

Gérard Denariaz, Johan E T van Hylckama Vlieg, Muriel Derrien & Patrick Veiga

Gut Biology & Microbiology, Danone Research, Center for Specialized Nutrition, Wageningen, the Netherlands

