

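  • OPERA-MS: a hybrid assembler for second- and third-generation metagenomic sequencing data
    • Abstract
    • Main results
      • Fig. 1: OPERA-MS workflow
      • Fig. 2: Benchmarking hybrid assembly of genomes from metagenomes
      • Fig. 3: Assembling a virtual gut microbiome
      • Fig. 4: Mobile elements and association with host species in the human gut microbiome
    • Summary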
Short reads are first assembled by a metagenomic assembler into contigs, and short and long reads are mapped to them to obtain coverage information and spanning reads (Step 1). Spanning reads are then bundled to get edges between contigs for an assembly graph that represents the contiguity information of the whole metagenome (Step 2). Contigs are organized into a hierarchical clustering where the distance between contigs increases with genomic distance and their difference in coverage (Step 3). The tree is then cut into optimal clusters based on the BIC (Step 4). Optionally, to improve the clustering for species where a reference genome is available, the Mash genomic distance between each cluster and a database of complete bacterial genomes is computed (Step 5). Clusters are then merged if there is supporting information in the assembly graph to form species-specific super-clusters (Step 6). These super-clusters are further analyzed to deconvolute contigs that come from distinguishable subspecies genomes (Step 7). Finally, each cluster is independently scaffolded and gap-filled using a program meant for isolate genomes (OPERA-LG; Step 8).
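Step 2 of the workflow, bundling spanning reads into edges between contigs, can be illustrated with a minimal sketch. This is not OPERA-MS's actual implementation: the function name `bundle_spanning_reads`, the input tuple layout, and the `min_support` threshold are all assumptions made for illustration, and the sketch ignores contig orientation and distance estimates, which a real scaffolder would also track.

```python
from collections import Counter


def bundle_spanning_reads(spanning_reads, min_support=2):
    """Group reads that link the same pair of contigs into one edge.

    spanning_reads: iterable of (read_id, contig_a, contig_b) tuples
        (hypothetical input format).
    min_support: hypothetical threshold, not a documented OPERA-MS default.
    Returns {(contig_x, contig_y): read_count} with contig_x < contig_y.
    """
    support = Counter()
    seen = set()
    for read_id, contig_a, contig_b in spanning_reads:
        if contig_a == contig_b:
            continue  # a read inside one contig adds no contiguity information
        pair = tuple(sorted((contig_a, contig_b)))
        if (read_id, pair) in seen:
            continue  # count each read at most once per contig pair
        seen.add((read_id, pair))
        support[pair] += 1
    # Keep only edges backed by enough independent reads.
    return {pair: n for pair, n in support.items() if n >= min_support}


# Illustrative reads, not real data.
reads = [
    ("r1", "c1", "c2"),
    ("r2", "c2", "c1"),  # same contig pair, listed in the other order
    ("r3", "c2", "c3"),  # only one supporting read
    ("r4", "c1", "c1"),  # does not span two contigs
]
edges = bundle_spanning_reads(reads)
print(edges)
```

Requiring more than one supporting read per edge is a common way to limit spurious links from chimeric or mismapped reads; the right threshold depends on coverage and is left as a parameter here.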

Fig. 2: Benchmarking hybrid assembly of genomes from metagenomes.



a, Construction of a virtual gut microbiome that represents a complex metagenomic data set while retaining the ability to evaluate assemblies against gold-standard references.

b, Improvement in assembly contiguity (NGA50) obtained using OPERA-MS compared with other assemblers over different coverage ranges. Dots represent species that have at least two strains in the metagenome (species present in GIS20 and S2 with an abundance >0.1% as reported by MetaPhlAn2 (ref. 49) (v.2.6.0)). The number of assembled genomes, in ascending order of coverage, was 1 for Canu and 2, 6, 4 and 5 for the other methods. Data are presented as box plots (center line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers).

c, Comparison of misassembly rates (one dot per genome) for different assemblers, with solid lines indicating median values.

d, Evaluation of Illumina-only (M, MEGAHIT) and hybrid (H, hybridSPAdes; O, OPERA-MS) metagenomic assemblies after binning for their utility in downstream analysis. Bins that contained the largest fraction of a reference genome (GIS20 references; species with bold names have at least two strains in the metagenome) were evaluated for (1) genome completeness, the fraction of the genome represented in the bin, (2) genome purity, the percentage of bases in the bin that correspond to the correct reference, (3) gene completeness, the fraction of genes that were fully assembled in the bin and (4) pathway completeness, the fraction of pathways with over 90% of their constituent genes assembled and binned together.
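The bin-level metrics defined in panel d reduce to simple ratios, which the following sketch makes explicit. It is only an illustration of the definitions in the caption, not the evaluation code used in the study: the function names, the input layouts, and all numbers are hypothetical, and it assumes that alignment of bin contigs to the references has already been done elsewhere.

```python
def genome_completeness(covered_reference_bases, reference_length):
    """Fraction of the reference genome represented in the bin."""
    return covered_reference_bases / reference_length


def genome_purity(bin_bases_by_reference, correct_reference):
    """Percentage of bin bases that correspond to the correct reference."""
    total = sum(bin_bases_by_reference.values())
    if total == 0:
        return 0.0  # an empty bin has no bases from any reference
    return 100.0 * bin_bases_by_reference.get(correct_reference, 0) / total


def pathway_completeness(pathway_genes, binned_genes, threshold=0.9):
    """Fraction of pathways with over `threshold` of their genes in the bin."""
    if not pathway_genes:
        return 0.0
    complete = sum(
        1
        for genes in pathway_genes.values()
        if genes and len(genes & binned_genes) / len(genes) > threshold
    )
    return complete / len(pathway_genes)


# Illustrative numbers only, not results from the paper.
completeness = genome_completeness(4_500_000, 5_000_000)
purity = genome_purity({"ref_A": 4_600_000, "ref_B": 400_000}, "ref_A")
pathways = pathway_completeness(
    {"p1": {"g1", "g2"}, "p2": {"g3", "g4"}},
    binned_genes={"g1", "g2", "g3"},
)
print(completeness, purity, pathways)
```

Completeness is measured against the reference and purity against the bin, so the two can diverge: a bin can cover most of a genome while still containing many bases from other species.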

Fig. 4: Mobile elements and association with host species in the human gut microbiome.







