Taxonomic assignment for large-scale metagenomic data on high-performance systems

Vinh Van Le, Hoai Van Tran, Hieu Ngoc Duong, Giang Xuan Bui, Lang Van Tran


Metagenomics is a powerful approach to studying environmental samples that does not require the isolation and cultivation of individual organisms. One of the essential tasks in a metagenomic project is identifying the origin of reads, referred to as taxonomic assignment. Because each metagenomic project must analyze large-scale datasets, taxonomic assignment is computationally intensive. This study proposes a parallel algorithm for the taxonomic assignment problem, called SeMetaPL, which aims to address this computational challenge. The proposed algorithm is evaluated with both simulated and real datasets on a high-performance computing system. Experimental results demonstrate that the algorithm achieves good performance and uses the system's resources efficiently. The software implementing the algorithm and all test datasets can be downloaded at


DNA sequences, homology search, metagenomics, parallel algorithm, taxonomic assignment



Journal of Computer Science and Cybernetics ISSN: 1813-9663

Published by Vietnam Academy of Science and Technology