An efficient classification algorithm for NGS data based on text similarity

  • Xiangyu Liao (a1), Xingyu Liao (a2), Wufei Zhu (a3), Lu Fang (a3) and Xing Chen (a3)...

Abstract

With the advancement of high-throughput sequencing technologies, the amount of available sequencing data is growing at a pace that has begun to seriously challenge the data processing and storage capacities of modern computer systems. Removing redundancy from such data by clustering can be crucial for reducing memory usage, disk space and running time. Clustering also helps to reduce dataset noise in some analysis applications. In this study, we propose a high-performance short sequence classification algorithm (HSC) for next generation sequencing (NGS) data based on an efficient hash function and text similarity. First, HSC converts all reads into k-mers and forms a unique k-mer set by merging duplicated and reverse complementary elements. Second, all unique k-mers are stored in a hash table, where the k-mer string is stored in the key field and the IDs of the reads containing the k-mer are stored in the value field. Third, each hash unit is transformed into a short text consisting of reads. Fourth, texts that satisfy the similarity threshold are combined into a long text; the merge operation is executed iteratively until no text satisfies the merge condition. Finally, each long text is transformed into a cluster consisting of reads. We tested HSC using five real datasets. The experimental results showed that HSC can cluster 100 million short reads within 2 hours and that it performs very well in reducing memory consumption. HSC is much faster than existing tools and can easily handle tens of millions of sequences. In addition, when HSC is used as a preprocessing tool to produce assembly data, the memory and time consumption of the assembler is greatly reduced, and the assembler can achieve better assemblies in terms of N50, NA50 and genome fraction.
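The five steps described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not name the similarity measure or the merge strategy, so the Jaccard index over read-ID sets, the greedy pairwise merge, and all function names (`canonical`, `build_kmer_table`, `merge_texts`, `cluster_reads`) are assumptions made purely for illustration. The quadratic merge loop is only suitable for toy inputs.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse
    complement, so both strands map to the same hash key."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def build_kmer_table(reads, k):
    """Steps 1-2: map each unique canonical k-mer (key) to the set of
    IDs of the reads that contain it (value)."""
    table = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            table[canonical(seq[i:i + k])].add(read_id)
    return table


def jaccard(a, b):
    """Assumed similarity measure: shared read IDs over all read IDs."""
    return len(a & b) / len(a | b)


def merge_texts(texts, threshold):
    """Step 4: repeatedly merge the first pair of texts whose similarity
    reaches the threshold, until no such pair remains."""
    texts = [set(t) for t in texts]
    merged = True
    while merged:
        merged = False
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                if jaccard(texts[i], texts[j]) >= threshold:
                    texts[i] |= texts[j]
                    del texts[j]
                    merged = True
                    break
            if merged:
                break
    return texts


def cluster_reads(reads, k, threshold):
    """Steps 3-5: treat each hash unit's read-ID set as a short text,
    merge similar texts, and return each long text as a sorted cluster."""
    table = build_kmer_table(reads, k)
    clusters = merge_texts(table.values(), threshold)
    return sorted(sorted(cluster) for cluster in clusters)


reads = ["ACGTAC", "CGTACG", "TTTTGG", "CCAAAA"]
print(cluster_reads(reads, k=4, threshold=0.5))
```

In this toy input, the third and fourth reads are reverse complements of each other, so they share all canonical k-mers and end up in the same cluster. Under these assumptions a read may appear in more than one cluster when its k-mers fall into texts that never reach the threshold with each other.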

Corresponding author

Author for correspondence: Wufei Zhu, E-mail: zhuwufei@aliyun.com

Footnotes

* These two authors contributed equally to this study.
