An analysis of k-mer frequency features with SVM and CNN for viral subtyping classification




CNN, genome, viral subtyping, k-mer, Kameris, Castor, ML-DSP


Viral subtyping classification is highly relevant for the appropriate diagnosis and treatment of illnesses. The most widely used tools are alignment-based; however, they are becoming too slow as the volume of genomic data grows. For that reason, alignment-free methods have emerged as an alternative. In this work, we analyzed four alignment-free algorithms: two use k-mer frequencies (Kameris and Castor-KRFE); the third uses a frequency chaos game representation of DNA with CNNs; and the last processes DNA sequences as a digital signal (ML-DSP). In the comparison, Kameris and Castor-KRFE outperformed the rest, followed by the CNN-based method.
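
As a rough illustration of the k-mer frequency features mentioned above, the following Python sketch counts every overlapping k-mer in a DNA sequence and normalizes the counts into a fixed-length vector. It is not the Kameris or Castor-KRFE implementation; the function name `kmer_frequencies`, the default `k`, and the choice to skip windows containing ambiguous bases are assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return normalized frequencies of all DNA k-mers over A, C, G, T.

    Hypothetical helper for illustration only; not taken from any tool
    discussed in the paper.
    """
    sequence = sequence.upper()
    # Fixed, lexicographically ordered vocabulary so that vectors from
    # different sequences are directly comparable (4**k entries).
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Assumption: windows containing ambiguous bases (e.g. N) are ignored.
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


# Example: a 16-dimensional vector of 2-mer frequencies.
features = kmer_frequencies("ACGTACGTNACG", k=2)
```

A vector of this kind could then be passed to a classifier such as an SVM; the specific feature selection and model settings used by each tool are described in the full paper.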








How to Cite

Machaca Arceda, V. E. (2020). An analysis of k-mer frequency features with SVM and CNN for viral subtyping classification. Journal of Computer Science and Technology, 20(2), e11.



Original Articles