Statistical analysis of the performance of four Apache Spark ML algorithms




Big Data, Machine Learning, Classification Models, Apache Spark, Spark ML, Wilcoxon Test, Student’s T Test


Feature selection (FS) techniques generally require repeatedly training and evaluating models to assess the
importance of each feature for a particular task. However, as available databases grow in size,
distributed processing has become a necessity for many tasks. In this context, the Apache Spark
ML library is one of the most widely used libraries for classification and other tasks on large
datasets. Knowing both the predictive performance and the efficiency of its main algorithms before
applying an FS technique is therefore crucial for planning computations and saving time. In this work, a comparative
study of four Spark ML classification algorithms is carried out, statistically measuring execution times and
predictive power as a function of the number of attributes from a colon cancer database. The statistical analysis shows that, although Random Forest and Naïve Bayes have the shortest execution times, Support Vector Machine produces the models with the best predictive power. Studying the performance of these algorithms is of interest because they are applied to many different problems, such as the classification of pathologies from epigenomic data, image classification, and the prediction of computer attacks in network security.
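
The keywords name the Wilcoxon test and Student's t-test as the tools of the statistical analysis. As a minimal sketch of how paired execution times from two classifiers could be compared, the Python below computes both test statistics with the standard library only. It is not the authors' code: the function names and the timing values are hypothetical, and p-values are left out (in practice a statistics library would supply them).

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Student's t statistic for paired samples: mean difference / standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean_d / (sd_d / math.sqrt(n))


def wilcoxon_signed_rank(a, b):
    """Wilcoxon signed-rank statistic W = min(W+, W-).

    Zero differences are dropped and tied absolute differences
    receive the average of the ranks they span.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of equal absolute differences.
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return min(w_plus, w_minus)


# Hypothetical execution times in seconds for two classifiers,
# measured over the same five runs (illustrative values only).
times_a = [12.0, 11.0, 13.0, 12.5, 11.5]
times_b = [15.0, 15.0, 15.0, 16.5, 16.0]

t_stat = paired_t_statistic(times_a, times_b)
w_stat = wilcoxon_signed_rank(times_a, times_b)
print(f"paired t statistic: {t_stat:.3f}")
print(f"Wilcoxon W statistic: {w_stat}")
```

The t-test assumes the paired differences are roughly normal, while the signed-rank test only uses their ordering, which is one reason the two are often reported together for timing data.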








How to Cite

Camele, G., Hasperué, W., Ronchetti, F., & Quiroga, F. M. (2022). Statistical analysis of the performance of four Apache Spark ML algorithms. Journal of Computer Science and Technology, 22(2), e14.


