SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data
Keywords: Big Data, Imbalanced classification, Preprocessing, SMOTE, Spark
The volume of data handled by today's applications has changed the way Machine Learning problems are addressed. Indeed, the Big Data scenario imposes scalability requirements that can only be met through intelligent model design and the use of distributed technologies. In this context, solutions based on the Spark platform have established themselves as a de facto standard.
In this contribution, we focus on a very important problem within Big Data Analytics, namely classification with imbalanced datasets. The main characteristic of this problem is that one of the classes is underrepresented, which usually makes it harder to find a model that identifies that class correctly. For this reason, it is common to apply preprocessing techniques such as oversampling to balance the distribution of examples across classes.
In this work we present SMOTE-BD, a fully scalable preprocessing approach for imbalanced classification in Big Data. It is based on one of the most widespread preprocessing solutions for imbalanced classification, namely the SMOTE algorithm, which creates new synthetic instances according to the neighborhood of each example of the minority class. Our design is independent of the number of partitions or processes created to achieve a higher degree of efficiency. Experiments conducted on different standard and Big Data datasets show the quality of the proposed design and implementation.
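To make the neighborhood-based generation step concrete, the following is a minimal single-machine sketch of the classic SMOTE interpolation, not the distributed Spark design proposed here. The function names (`k_nearest`, `smote`), the brute-force neighbor search, the parameters, and the toy data are all assumptions made for illustration.

```python
import math
import random


def k_nearest(index, points, k):
    """Return the k points closest (Euclidean) to points[index], excluding itself."""
    target = points[index]
    others = [p for j, p in enumerate(points) if j != index]
    return sorted(others, key=lambda p: math.dist(target, p))[:k]


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating minority examples with their neighbors.

    Hypothetical helper: brute-force neighbor search over the whole minority
    class, suitable only for small in-memory data.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples to interpolate")
    rng = random.Random(seed)
    # Neighborhood of every minority example, computed against all the others.
    neighbours = [k_nearest(i, minority, k) for i in range(len(minority))]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))   # pick a minority example
        x = minority[i]
        nn = rng.choice(neighbours[i])     # pick one of its k nearest neighbors
        gap = rng.random()                 # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic


# Toy minority class with two numeric features (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_synthetic=6, k=2)
```

Each synthetic point lies on the segment between a minority example and one of its neighbors, so it stays inside the region already occupied by the minority class. In a partitioned setting, a sketch like this would only see the neighbors available within each partition, which is presumably why independence from the number of partitions matters for the quality of the generated instances.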