Data stream treatment using sliding windows with MapReduce

Authors

  • María José Basgall, Instituto de Investigación en Informática (III-LIDI), Facultad de Informática, Universidad Nacional de La Plata
  • Waldo Hasperué, Instituto de Investigación en Informática (III-LIDI), Facultad de Informática, Universidad Nacional de La Plata
  • Marcelo Naiouf, Instituto de Investigación en Informática (III-LIDI), Facultad de Informática, Universidad Nacional de La Plata

Keywords:

big data, mapreduce, stream processing

Abstract

Knowledge Discovery in Databases (KDD) techniques have limitations when the volume of data to process is very large. Any KDD algorithm needs to perform several iterations over the complete data set to carry out its work. To process a continuous data stream, part of the stream must be stored in a temporal window. In this paper, we present a technique that adjusts the size of the temporal window dynamically, based on the arrival frequency of the data and the response time of the KDD task. The results show that this technique reaches a large window size, in which each example of the stream is used in more than one iteration of the KDD task.
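
The idea of sizing the window from the arrival frequency and the response time can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the method from the paper: the sizing rule (keep `reuse` runs' worth of arrivals), the linear response-time model, and all function and parameter names are hypothetical, and the KDD task itself is replaced by a cost model rather than a MapReduce job.

```python
import math
from collections import deque


def arrivals_during(arrival_rate, response_time):
    """Examples that arrive while one run of the KDD task is busy."""
    return max(1, math.ceil(arrival_rate * response_time))


def window_capacity(arrivals_per_run, reuse=2):
    """Hypothetical sizing rule (an assumption, not the paper's formula):
    keep `reuse` runs' worth of arrivals in the window."""
    return reuse * arrivals_per_run


def simulate(arrival_rate, overhead, cost_per_example, runs, reuse=2):
    """Run `runs` iterations of a stand-in KDD task over a sliding window.

    The response time is modelled as overhead + cost_per_example * window
    size (seconds); this cost model is an illustrative assumption.
    Returns the window size seen by each run and, for every example id,
    the number of runs that included it.
    """
    window = deque()
    usage = {}
    sizes = []
    next_id = 0
    arrivals = 1  # prime the stream with a single example
    capacity = window_capacity(arrivals, reuse)
    for _run in range(runs):
        # Append the examples that arrived while the previous run was busy.
        for _ in range(arrivals):
            window.append(next_id)
            next_id += 1
        # Slide the window: drop the oldest examples beyond the capacity.
        while len(window) > capacity:
            window.popleft()
        # The stand-in KDD task "uses" every example currently in the window.
        for example in window:
            usage[example] = usage.get(example, 0) + 1
        sizes.append(len(window))
        # Re-size the window from the measured response time and arrival rate.
        response_time = overhead + cost_per_example * len(window)
        arrivals = arrivals_during(arrival_rate, response_time)
        capacity = window_capacity(arrivals, reuse)
    return sizes, usage


sizes, usage = simulate(arrival_rate=100, overhead=0.5,
                        cost_per_example=0.001, runs=8)
print(sizes)
```

In this model the window grows while the response time grows and settles once the arrivals per run stop changing. Every example that arrived before the final run is then seen by at least two runs, which mirrors the behaviour the abstract reports.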

Published

2016-11-01

How to Cite

Basgall, M. J., Hasperué, W., & Naiouf, M. (2016). Data stream treatment using sliding windows with MapReduce. Journal of Computer Science and Technology, 16(02), p. 76–83. Retrieved from https://journal.info.unlp.edu.ar/JCST/article/view/498

Section

Original Articles
