A new AntTree-based algorithm for clustering short-text corpora

  • Marcelo Luis Errecalde Development and Research Laboratory in Computational Intelligence (LIDIC), Universidad Nacional de San Luis, San Luis, Argentina
  • Diego Alejandro Ingaramo Development and Research Laboratory in Computational Intelligence (LIDIC), Universidad Nacional de San Luis, San Luis, Argentina
  • Paolo Rosso Natural Language Engineering Lab., ELiRF, Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Valencia, Spain
Keywords: internal validity measures, AntTree, short-text clustering, bio-inspired algorithms, Silhouette Coefficient

Abstract

Short-text clustering is an important research area, driven by the growing use of "small-language" media such as blogs and text messaging. Recent work has proposed new bio-inspired clustering algorithms for this difficult problem, together with novel uses of internal clustering validity measures. This work proposes a new AntTree-based approach to the task. It integrates information from the Silhouette Coefficient and the concept of attraction of a cluster at different stages of the clustering process. The proposal achieves results comparable to the best reported in this area, shows stable result quality, and shows promise as a general improvement method for arbitrary clustering approaches.
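
The abstract does not define the Silhouette Coefficient, so the following is only a minimal sketch of Rousseeuw's standard definition, not the authors' implementation. For an object i, a(i) is its mean distance to the other members of its own cluster, b(i) is the smallest mean distance to the members of any other cluster, and s(i) = (b(i) - a(i)) / max(a(i), b(i)). The function names, the Euclidean distance, and the convention s(i) = 0 for singleton clusters are assumptions made for illustration; for text, a dissimilarity such as one minus cosine similarity over document vectors would be a more natural choice.

```python
from math import sqrt


def euclidean(p, q):
    """Illustrative distance; any dissimilarity function can be passed instead."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette(points, labels, dist=euclidean):
    """Return the silhouette value s(i) of every point.

    Assumes at least two distinct cluster labels are present.
    """
    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Assumed convention: a singleton cluster gets s(i) = 0.
            scores.append(0.0)
            continue
        # a(i): mean distance to the other members of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the members of another cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def mean_silhouette(points, labels, dist=euclidean):
    """Average s(i) over all points, a common global quality score."""
    scores = silhouette(points, labels, dist)
    return sum(scores) / len(scores)
```

Values close to 1 indicate objects that sit well inside their cluster, values near 0 indicate objects on a boundary, and negative values suggest an object would fit better in a neighbouring cluster.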

Published
2010-04-01
How to Cite
Errecalde, M. L., Ingaramo, D. A., & Rosso, P. (2010). A new AntTree-based algorithm for clustering short-text corpora. Journal of Computer Science and Technology, 10(01), p. 1-7. Retrieved from http://journal.info.unlp.edu.ar/JCST/article/view/708
Section
Invited Articles