Keyword identification in Spanish documents using neural networks
Keywords: keyword extraction, autoencoders, neural networks
The large amount of textual information available in digital form today creates a need for effective means of indexing, searching, and retrieving it. Keywords describe the contents of a textual document briefly and precisely. In this paper we present an algorithm for keyword extraction from documents written in Spanish. The algorithm combines autoencoders, which are well suited to highly unbalanced classification problems, with the discriminative power of conventional binary classifiers. To improve performance on larger and more diverse datasets, the algorithm trains several models of each kind through bagging.
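As a rough illustration of the bagging step only, the following Python sketch trains several models on bootstrap replicates of a training set and combines them by majority vote. It is not the algorithm presented in this paper: the base learner is a hypothetical one-feature threshold classifier standing in for the autoencoders and binary classifiers, and the names `bootstrap_sample`, `train_threshold_model`, `train_bagged_ensemble`, and `predict`, the toy data, and the majority-vote combination rule are all assumptions made for the example.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement: one bagging replicate."""
    return [rng.choice(data) for _ in data]


def train_threshold_model(sample):
    """Hypothetical stand-in base learner on (score, label) pairs.

    Places a threshold midway between the two class means. If a replicate
    happens to contain only one class, it falls back to the sample mean.
    """
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if pos and neg:
        threshold = (mean(pos) + mean(neg)) / 2
    else:
        threshold = mean([x for x, _ in sample])
    return lambda x: 1 if x >= threshold else 0


def train_bagged_ensemble(data, n_models=11, seed=0):
    """Train n_models base learners, each on its own bootstrap replicate."""
    rng = random.Random(seed)
    return [train_threshold_model(bootstrap_sample(data, rng))
            for _ in range(n_models)]


def predict(ensemble, x):
    """Majority vote over the ensemble (an odd n_models avoids ties)."""
    votes = Counter(model(x) for model in ensemble)
    return votes.most_common(1)[0][0]


# Toy, made-up data: (candidate score, is_keyword).
# Keywords are the minority class, as in an unbalanced problem.
data = [(0.9, 1), (0.8, 1),
        (0.1, 0), (0.2, 0), (0.3, 0), (0.15, 0), (0.25, 0), (0.12, 0)]

ensemble = train_bagged_ensemble(data)
high = predict(ensemble, 0.95)  # candidate scored like a keyword
low = predict(ensemble, 0.05)   # candidate scored like a non-keyword
```

Because each replicate is drawn with replacement, a model may see few or no examples of the minority class; the fallback in `train_threshold_model` is one simple way to keep the sketch runnable in that case, not a recommendation.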