Keyword identification in Spanish documents using neural networks

Authors

  • Germán Osvaldo Aquino, Instituto de Investigación en Informática LIDI, Facultad de Informática, Universidad Nacional de La Plata
  • Laura Cristina Lanzarini, Instituto de Investigación en Informática LIDI, Facultad de Informática, Universidad Nacional de La Plata

Keywords:

keyword extraction, autoencoders, neural networks

Abstract

The large amount of textual information now available in digital form creates a need for effective means of indexing, searching and retrieving it. Keywords give a brief and precise description of the contents of a textual document. In this paper we present an algorithm for keyword extraction from documents written in Spanish. The algorithm combines autoencoders, which are well suited to highly unbalanced classification problems, with the discriminative power of conventional binary classifiers. To improve its performance on larger and more diverse datasets, the algorithm trains several models of each kind through bagging.
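
The combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature vectors are synthetic, the network sizes, learning rates and 0.95 error quantile are arbitrary, logistic regression stands in for the "conventional binary classifier", and the rule that joins the two models (a candidate counts as a keyword only if both agree, followed by a majority vote across bags) is an assumption made for illustration. Each bag trains an autoencoder on a bootstrap sample of keyword examples only, so a low reconstruction error signals "looks like a keyword", and a binary classifier on a bootstrap sample of all candidates.

```python
import numpy as np

def train_autoencoder(X, hidden=2, epochs=300, lr=0.05, seed=0):
    """Fit a one-hidden-layer autoencoder to X by batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1, b1 = rng.normal(0, 0.1, (d, hidden)), np.zeros(hidden)
    W2, b2 = rng.normal(0, 0.1, (hidden, d)), np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)          # encode
        G = 2 * (H @ W2 + b2 - X) / n     # gradient of squared error w.r.t. reconstruction
        GH = (G @ W2.T) * (1 - H ** 2)    # backpropagate through tanh
        W2 -= lr * (H.T @ G)
        b2 -= lr * G.sum(axis=0)
        W1 -= lr * (X.T @ GH)
        b1 -= lr * GH.sum(axis=0)
    return W1, b1, W2, b2

def reconstruction_error(model, X):
    """Mean squared reconstruction error of each row of X."""
    W1, b1, W2, b2 = model
    R = np.tanh(X @ W1 + b1) @ W2 + b2
    return ((R - X) ** 2).mean(axis=1)

def train_logistic(X, y, epochs=300, lr=0.5):
    """Logistic regression, used here as a stand-in binary classifier."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b

def fit_bagged(X, y, n_bags=5, seed=0):
    """Train one (autoencoder, threshold, classifier) triple per bootstrap sample."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    bags = []
    for k in range(n_bags):
        pos_idx = rng.choice(pos, size=len(pos), replace=True)    # keywords only
        all_idx = rng.choice(len(y), size=len(y), replace=True)   # all candidates
        ae = train_autoencoder(X[pos_idx], seed=seed + k)
        # Assumed rule: accept errors up to the 0.95 quantile seen on keywords.
        threshold = np.quantile(reconstruction_error(ae, X[pos_idx]), 0.95)
        clf = train_logistic(X[all_idx], y[all_idx])
        bags.append((ae, threshold, clf))
    return bags

def predict_bagged(bags, X):
    """Majority vote over bags; within a bag both models must agree (assumption)."""
    votes = np.zeros(len(X))
    for ae, threshold, (w, b) in bags:
        looks_like_keyword = reconstruction_error(ae, X) <= threshold
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        votes += looks_like_keyword & (p >= 0.5)
    return (votes > len(bags) / 2).astype(int)

# Synthetic, unbalanced data: 200 non-keyword and 20 keyword candidates,
# each described by 4 hypothetical numeric features.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (200, 4)), rng.normal(2.0, 0.5, (20, 4))])
y = np.array([0] * 200 + [1] * 20)
bags = fit_bagged(X, y)
pred = predict_bagged(bags, X)
```

Here `pred` holds one 0/1 decision per candidate. In a real pipeline the rows of `X` would be features computed for each candidate term, and the number of bags and the voting rule would be tuned on held-out documents.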

Published

2015-11-01

How to Cite

Aquino, G. O., & Lanzarini, L. C. (2015). Keyword identification in Spanish documents using neural networks. Journal of Computer Science and Technology, 15(02), pp. 55–60. Retrieved from https://journal.info.unlp.edu.ar/JCST/article/view/554

Section

Original Articles
