Cross-domain author profiling task in Spanish language: an experimental study
Keywords: Author Profiling, Natural Language Processing, Cross-Domain Classification, Spanish Language, Text Mining
Author profiling is the task of predicting characteristics of the author of a text, such as age, gender, personality, or native language. The task is of growing importance because of its potential applications in security, crime detection, and marketing, among others. An interesting question is how robust a classifier is when it is trained on one data set and tested on others with different characteristics; this is commonly called cross-domain experimentation. Although several cross-domain studies have been carried out on English data sets, such work has only recently begun for Spanish. In this context, this work presents a study of cross-domain classification for the author profiling task in Spanish. The experimental results show that robust classifiers for author profiling in Spanish can be obtained by using corpora with different levels of formality.
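The cross-domain setup described above (train on one data set, test on the others) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the bag-of-words Naive Bayes classifier is a simple stand-in rather than the classifier used in this study, the function names are hypothetical, and the two tiny "domains" with labels `a` and `b` are invented toy data, not any of the corpora or results of this work.

```python
from collections import Counter
import math


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model (add-one smoothing) on one domain."""
    word_counts = {}
    class_counts = Counter(labels)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict_nb(model, text):
    """Return the most probable class label for a single text."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(class_counts[label] / total_docs)
        for tok in text.lower().split():
            if tok in vocab:  # words unseen in the training domain are skipped
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, docs, labels):
    """Fraction of texts whose predicted label matches the true label."""
    hits = sum(predict_nb(model, d) == y for d, y in zip(docs, labels))
    return hits / len(docs)


def cross_domain_eval(domains):
    """Train on each domain in turn and test on every other domain."""
    results = {}
    for src, (src_docs, src_labels) in domains.items():
        model = train_nb(src_docs, src_labels)
        for tgt, (tgt_docs, tgt_labels) in domains.items():
            if tgt != src:
                results[(src, tgt)] = accuracy(model, tgt_docs, tgt_labels)
    return results


# Invented toy data: two "domains" sharing the same placeholder labels.
domains = {
    "domain_1": (
        ["estimado señor le escribo", "atentamente le saludo señor",
         "jaja que bueno genial", "genial jaja nos vemos"],
        ["a", "a", "b", "b"],
    ),
    "domain_2": (
        ["señor le saludo", "jaja genial"],
        ["a", "b"],
    ),
}

results = cross_domain_eval(domains)
for (src, tgt), acc in sorted(results.items()):
    print(f"train={src} test={tgt} accuracy={acc:.2f}")
```

Each entry of `results` is keyed by a (training domain, test domain) pair, so a drop in accuracy relative to in-domain evaluation would indicate how much the classifier depends on the characteristics of its training corpus.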