Towards Information Quality Assurance in Spanish: Wikipedia

Authors

  • Edgardo Ferretti, Departamento de Informática, Universidad Nacional de San Luis (UNSL), San Luis, 5700, Argentina
  • Matías Soria, Departamento de Informática, Universidad Nacional de San Luis (UNSL), San Luis, 5700, Argentina
  • Sebastián Pérez Casseignau, Departamento de Informática, Universidad Nacional de San Luis (UNSL), San Luis, 5700, Argentina
  • Lian Pohn, Departamento de Informática, Universidad Nacional de San Luis (UNSL), San Luis, 5700, Argentina
  • Guido Urquiza, Departamento de Informática, Universidad Nacional de San Luis (UNSL), San Luis, 5700, Argentina
  • Sergio Alejandro Gómez, Comisión de Investigaciones Científicas de la Provincia de Buenos Aires (CIC-PBA)
  • Marcelo Luis Errecalde, Departamento de Informática, Universidad Nacional de San Luis (UNSL), San Luis, 5700, Argentina

Keywords:

featured article identification, information quality, quality flaws prediction, Wikipedia

Abstract

Featured Articles (FA) are considered to be the best articles that Wikipedia has to offer, and in recent years researchers have found it interesting to analyze whether and how they can be distinguished from “ordinary” articles. Likewise, identifying which issues have to be improved or fixed in ordinary articles in order to raise their quality is a key recent research trend. Most of the approaches developed to address these information quality problems have been proposed for the English Wikipedia. However, little work has been done on the Spanish Wikipedia, despite Spanish being one of the most spoken languages in the world by number of native speakers. In this respect, we present a breakdown of Spanish Wikipedia’s quality flaw structure. In addition, we carry out studies with three different corpora to automatically assess information quality in Spanish Wikipedia, where FA identification is evaluated as a binary classification task. Our evaluation in a unified setting allows the performance achieved by our approach on the Spanish version to be compared with that on the English version. The best results obtained show that FA identification in Spanish can be performed with an F1 score of 0.88 using a document model consisting of only twenty-six features and a Support Vector Machine as the classification algorithm.
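
To make the evaluation measure concrete, the following minimal Python sketch shows how an F1 score is computed when FA identification is treated as a binary classification task. The function name `f1_score_binary`, the label encoding (1 = featured, 0 = ordinary) and the toy labels are illustrative assumptions, not the authors' code or data.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Return the F1 score for the positive class (assumed here: 'featured')."""
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive prediction: precision or recall is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 1 = featured article, 0 = ordinary article.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
score = f1_score_binary(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

In this toy case precision and recall are both 3/4, so their harmonic mean is the same value. The 0.88 reported in the abstract comes from the authors' own corpora and classifier, which this sketch does not reproduce.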

Published

2017-04-01

How to Cite

Ferretti, E., Soria, M., Pérez Casseignau, S., Pohn, L., Urquiza, G., Gómez, S. A., & Errecalde, M. L. (2017). Towards Information Quality Assurance in Spanish: Wikipedia. Journal of Computer Science and Technology, 17(01), p. 29–36. Retrieved from https://journal.info.unlp.edu.ar/JCST/article/view/453

Section

Original Articles
