Improving Real Time Search Performance using Inverted Index Entries Invalidation Strategies
The impressive rise of user-generated content on the web in the hands of sites like Twitter imposes new challenges to search systems. The concept of real-time search emerges, increasing the role that efficient indexing and retrieval algorithms play in this scenario. Thousands of new updates need to be processed in the very moment they are generated and users expect content to be “searchable” within seconds. This lead to the develop of efficient data structures and algorithms that may face this challenge efficiently. In this work, we introduce the concept of index entry invalidator, a strategy responsible for keeping track of the evolution of the underlying vocabulary and selectively invalidate and evict those inverted index entries that do not considerably degrade retrieval effectiveness. Consequently, the index becomes smaller and may increase overall efficiency. We introduce and evaluate two approaches based on Time-to-Live and Sliding Windows criteria. We also study the dynamics of the vocabulary using a real dataset while the evaluation is carry out using a search engine specifically designed for real-time indexing and search.
 H. Kwak, C. Lee, H. Park, and S. Moon, “What is twitter, a social network or a news media?,” in Proceedings of the 19th International Conference on World Wide Web, WWW ’10, 2010.
 C. Chen, F. Li, B. C. Ooi, and S. Wu, “Ti: An efficient indexing mechanism for real-time search on tweets,” in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD ’11, 2011.
 M. Efron, “Information search and retrieval in microblogs,” J. Am. Soc. Inf. Sci. Technol.,vol. 62, no. 6, 2011.
 J. Teevan, D. Ramage, and M. R. Morris, “#twittersearch: A comparison of microblog search and web search,” in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM ’11, 2011.
 S. Nepomnyachiy, B. Gelley, W. Jiang, and T. Minkus, “What, where, and when: Keyword search with spatio-temporal ranges,” in Proceedings of the 8th Workshop on Geographic Information Retrieval, GIR ’14, 2014.
 M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. Lin, “Earlybird: Real-time search at twitter,” in Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, ICDE ’12, 2012.
 N. Asadi, J. Lin, and M. Busch, “Dynamic memory allocation policies for postings in real-time twitter search,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, 2013.
 A. Java, X. Song, T. Finin, and B. Tseng, “Why we twitter: Understanding microblogging usage and communities,” in Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, WebKDD/SNA-KDD ’07, 2007.
 R. McCreadie, I. Soboroff, J. Lin, C. Macdonald, I. Ounis, and D. McCullough, “On building a reusable twitter corpus,” in Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’12, 2012.
 G. Pass, A. Chowdhury, and C. Torgeson, “A picture of search,” in Proceedings of the 1st International Conference on Scalable Information Systems, InfoScale ’06, 2006.
 E. Ríssola and G. Tolosa, “Inverted index entry invalidation strategy for real time search,” in Proceedings of the XXI Congreso Argentino de Ciencias de la Computación, CACIC ’15, 2015.
 C. D. Manning, P. Raghavan, and H. Sch ̈utze, Introduction to Information Retrieval. Cambridge University Press, 2008.
 R. Blanco, E. Bortnikov, F. Junqueira, R. Lempel, L. Telloli, and H. Zaragoza, “Caching search engine results over incremental indices,” in Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’10, 2010.
 N. Asadi and J. Lin, “Fast candidate generation for real-time tweet search with bloom filter chains,” ACM Trans. Inf. Syst., vol. 31, no. 3, 2013.
 J. Zobel and A. Moffat, “Inverted files for text search engines,” ACM Comput. Surv., vol. 38, no. 2, 2006.
 R. A. Baeza-Yates and B. A. Ribeiro-Neto, Modern Information Retrieval - the concepts and technology behind search, Second edition. Pearson Education Ltd., Harlow, England, 2011.
 I. H. Witten, A. Moffat, and T. C. Bell, Managing Gigabytes (2Nd Ed.): Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers Inc., 1999.
 N. Asadi and J. Lin, “An exploration of postings list contiguity in main-memory incremental indexing,” LSDS-IR ’14, 2014.
 C. E. Grant, C. P. George, C. Jenneisch, and J. N. Wilson, “Online topic modeling for real-time twitter search,” in Proceedings of The Twentieth Text REtrieval Conference, TREC 2011, Gaithersburg, Maryland, USA, November 15-18, 2011, 2011.
 J. Choi and W. B. Croft, “Temporal models for microblogs,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM ’12, 2012.
 D. Metzler, C. Cai, and E. Hovy, “Structured event retrieval over microblog archives,” in Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT ’12, 2012.
 I. Soboroff, I. Ounis, C. Macdonald, and J. Lin, “Overview of the trec-2012 microblog track,” in In Proceedings of TREC 2012, 2012.
 J. Lin and G. Mishne, “A study of ”churn” in tweets and real-time search queries,” in Proceedings of the Sixth International Conference on Weblogs and Social Media, Dublin, Ireland, June 4-7, 2012, 2012.
 B. B. Cambazoglu, F. P. Junqueira, V. Plachouras, S. Banachowski, B. Cui, S. Lim, and B. Bridge, “A refreshing perspective of search engine caching,” in Proceedings of the 19th International Conference on World Wide Web, WWW ’10, 2010.
 A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien, “Efficient query evaluation using a two-level retrieval process,” in Proceedings of the Twelfth International Conference on Information and Knowledge Management, CIKM ’03, 2003.
 D. McCullough, J. Lin, C. Macdonald, I. Ounis, and R. McCreadie, “Evaluating real-time search over tweets,” in Proceedings of the Sixth International Conference on Weblogs and Social Media, Dublin, Ireland, June 4-7, 2012, 2012.