Index structures for distributed text databases

Authors

  • Juan Mauricio Marin Cahiuan Departamento de Computación, Universidad de Magallanes, Punta Arenas, Chile

Abstract

The Web has became an obiquitous resource for distributed computing making it relevant to investigate new ways of providing efficient access to services available at dedicated sites. Efficiency is an ever-increasing demand which can be only satisfied with the development of parallel algorithms which are efficient in practice. This tutorial paper focuses on the design, analysis and implementation of parallel algorithms and data structures for widely-used text database applications on the Web. In particular we describe parallel algorithms for inverted files and suffix arrays structures that are suitable for implementing search engines. Algorithmic design is effected on top of the BSP model of parallel computing. This model ensures portability across diverse parallel architectures ranging from clusters to super-computers.

Downloads

Download data is not yet available.

References

[1] M.D. Araujo, G. Navarro, and N. Ziviani. Large text searching allowing errors. In Workshop on String Processing (WSP’97), pages 2–20. (Carleton University Press), 1997.
[2] C. Badue, R. Baeza-Yates, B. Ribeiro, and N. Ziviani. Distributed query processing using partitioned inverted files. In Eighth Symposium on String Processing and Information Retrieval (SPIRE’01), pages 10–20. (IEEE CS Press), Nov. 2001.
[3] R. Baeza and B. Ribeiro. Modern Information Retrieval. Addison-Wesley., 1999.
[4] R. Baeza-Yates, A. Moffat, and G. Navarro. Handbook of Massive Data Sets. Kluwer Academic Publishers, 2002. Chapter 7, pages 195–243.
[5] N. Barghouti and G. Kaiser. Concurrency control in advanced database applications. ACM Computing Surveys, 23(3), September 1991.
[6] B. K. Bhargava. Concurrency control in database systems. Knowledge and Data Engineering, 11(1):3–16, 1999.
[7] S. Blott and H. Korth. An almost-serial protocol for transaction execution in main-memory database systems. In 28th International Conference on Very Large Data Bases, Aug. 2002. Hong Kong, China.
[8] N. Brisaboa, E. Iglesias, G. Navarro, and J. Paramá. An efficient compression code for text databases. In Proceedings of the 25th European Conference on Information Retrieval Research (ECIR’03), LNCS 2633, pages 468–481, 2003.
[9] S.H. Chung, H.C. Kwon, K.R. Ryu, H.K. Jang, J.H. Kim, and C.A. Choi. Parallel information retrieval on a sci-based pc-now. In Workshop on Personal Computers based Networks of Workstations (PC-NOW 2000). (Springer-Verlag), May 2000.
[10] E. Silva de Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems, 18(2):113–139, April 2000.
[11] Daniela Florescu, Alon Y. Levy, and Alberto O. Mendelzon. Database techniques for the world-wide web: A survey. SIGMOD Record, 27(3):59–74, 1998.
[12] D.R. Jefferson. Virtual time. ACM Trans. Prog. Lang. and Syst., 7(3):404–425, July 1985.
[13] M. Kamath and K. Ramamritham. Efficient transaction management and query processing in massive digital databases. Technical Report UM-CS-1995-093, University of Massachusetts, , 1995.
[14] J. Kitajima and G. Navarro. A fast distributed suffix array generation algorithm. In 6th International Symposium on String Processing and Information Retrieval, pages 97–104, 1999.
[15] M. Marín. Time Warp on BSP Computers. In 12th European Simulation Multiconference, June 1998.
[16] M. Marín. Parallel text query processing using Composite Inverted Lists. In Second International Conference on Hybrid Intelligent Systems (Web Computing Session). IO Press, Feb. 2003.
[17] M. Marín and G. Navarro. Distributed query processing using suffix arrays. In Int. Conf. on String Processing and Information Retrieval, Lecture Notes in Computer Science. Springer-Verlag, Sept. 2003. (to appear).
[18] M. Marín and G. Navarro. Suffix arrays in parallel. In International Conference on Parallel Processing (EuroPar’03), Lecture Notes in Computer Science, page (to appear). Springer-Verlag, Aug. 2003.
[19] A.A. MacFarlane, J.A. McCann, and S.E. Robertson. Parallel search using partitioned inverted files. In 7th International Symposium on String Processing and Information Retrieval, pages 209–220. (IEEE CS Press), 2000.
[20] W.F. McColl. General purpose parallel computing. In A.M. Gibbons and P. Spirakis, editors, Lectures on Parallel Computation, pages 337–391. Cambridge University Press, 1993.
[21] A. Moffat. Word-based text compression. Software – practice and experience, 19(2):185–198, 1989.
[22] A. Moffat and A. Turpin. On the implementation of minimum-redundancy prefix codes. In Data Compression Conference, pages 170–179, 1996.
[23] G. Navarro, E. Silva de Moura, M. Neubert, N. Ziviani, and R. Baeza-Yates. Adding compression to block addressing inverted indexes. Information retrieval, 3(1):49–77, 2000.
[24] G. Navarro, J. Kitajima, B. Ribeiro, and N. Ziviani. Distributed generation of suffix arrays. In 8th Annual Symposium on Combinatorial Pattern Matching, pages 102–115. (LNCS 1264), 1997.
[25] M. Persin, J. Zobel, and R. Sacks-Davis. Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47(10):749–764, 1996.
[26] B. Ribeiro-Neto, J. Kitajima, G. Navarro, C. Santana, and N. Ziviani. Parallel generation of inverted lists for distributed text collections. In XVIII Conference of the Chilean Computer Science Society, pages 149–157. (IEEE CS Press), 1998.
[27] B.A. Ribeiro-Neto and R.A. Barbosa. Query performance for tightly coupled distributed digital libraries. In Third ACM Conference on Digital Libraries, pages 182–190. (ACM Press), 1998.
[28] S.Dandamudi and J.Jain. Architectures for parallel query processing on networks of workstation. In 1997 International Conference on Parallel and Distributed Computing Systems, Oct. 1997.
[29] D.B. Skillicorn, J.M.D. Hill, and W.F. McColl. Questions and answers about BSP. Technical Report PRG-TR-15-96, Computing Laboratory, Oxford University, 1996. Also in Journal of Scientific Programming,V.6 N.3, 1997.
[30] D.B. Skillicorn and D. Talia. Models and languages for parallel computation. ACM Computing Surveys, 20(2):123–169, June 1998.
[31] T. Tamura, M. Oguchi, and M. Kitsuregawa. Parallel database processing on a 100 node pc cluster: Cases for decision support query processing and data mining. In SC’97, 1997.
[32] A. Tomasic and H. Garcia-Molina. Performance of inverted indices in shared-nothing distributed text document information retrieval systems. In Second International Conference on Parallel and Distributed Information Systems, pages 8–17, 1993.
[33] URL. BSP and Worldwide Standard, http://www.bsp-worldwide.org/.
[34] URL. BSP PUB Library at Paderborn University, http://www.uni-paderborn.de/bsp.
[35] L.G. Valiant. A bridging model for parallel computation. Comm. ACM, 33:103–111, Aug. 1990.
[36] C. Yu and C. Chang. Distributed query processing. ACM Computing Surveys, 16(4):399–433, December 1984

Downloads

Published

2004-04-01

How to Cite

Marin Cahiuan, J. M. (2004). Index structures for distributed text databases. Journal of Computer Science and Technology, 4(01), p. 1–12. Retrieved from https://journal.info.unlp.edu.ar/JCST/article/view/907

Issue

Section

Invited Articles

Most read articles by the same author(s)