Information Retrieval – Bibliography

Information Retrieval: Implementing and Evaluating Search Engines

Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack
MIT Press, 2010

Bibliography

Chapter 1: Introduction

Baeza-Yates, R. A., and Ribeiro-Neto, B. (2010). Modern Information Retrieval (2nd ed.). Reading, Massachusetts: Addison-Wesley.
Croft, W. B., Metzler, D., and Strohman, T. (2010). Search Engines: Information Retrieval in Practice. London, England: Pearson.
Grossman, D. A., and Frieder, O. (2004). Information Retrieval: Algorithms and Heuristics (2nd ed.). Berlin, Germany: Springer.
Hearst, M. A. (2009). Search User Interfaces. Cambridge, England: Cambridge University Press.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge, England: Cambridge University Press.
Manning, C. D., and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts: MIT Press.
Özsu, M. T., and Liu, L., editors (2009). Encyclopedia of Database Systems. Berlin, Germany: Springer.
Salton, G. (1968). Automatic Information Organization and Retrieval. New York: McGraw-Hill.
van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). London, England: Butterworths.
Witten, I. H., Moffat, A., and Bell, T. C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images (2nd ed.). San Francisco, California: Morgan Kaufmann.
Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Cambridge, Massachusetts: Addison-Wesley.

Chapter 2: Basic Techniques

Baeza-Yates, R. (2004). A fast set intersection algorithm for sorted sequences. In Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching, pages 400–408. Istanbul, Turkey.
Barbay, J., and Kenyon, C. (2002). Adaptive intersection and t-threshold problems. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 390–399. San Francisco, California.
Bentley, J. L., and Yao, A. C. C. (1976). An almost optimal algorithm for unbounded searching. Information Processing Letters, 5(3):82–87.
Buckley, C. (2005). The SMART project at TREC. In Voorhees, E. M., and Harman, D. K., editors, TREC — Experiment and Evaluation in Information Retrieval, chapter 13, pages 301–320. Cambridge, Massachusetts: MIT Press.
Buckley, C., Salton, G., Allan, J., and Singhal, A. (1994). Automatic query expansion using SMART: TREC 3. In Proceedings of the 3rd Text REtrieval Conference. Gaithersburg, Maryland.
Clarke, C. L. A., Cormack, G. V., and Tudhope, E. A. (2000). Relevance ranking for one to three term queries. Information Processing & Management, 36(2):291–311.
Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.
Demaine, E. D., López-Ortiz, A., and Munro, J. I. (2000). Adaptive set intersections, unions, and differences. In Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 743–752. San Francisco, California.
Faloutsos, C. (1985). Access methods for text. ACM Computing Surveys, 17(1):49–74.
Faloutsos, C., and Christodoulakis, S. (1984). Signature files: An access method for documents and its analytical performance evaluation. ACM Transactions on Office Information Systems, 2(4):267–288.
Gonnet, G. H. (1987). Pat 3.1 — An Efficient Text Searching System — User's Manual. University of Waterloo, Canada.
Gonnet, G. H., Baeza-Yates, R. A., and Snider, T. (1992). New indices for text — pat trees and pat arrays. In Frakes, W. B., and Baeza-Yates, R., editors, Information Retrieval — Data Structures and Algorithms, chapter 5, pages 66–82. Englewood Cliffs, New Jersey: Prentice Hall.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57. Berkeley, California.
Knuth, D. E. (1973). The Art of Computer Programming, volume 3. Reading, Massachusetts: Addison-Wesley.
Luhn, H. P. (1957). A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development, 1(4):309–317.
Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2):159–165.
Manber, U., and Myers, G. (1990). Suffix arrays: A new method for on-line string searches. In Proceedings of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms, pages 319–327. San Francisco, California.
Salton, G. (1968). Automatic Information Organization and Retrieval. New York: McGraw-Hill.
Singhal, A., Salton, G., Mitra, M., and Buckley, C. (1996). Document length normalization. Information Processing & Management, 32(5):619–633.
van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). London, England: Butterworths.
Weiner, P. (1973). Linear pattern matching algorithms. In Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, pages 1–11. Iowa City, Iowa.
Zobel, J., Moffat, A., and Ramamohanarao, K. (1998). Inverted files versus signature files for text indexing. ACM Transactions on Database Systems, 23(4):453–490.

Chapter 3: Tokens and Terms

Asian, J., Williams, H. E., and Tahaghoghi, S. M. M. (2005). Stemming Indonesian. In Proceedings of the 28th Australasian Computer Science Conference, pages 307–314. Newcastle, Australia.
Beitzel, S., Jensen, E., and Grossman, D. (2002). Retrieving OCR text: A survey of current approaches. In Proceedings of the SIGIR 2002 Workshop on Information Retrieval and OCR: From Converting Content to Grasping Meaning. Tampere, Finland.
Braschler, M., and Ripplinger, B. (2004). How effective is stemming and decompounding for German text retrieval? Information Retrieval, 7(3–4):291–316.
Brill, E., and Moore, R. C. (2000). An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 286–293. Hong Kong, China.
Creutz, M., and Lagus, K. (2002). Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pages 21–30.
Cucerzan, S., and Brill, E. (2004). Spelling correction as an iterative process that exploits the collective knowledge of Web users. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 293–300.
Frakes, W. B. (1984). Term conflation for information retrieval. In Proceedings of the 7th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 383–389. Cambridge, England.
Fujii, H., and Croft, W. B. (1993). A comparison of indexing techniques for Japanese text retrieval. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 237–246. Pittsburgh, Pennsylvania.
Gey, F. C., and Oard, D. W. (2001). The TREC-2001 cross-language information retrieval track: Searching Arabic using English, French or Arabic queries. In Proceedings of the 10th Text REtrieval Conference, pages 16–25. Gaithersburg, Maryland.
Gore, A. (2006). An Inconvenient Truth. Emmaus, Pennsylvania: Rodale.
Harman, D. (1991). How effective is suffixing? Journal of the American Society for Information Science, 42(1):7–15.
Hull, D. A. (1996). Stemming algorithms: A case study for detailed evaluation. Journal of the American Society for Information Science, 47(1):70–84.
Jain, A., Cucerzan, S., and Azzam, S. (2007). Acronym-expansion recognition and ranking on the Web. In Proceedings of the IEEE International Conference on Information Reuse and Integration, pages 209–214. Las Vegas, Nevada.
Kraaij, W., and Pohlmann, R. (1996). Using Linguistic Knowledge in Information Retrieval. Technical Report OTS-WP-CL-96-001. Research Institute for Language and Speech, Utrecht University.
Kukich, K. (1992). Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377–439.
Larkey, L. S., Ballesteros, L., and Connell, M. E. (2002). Improving stemming for Arabic information retrieval: Light stemming and co-occurrence analysis. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 275–282. Tampere, Finland.
Lazarinis, F., Vilares, J., Tait, J., and Efthimiadis, E. N. (2009). Introduction to the special issue on non-English Web retrieval. Information Retrieval, 12(3).
Li, M., Zhu, M., Zhang, Y., and Zhou, M. (2006). Exploring distributional similarity based models for query spelling correction. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 1025–1032. Sydney, Australia.
Lovins, J. B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1–2):22–31.
Luk, R. W. P., and Kwok, K. L. (2002). A comparison of Chinese document indexing strategies and retrieval models. ACM Transactions on Asian Language Information Processing, 1(3):225–268.
Majumder, P., Mitra, M., Parui, S. K., Kole, G., Mitra, P., and Datta, K. (2007). YASS: Yet another suffix stripper. ACM Transactions on Information Systems, 25(4):article 18.
Manning, C. D., and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts: MIT Press.
McNamee, P. (2008). Retrieval experiments at Morpho Challenge 2008. In Cross-Language Evaluation Forum. Aarhus, Denmark.
McNamee, P., and Mayfield, J. (2004). Character n-gram tokenization for European language text retrieval. Information Retrieval, 7(1–2):73–97.
McNamee, P., Nicholas, C., and Mayfield, J. (2008). Don't have a stemmer?: Be un+concern+ed. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 813–814. Singapore.
Nwesri, A. F. A., Tahaghoghi, S. M. M., and Scholer, F. (2005). Stemming Arabic conjunctions and prepositions. In Proceedings of the 12th International Conference on String Processing and Information Retrieval, pages 206–217. Buenos Aires, Argentina.
Paice, C. D. (1990). Another stemmer. ACM SIGIR Forum, 24(3):56–61.
Peng, F., Huang, X., Schuurmans, D., and Cercone, N. (2002). Investigating the relationship between word segmentation performance and retrieval performance in Chinese IR. In Proceedings of the 19th International Conference on Computational Linguistics. Taipei, Taiwan.
Pike, R., and Thompson, K. (1993). Hello world. In Proceedings of the Winter 1993 USENIX Conference, pages 43–50. San Diego, California.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3):130–137.
Ruch, P. (2002). Information retrieval and spelling correction: An inquiry into lexical disambiguation. In Proceedings of the 2002 ACM Symposium on Applied Computing, pages 699–703. Madrid, Spain.
Trask, L. (2004). What is a Word? Technical Report LxWP11/04. Department of Linguistics and English Language, University of Sussex, United Kingdom.
Voorhees, E. M., and Harman, D. K. (2005). The Text REtrieval Conference. In Voorhees, E. M., and Harman, D. K., editors, TREC — Experiment and Evaluation in Information Retrieval, chapter 1, pages 3–20. Cambridge, Massachusetts: MIT Press.

Chapter 4: Static Inverted Indices

Bender, M., Michel, S., Triantafillou, P., and Weikum, G. (2007). Design alternatives for large-scale Web search: Alexander was great, Aeneas a pioneer, and Anakin has the force. In Proceedings of the 1st Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS-IR), pages 16–22. Amsterdam, The Netherlands.
Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426.
Brin, S., and Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117.
Büttcher, S., and Clarke, C. L. A. (2005). Memory Management Strategies for Single-Pass Index Construction in Text Retrieval Systems. Technical Report CS-2005-32. University of Waterloo, Waterloo, Canada.
Carterette, B., and Can, F. (2005). Comparing inverted files and signature files for searching a large lexicon. Information Processing & Management, 41(3):613–633.
Clark, D. R., and Munro, J. I. (1996). Efficient suffix trees on secondary storage. In Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 383–391. Atlanta, Georgia.
Faloutsos, C., and Christodoulakis, S. (1984). Signature files: An access method for documents and its analytical performance evaluation. ACM Transactions on Office Information Systems, 2(4):267–288.
Heinz, S., and Zobel, J. (2003). Efficient single-pass index construction for text databases. Journal of the American Society for Information Science and Technology, 54(8):713–729.
Heinz, S., Zobel, J., and Williams, H. E. (2002). Burst tries: A fast, efficient data structure for string keys. ACM Transactions on Information Systems, 20(2):192–223.
Luk, R. W. P., and Lam, W. (2007). Efficient in-memory extensible inverted file. Information Systems, 32(5):733–754.
Manber, U., and Myers, G. (1990). Suffix arrays: A new method for on-line string searches. In Proceedings of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms, pages 319–327. San Francisco, California.
Moffat, A., and Bell, T. A. H. (1995). In-situ generation of compressed inverted files. Journal of the American Society for Information Science, 46(7):537–550.
Moffat, A., and Zobel, J. (1996). Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349–379.
Rao, J., and Ross, K. A. (1999). Cache conscious indexing for decision-support in main memory. In Proceedings of the 25th International Conference on Very Large Data Bases, pages 78–89. Edinburgh, Scotland.
Rao, J., and Ross, K. A. (2000). Making B+-trees cache conscious in main memory. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 475–486. Dallas, Texas.
Ukkonen, E. (1995). On-line construction of suffix trees. Algorithmica, 14(3):249–260.
Weiner, P. (1973). Linear pattern matching algorithms. In Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, pages 1–11. Iowa City, Iowa.
Witten, I. H., Moffat, A., and Bell, T. C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images (2nd ed.). San Francisco, California: Morgan Kaufmann.
Zobel, J., Heinz, S., and Williams, H. E. (2001). In-memory hash tables for accumulating text vocabularies. Information Processing Letters, 80(6):271–277.
Zobel, J., and Moffat, A. (2006). Inverted files for text search engines. ACM Computing Surveys, 38(2):1–56.
Zobel, J., Moffat, A., and Ramamohanarao, K. (1998). Inverted files versus signature files for text indexing. ACM Transactions on Database Systems, 23(4):453–490.

Chapter 5: Query Processing

Anh, V. N., de Kretser, O., and Moffat, A. (2001). Vector-space ranking with effective early termination. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 35–42. New Orleans, Louisiana.
Anh, V. N., and Moffat, A. (2004). Collection-independent document-centric impacts. In Proceedings of the 9th Australasian Document Computing Symposium, pages 25–32. Melbourne, Australia.
Anh, V. N., and Moffat, A. (2006). Pruned query evaluation using pre-computed impacts. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 372–379. Seattle, Washington.
Boldi, P., and Vigna, S. (2006). Efficient lazy algorithms for minimal-interval semantics. In Proceedings of the 13th International Conference on String Processing and Information Retrieval, pages 134–149. Glasgow, Scotland.
Broder, A. Z., Carmel, D., Herscovici, M., Soffer, A., and Zien, J. (2003). Efficient query evaluation using a two-level retrieval process. In Proceedings of the 12th International Conference on Information and Knowledge Management, pages 426–434. New Orleans, Louisiana.
Burkowski, F. J. (1992). An algebra for hierarchically organized text-dominated databases. Information Processing & Management, 28(3):333–348.
Büttcher, S., and Clarke, C. L. A. (2006). A document-centric approach to static index pruning in text retrieval systems. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pages 182–189. Arlington, Virginia.
Carmel, D., Cohen, D., Fagin, R., Farchi, E., Herscovici, M., Maarek, Y., and Soffer, A. (2001). Static index pruning for information retrieval systems. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43–50. New Orleans, Louisiana.
Carpineto, C., de Mori, R., Romano, G., and Bigi, B. (2001). An information-theoretic approach to automatic query expansion. ACM Transactions on Information Systems, 19(1):1–27.
Clarke, C. L. A., and Cormack, G. V. (2000). Shortest-substring retrieval and ranking. ACM Transactions on Information Systems, 18(1):44–78.
Clarke, C. L. A., Cormack, G. V., and Burkowski, F. J. (1995a). An algebra for structured text search and a framework for its implementation. Computer Journal, 38(1):43–56.
Clarke, C. L. A., Cormack, G. V., and Burkowski, F. J. (1995b). Schema-independent retrieval from heterogeneous structured text. In Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval, pages 279–289. Las Vegas, Nevada.
Consens, M. P., and Milo, T. (1995). Algebras for querying text regions. In Proceedings of the 14th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 11–22. San Jose, California.
Dao, T., Sacks-Davis, R., and Thom, J. A. (1996). Indexing structured text for queries on containment relationships. In Proceedings of the 7th Australasian Database Conference, pages 82–91. Melbourne, Australia.
Gonnet, G. H. (1987). Pat 3.1 — An Efficient Text Searching System — User's Manual. University of Waterloo, Canada.
Hawking, D., and Thistlewaite, P. (1994). Searching for meaning with the help of a PADRE. In Proceedings of the 3rd Text REtrieval Conference (TREC-3), pages 257–267. Gaithersburg, Maryland.
Jaakkola, J., and Kilpeläinen, P. (1999). Nested Text-Region Algebra. Technical Report CC-1999-2. Department of Computer Science, University of Helsinki, Finland.
Lester, N., Moffat, A., Webber, W., and Zobel, J. (2005). Space-limited ranked query evaluation using adaptive pruning. In Proceedings of the 6th International Conference on Web Information Systems Engineering, pages 470–477. New York.
Moffat, A., and Zobel, J. (1996). Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349–379.
Navarro, G., and Baeza-Yates, R. (1997). Proximal nodes: A model to query document databases by content and structure. ACM Transactions on Information Systems, 15(4):400–435.
Ntoulas, A., and Cho, J. (2007). Pruning policies for two-tiered inverted index with correctness guarantee. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 191–198. Amsterdam, The Netherlands.
Open Text Corporation (2001). Ten Years of Innovation. Waterloo, Canada: Open Text Corporation.
Persin, M., Zobel, J., and Sacks-Davis, R. (1996). Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47(10):749–764.
Salminen, A., and Tompa, F. W. (1994). Pat expressions — An algebra for text search. Acta Linguistica Hungarica, 41(1–4):277–306.
Smith, M. E. (1990). Aspects of the P-Norm Model of Information Retrieval: Syntactic Query Generation, Efficiency, and Theoretical Properties. Ph.D. thesis, Cornell University, Ithaca, New York.
Strohman, T., Turtle, H., and Croft, W. B. (2005). Optimization strategies for complex queries. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 219–225. Salvador, Brazil.
Turtle, H., and Flood, J. (1995). Query evaluation: Strategies and optimization. Information Processing & Management, 31(6):831–850.
Young-Lai, M., and Tompa, F. W. (2003). One-pass evaluation of region algebra expressions. Information Systems, 28(3):159–168.
Zhang, C., Naughton, J., DeWitt, D., Luo, Q., and Lohman, G. (2001). On supporting containment queries in relational database management systems. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 425–436. Santa Barbara, California.
Zhu, M., Shi, S., Yu, N., and Wen, J. R. (2008). Can phrase indexing help to process non-phrase queries? In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 679–688. Napa, California.

Chapter 6: Index Compression

Anh, V. N., and Moffat, A. (2005). Inverted index compression using word-aligned binary codes. Information Retrieval, 8(1):151–166.
Bell, T. C., Cleary, J. G., and Witten, I. H. (1990). Text Compression. Upper Saddle River, New Jersey: Prentice Hall.
Blandford, D. K., and Blelloch, G. E. (2002). Index compression through document reordering. In Data Compression Conference, pages 342–351. Snowbird, Utah.
Burrows, M., and Wheeler, D. (1994). A Block-Sorting Lossless Data Compression Algorithm. Technical Report SRC-RR-124. Digital Systems Research Center, Palo Alto, California.
Büttcher, S., and Clarke, C. L. A. (2007). Index compression is good, especially for random access. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, pages 761–770. Lisbon, Portugal.
Cleary, J. G., and Witten, I. H. (1984). Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396–402.
Cormack, G. V., and Horspool, R. N. S. (1987). Data compression using dynamic Markov modelling. The Computer Journal, 30(6):541–550.
Elias, P. (1975). Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203.
Fraenkel, A. S., and Klein, S. T. (1985). Novel compression of sparse bit-strings. In Apostolico, A., and Galil, Z., editors, Combinatorial Algorithms on Words, pages 169–183. New York: Springer.
Gallager, R. G. (1978). Variations on a theme by Huffman. IEEE Transactions on Information Theory, 24(6):668–674.
Gallager, R. G., and Van Voorhis, D. C. (1975). Optimal source codes for geometrically distributed integer alphabets. IEEE Transactions on Information Theory, 21(2):228–230.
Golomb, S. W. (1966). Run-length encodings. IEEE Transactions on Information Theory, 12:399–401.
Horibe, Y. (1977). An improved bound for weight-balanced tree. Information and Control, 34(2):148–151.
Larmore, L. L., and Hirschberg, D. S. (1990). A fast algorithm for optimal length-limited Huffman codes. Journal of the ACM, 37(3):464–473.
Martin, G. N. N. (1979). Range encoding: An algorithm for removing redundancy from a digitised message. In Proceedings of the Conference on Video and Data Recording. Southampton, England.
Moffat, A., and Stuiver, L. (2000). Binary interpolative coding for effective index compression. Information Retrieval, 3(1):25–47.
Patterson, D. A., and Hennessy, J. L. (2009). Computer Organization and Design: The Hardware/Software Interface (4th ed.). San Francisco, California: Morgan Kaufmann.
Rice, R. F., and Plaunt, J. R. (1971). Adaptive variable-length coding for efficient compression of spacecraft television data. IEEE Transactions on Communication Technology, 19(6):889–897.
Rissanen, J. (1976). Generalized Kraft inequality and arithmetic coding. IBM Journal of Research and Development, 20(3):198–203.
Rissanen, J., and Langdon, G. G. (1979). Arithmetic coding. IBM Journal of Research and Development, 23(2):149–162.
Salomon, D. (2007). Data Compression: The Complete Reference (4th ed.). London, England: Springer.
Sayood, K. (2005). Introduction to Data Compression (3rd ed.). San Francisco, California: Morgan Kaufmann.
Scholer, F., Williams, H. E., Yiannis, J., and Zobel, J. (2002). Compression of inverted indexes for fast query evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 222–229. Tampere, Finland.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell System Technical Journal, 30:50–64.
Shieh, W. Y., Chen, T. F., Shann, J. J. J., and Chung, C. P. (2003). Inverted file compression through document identifier reassignment. Information Processing & Management, 39(1):117–131.
Silvestri, F. (2007). Sorting out the document identifier assignment problem. In Proceedings of the 29th European Conference on IR Research, pages 101–112. Rome, Italy.
Szpankowski, W. (2000). Asymptotic average redundancy of Huffman (and other) block codes. IEEE Transactions on Information Theory, 46(7):2434–2443.
Trotman, A. (2003). Compressing inverted files. Information Retrieval, 6(1):5–19.
Williams, H. E., and Zobel, J. (1999). Compressing integers for fast file access. The Computer Journal, 42(3):193–201.
Witten, I. H., Moffat, A., and Bell, T. C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images (2nd ed.). San Francisco, California: Morgan Kaufmann.
Witten, I. H., Neal, R. M., and Cleary, J. G. (1987). Arithmetic coding for data compression. Communications of the ACM, 30(6):520–540.
Ziv, J., and Lempel, A. (1977). A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343.
Zobel, J., and Moffat, A. (2006). Inverted files for text search engines. ACM Computing Surveys, 38(2):1–56.

Chapter 7: Dynamic Inverted Indices

Büttcher, S. (2007). Multi-User File System Search. Ph.D. thesis, University of Waterloo, Waterloo, Canada.
Büttcher, S., and Clarke, C. L. A. (2005a). Indexing Time vs. Query Time Trade-offs in Dynamic Information Retrieval Systems. Technical Report CS-2005-31. University of Waterloo, Waterloo, Canada.
Büttcher, S., and Clarke, C. L. A. (2005b). A security model for full-text file system search in multi-user environments. In Proceedings of the 4th USENIX Conference on File and Storage Technologies, pages 169–182. San Francisco, California.
Büttcher, S., and Clarke, C. L. A. (2006). A hybrid approach to index maintenance in dynamic text retrieval systems. In Proceedings of the 28th European Conference on Information Retrieval, pages 229–240. London, England.
Büttcher, S., and Clarke, C. L. A. (2008). Hybrid index maintenance for contiguous inverted lists. Information Retrieval, 11(3):175–207.
Büttcher, S., Clarke, C. L. A., and Lushman, B. (2006). Hybrid index maintenance for growing text collections. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 356–363. Seattle, Washington.
Chiueh, T., and Huang, L. (1998). Efficient Real-Time Index Updates in Text Retrieval Systems. Technical report. SUNY at Stony Brook, Stony Brook, New York.
Cutting, D. R., and Pedersen, J. O. (1990). Optimization for dynamic inverted index maintenance. In Proceedings of the 13th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 405–411. Brussels, Belgium.
García-Molina, H., Ullman, J., and Widom, J. (2002). Database Systems: The Complete Book. Upper Saddle River, New Jersey: Prentice Hall.
Lester, N., Moffat, A., and Zobel, J. (2005). Fast on-line index construction by geometric partitioning. In Proceedings of the 14th ACM Conference on Information and Knowledge Management, pages 776–783. Bremen, Germany.
Lester, N., Zobel, J., and Williams, H. E. (2004). In-place versus re-build versus re-merge: Index maintenance strategies for text retrieval systems. In Proceedings of the 27th Australasian Computer Science Conference, pages 15–22. Dunedin, New Zealand.
Lester, N., Zobel, J., and Williams, H. E. (2006). Efficient online index maintenance for contiguous inverted lists. Information Processing & Management, 42(4):916–933.
Lim, L., Wang, M., Padmanabhan, S., Vitter, J. S., and Agarwal, R. (2003). Dynamic maintenance of web indexes using landmarks. In Proceedings of the 12th International Conference on World Wide Web, pages 102–111. Budapest, Hungary.
Shieh, W. Y., and Chung, C. P. (2005). A statistics-based approach to incrementally update inverted files. Information Processing & Management, 41(2):275–288.
Shoens, K. A., Tomasic, A., and García-Molina, H. (1994). Synthetic workload performance analysis of incremental updates. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 329–338. Dublin, Ireland.
Strohman, T. (2005). Dynamic Collections in Indri. Technical Report IR-426. University of Massachusetts Amherst, Amherst, Massachusetts.
Tomasic, A., García-Molina, H., and Shoens, K. (1994). Incremental updates of inverted lists for text document retrieval. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, pages 289–300. Minneapolis, Minnesota.
Zobel, J., Moffat, A., and Sacks-Davis, R. (1993). Storage management for files of dynamic records. In Proceedings of the 4th Australian Database Conference, pages 26–38. Brisbane, Australia.

Chapter 8: Probabilistic Retrieval

Baeza-Yates, R. A., and Ribeiro-Neto, B. (1999). Modern Information Retrieval. Reading, Massachusetts: Addison-Wesley.
Bookstein, A., and Kraft, D. (1977). Operations research applied to document indexing and retrieval decisions. Journal of the ACM, 24(3):418–427.
Bookstein, A., and Swanson, D. R. (1974). Probabilistic models for automatic indexing. Journal of the American Society for Information Science, 25(5):312–319.
Büttcher, S., Clarke, C. L. A., and Lushman, B. (2006). Term proximity scoring for ad-hoc retrieval on very large text collections. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 621–622. Seattle, Washington.
Cao, G., Nie, J. Y., Gao, J., and Robertson, S. (2008). Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 243–250. Singapore.
Church, K. W., and Gale, W. A. (1995). Inverse document frequency (IDF): A measure of deviation from Poisson. In Proceedings of the 3rd Workshop on Very Large Corpora, pages 121–130. Cambridge, Massachusetts.
Collins-Thompson, K. (2009). Reducing the risk of query expansion via robust constrained optimization. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 837–846. Hong Kong, China.
Craswell, N., Robertson, S., Zaragoza, H., and Taylor, M. (2005a). Relevance weighting for query independent evidence. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 416–423. Salvador, Brazil.
Craswell, N., Zaragoza, H., and Robertson, S. (2005b). Microsoft Cambridge at TREC 14: Enterprise track. In Proceedings of the 14th Text REtrieval Conference. Gaithersburg, Maryland.
Croft, W. B., and Harper, D. J. (1979). Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35:285–295.
de Vries, A. P., and Roelleke, T. (2005). Relevance information: A loss of entropy but a gain for IDF? In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 282–289. Salvador, Brazil.
Fuhr, N. (1992). Probabilistic models in information retrieval. The Computer Journal, 35(3):243–255.
Greiff, W. R., Croft, W. B., and Turtle, H. (1999). PIC matrices: A computationally tractable class of probabilistic query operators. ACM Transactions on Information Systems, 17(4):367–405.
Harter, S. P. (1975). A probabilistic approach to automatic keyword indexing: Part I. On the distribution of specialty words in a technical literature. Journal of the American Society for Information Science, 26:197–206.
Hawking, D., Upstill, T., and Craswell, N. (2004). Toward better weighting of anchors. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 512–513. Sheffield, England.
Lafferty, J., and Zhai, C. (2003). Probabilistic relevance models based on document and query generation. In Croft, W. B., and Lafferty, J., editors, Language Modeling for Information Retrieval, chapter 1, pages 1–10. Dordrecht, The Netherlands: Kluwer Academic Publishers.
Lee, K. S., Croft, W. B., and Allan, J. (2008). A cluster-based resampling method for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 235–242. Singapore.
Lynam, T. R., Buckley, C., Clarke, C. L. A., and Cormack, G. V. (2004). A multi-system analysis of document and term selection for blind feedback. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management, pages 261–269. Washington, D.C.
Maron, M. E., and Kuhns, J. L. (1960). On relevance, probabilistic indexing and information retrieval. Journal of the ACM, 7(3):216–244.
Ponte, J. M., and Croft, W. B. (1998). A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 275–281. Melbourne, Australia.
Rasolofo, Y., and Savoy, J. (2003). Term proximity scoring for keyword-based retrieval systems. In Proceedings of the 25th European Conference on Information Retrieval Research, pages 207–218. Pisa, Italy.
Robertson, S. (1977). The probability ranking principle in IR. Journal of Documentation, 33:294–304.
Robertson, S. (2004). Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5):503–520.
Robertson, S., and Spärck Jones, K. (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129–146.
Robertson, S., and Zaragoza, H. (2010). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 4.
Robertson, S., Zaragoza, H., and Taylor, M. (2004). Simple BM25 extension to multiple weighted fields. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management, pages 42–49. Washington, D.C.
Robertson, S. E. (1990). On term selection for query expansion. Journal of Documentation, 46(4):359–364.
Robertson, S. E., van Rijsbergen, C. J., and Porter, M. F. (1981). Probabilistic models of indexing and searching. In Oddy, R. N., Robertson, S. E., van Rijsbergen, C. J., and Williams, P. W., editors, Information Retrieval Research, chapter 4, pages 35–56. London, England: Butterworths.
Robertson, S. E., and Walker, S. (1994). Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 232–241. Dublin, Ireland.
Robertson, S. E., and Walker, S. (1997). On relevance weights with little relevance information. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 16–24. Philadelphia, Pennsylvania.
Robertson, S. E., and Walker, S. (1999). Okapi/Keenbow at TREC-8. In Proceedings of the 8th Text REtrieval Conference. Gaithersburg, Maryland.
Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., and Gatford, M. (1994). Okapi at TREC-3. In Proceedings of the 3rd Text REtrieval Conference. Gaithersburg, Maryland.
Rocchio, J. J. (1971). Relevance feedback in information retrieval. In Salton, G., editor, The SMART Retrieval System: Experiments in Automatic Document Processing, chapter 14, pages 313–323. Prentice-Hall.
Roelleke, T., and Wang, J. (2006). A parallel derivation of probabilistic information retrieval models. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 107–114. Seattle, Washington.
Roelleke, T., and Wang, J. (2008). TF-IDF uncovered: A study of theories and probabilities. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 435–442. Singapore.
Ruthven, I., and Lalmas, M. (2003). A survey on the use of relevance feedback for information access systems. Knowledge Engineering Review, 18(2):95–145.
Spärck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21.
Spärck Jones, K., Walker, S., and Robertson, S. E. (2000a). A probabilistic model of information retrieval: Development and comparative experiments – Part 1. Information Processing & Management, 36(6):779–808.
Spärck Jones, K., Walker, S., and Robertson, S. E. (2000b). A probabilistic model of information retrieval: Development and comparative experiments – Part 2. Information Processing & Management, 36(6):809–840.
Troy, A. D., and Zhang, G. Q. (2007). Enhancing relevance scoring with chronological term rank. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 599–606. Amsterdam, The Netherlands.
Turtle, H., and Croft, W. B. (1991). Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3):187–222.
Wang, X., Fang, H., and Zhai, C. (2008). A study of methods for negative relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 219–226. Singapore.
Zaragoza, H., Craswell, N., Taylor, M., Saria, S., and Robertson, S. (2004). Microsoft Cambridge at TREC 13: Web and Hard tracks. In Proceedings of the 13th Text REtrieval Conference. Gaithersburg, Maryland.

Chapter 9: Language Modeling and Related Methods

Amati, G., Carpineto, C., and Romano, G. (2003). Fondazione Ugo Bordoni at TREC 2003: Robust and Web Track. In Proceedings of the 12th Text REtrieval Conference. Gaithersburg, Maryland.
Amati, G., and van Rijsbergen, C. J. (2002). Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems, 20(4):357–389.
Berger, A., and Lafferty, J. (1999). Information retrieval as statistical translation. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 222–229. Berkeley, California.
Cao, G., Nie, J. Y., and Bai, J. (2005). Integrating word relationships into language models. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 298–305. Salvador, Brazil.
Cao, G., Nie, J. Y., Gao, J., and Robertson, S. (2008). Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 243–250. Singapore.
Chen, S. F., and Goodman, J. (1998). An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report TR-10-98. Aiken Computation Laboratory, Harvard University.
Clarke, C. L. A., Cormack, G. V., Lynam, T. R., and Terra, E. L. (2006). Question answering by passage selection. In Strzalkowski, T., and Harabagiu, S., editors, Advances in Open Domain Question Answering. Berlin, Germany: Springer.
Croft, W. B., and Lafferty, J., editors (2003). Language Modeling for Information Retrieval. Dordrecht, The Netherlands: Kluwer Academic Publishers.
Greiff, W. R., Croft, W. B., and Turtle, H. (1999). PIC matrices: A computationally tractable class of probabilistic query operators. ACM Transactions on Information Systems, 17(4):367–405.
Hiemstra, D. (2001). Using language models for information retrieval. Ph.D. thesis, University of Twente, The Netherlands.
Jelinek, F., and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice. Amsterdam, The Netherlands.
Lafferty, J., and Zhai, C. (2001). Document language models, query models, and risk minimization for information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 111–119. New Orleans, Louisiana.
Lavrenko, V., and Croft, W. B. (2001). Relevance based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 120–127. New Orleans, Louisiana.
Lv, Y., and Zhai, C. (2009). Positional language models for information retrieval. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 299–306. Boston, Massachusetts.
Manning, C. D., and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts: MIT Press.
Metzler, D., and Croft, W. B. (2004). Combining the language model and inference network approaches to retrieval. Information Processing & Management, 40(5):735–750.
Miller, D. R. H., Leek, T., and Schwartz, R. M. (1999). A hidden Markov model information retrieval system. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 214–221. Berkeley, California.
Plachouras, V., Ounis, I., Amati, G., and van Rijsbergen, C. J. (2002). University of Glasgow at the Web Track of TREC 2002. In Proceedings of the 11th Text REtrieval Conference. Gaithersburg, Maryland.
Ponte, J. M., and Croft, W. B. (1998). A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 275–281. Melbourne, Australia.
Song, F., and Croft, W. B. (1999). A general language model for information retrieval. In Proceedings of the 8th International Conference on Information and Knowledge Management, pages 316–321. Kansas City, Missouri.
Tao, T., and Zhai, C. (2007). An exploration of proximity measures in information retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 295–302. Amsterdam, The Netherlands.
Tellex, S., Katz, B., Lin, J., Fernandes, A., and Marton, G. (2003). Quantitative evaluation of passage retrieval algorithms for question answering. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Toronto, Canada.
Turtle, H., and Croft, W. B. (1991). Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3):187–222.
Zhai, C. (2008a). Statistical Language Models for Information Retrieval. Synthesis Lectures on Human Language Technologies: Morgan & Claypool.
Zhai, C. (2008b). Statistical language models for information retrieval: A critical review. Foundations and Trends in Information Retrieval, 2.
Zhai, C., and Lafferty, J. (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 334–342. New Orleans, Louisiana.
Zhai, C., and Lafferty, J. (2004). A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2):179–214.
Zhao, J., and Yun, Y. (2009). A proximity language model for information retrieval. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 291–298. Boston, Massachusetts.

Chapter 10: Categorization and Filtering

Belkin, N. J., and Croft, W. B. (1992). Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35(12):29–38.
Bratko, A., Cormack, G. V., Filipic, B., Lynam, T. R., and Zupan, B. (2006). Spam filtering using statistical data compression models. Journal of Machine Learning Research, 7:2673–2698.
Callan, J. (1998). Learning while filtering documents. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 224–231. Melbourne, Australia.
Cleary, J. G., and Witten, I. H. (1984). Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396–402.
Cormack, G. V. (2007). TREC 2007 Spam Track overview. In Proceedings of the 16th Text REtrieval Conference. Gaithersburg, Maryland.
Cormack, G. V. (2008). Email spam filtering: A systematic review. Foundations and Trends in Information Retrieval, 1(4):335–455.
Cormack, G. V., and Horspool, R. N. S. (1987). Data compression using dynamic Markov modelling. The Computer Journal, 30(6):541–550.
Cormack, G. V., and Lynam, T. R. (2005). TREC 2005 Spam Track overview. In Proceedings of the 14th Text REtrieval Conference. Gaithersburg, Maryland.
Domingos, P., and Pazzani, M. J. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103–130.
Drucker, H., Wu, D., and Vapnik, V. N. (1999). Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5):1048–1054.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874.
Glas, A. S., Lijmer, J. G., Prins, M. H., Bonsel, G. J., and Bossuyt, P. M. M. (2003). The diagnostic odds ratio: A single indicator of test performance. Journal of Clinical Epidemiology, 56(11):1129–1135.
Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). The Elements of Statistical Learning (2nd ed.). Berlin, Germany: Springer.
Hosmer, D. W., and Lemeshow, S. (2000). Applied Logistic Regression (2nd ed.). New York: Wiley-Interscience.
Joachims, T. (2002). Learning to Classify Text Using Support Vector Machines. Norwell, Massachusetts: Kluwer Academic.
Komarek, P., and Moore, A. (2003). Fast robust logistic regression for large sparse datasets with binary outputs. In Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics. Key West, Florida.
Lewis, D. D. (1991). Evaluating text categorization. In Human Language Technologies Conference: Proceedings of the Workshop on Speech and Natural Language, pages 312–318. Pacific Grove, California.
Lewis, D. D., and Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the 11th International Conference on Machine Learning, pages 148–156.
Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397.
McNamee, P. (2005). Language identification: A solved problem suitable for undergraduate instruction. Journal of Computing Sciences in Colleges, 20(3):94–101.
Mitchell, T. M. (1997). Machine Learning. Boston, Massachusetts: WCB/McGraw-Hill.
Robertson, S. (2002). Threshold setting and performance optimization in adaptive filtering. Information Retrieval, 5(2-3):239–256.
Robertson, S., and Callan, J. (2005). Routing and filtering. In Voorhees, E. M., and Harman, D. K., editors, TREC — Experiment and Evaluation in Information Retrieval, chapter 5, pages 99–122. Cambridge, Massachusetts: MIT Press.
Rocchio, J. J. (1971). Relevance feedback in information retrieval. In Salton, G., editor, The SMART Retrieval System: Experiments in Automatic Document Processing, chapter 14, pages 313–323. Prentice-Hall.
Sculley, D. (2007). Practical learning from one-sided feedback. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 609–618. San Jose, California.
Sculley, D., and Wachman, G. M. (2007). Relaxed online support vector machines for spam filtering. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 415–422. Amsterdam, The Netherlands.
Sculley, D., Wachman, G. M., and Brodley, C. E. (2006). Spam classification with on-line linear classifiers and inexact string matching features. In Proceedings of the 15th Text REtrieval Conference. Gaithersburg, Maryland.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.
Siefkes, C., Assis, F., Chhabra, S., and Yerazunis, W. S. (2004). Combining Winnow and orthogonal sparse bigrams for incremental spam filtering. In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 410–421. Pisa, Italy.
Swets, J. A. (1963). Information retrieval systems. Science, 141(3577):245–250.
Swets, J. A. (1969). Effectiveness of information retrieval systems. American Documentation, 20:72–89.
van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). London, England: Butterworths.
Willems, F. M. J., Shtarkov, Y. M., and Tjalkens, T. J. (1995). The context-tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41:653–664.
Witten, I. H., and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (2nd ed.). San Francisco, California: Morgan Kaufmann.

Chapter 11: Fusion and Metalearning

Agresti, A. (2007). An Introduction to Categorical Data Analysis (2nd ed.). New York: Wiley-Interscience.
Belkin, N., Kantor, P., Fox, E., and Shaw, J. (1995). Combining the evidence of multiple query representations for information retrieval. Information Processing & Management, 31(3):431–448.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.
Burges, C. J. C., Ragno, R., and Le, Q. V. (2006). Learning to rank with nonsmooth cost functions. In Proceedings of the 20th Annual Conference on Neural Information Processing Systems, pages 193–200. Vancouver, Canada.
Burges, C. J. C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. (2005). Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, pages 89–96. Bonn, Germany.
Cao, Z., Qin, T., Liu, T. Y., Tsai, M. F., and Li, H. (2007). Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136. Corvallis, Oregon.
Cormack, G. V., Clarke, C. L. A., and Büttcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 758–759. Boston, Massachusetts.
Crammer, K., and Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292.
Efron, B., and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Boca Raton, Florida: Chapman & Hall/CRC.
Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969.
Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). The Elements of Statistical Learning (2nd ed.). Berlin, Germany: Springer.
Herbrich, R., Graepel, T., and Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In Bartlett, P. J., Schölkopf, B., Schuurmans, D., and Smola, A. J., editors, Advances in Large Margin Classifiers, chapter 7, pages 115–132. Cambridge, Massachusetts: MIT Press.
Hosmer, D. W., and Lemeshow, S. (2000). Applied Logistic Regression (2nd ed.). New York: Wiley-Interscience.
Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142. Edmonton, Canada.
Joachims, T., Granka, L., Pan, B., Hembrooke, H., and Gay, G. (2005). Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 154–161. Salvador, Brazil.
Lee, J. H. (1997). Analyses of multiple evidence combination. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–276. Philadelphia, Pennsylvania.
Li, P., Burges, C., and Wu, Q. (2007). McRank: Learning to rank using multiple classification and gradient boosting. In Proceedings of the 21st Annual Conference on Neural Information Processing Systems, pages 897–904. Vancouver, Canada.
Liu, T. Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331.
Liu, T. Y., Xu, J., Qin, T., Xiong, W., and Li, H. (2007). LETOR: Benchmark dataset for research on learning to rank for information retrieval. In Proceedings of SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, pages 481–490. Amsterdam, The Netherlands.
Lynam, T. R., and Cormack, G. V. (2006). On-line spam filter fusion. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 123–130. Seattle, Washington.
Meng, W., Yu, C., and Liu, K. L. (2002). Building efficient and effective metasearch engines. ACM Computing Surveys, 34(1):48–89.
Montague, M., and Aslam, J. A. (2002). Condorcet fusion for improved retrieval. In Proceedings of the 11th International Conference on Information and Knowledge Management, pages 538–548. McLean, Virginia.
Schapire, R. (2003). The boosting approach to machine learning: An overview. In Denison, D. D., Hansen, M. H., Holmes, C. C., Mallick, B., and Yu, B., editors, Nonlinear Estimation and Classification, volume 171 of Lecture Notes in Statistics, pages 149–172. Berlin, Germany: Springer.
Schapire, R., and Singer, Y. (2000). BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2):135–168.
Surowiecki, J. (2004). The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. New York: Doubleday.
Svore, K. M., and Burges, C. J. (2009). A machine learning approach for improved BM25 retrieval. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 1811–1814. Hong Kong, China.
Vogt, C., and Cottrell, G. (1999). Fusion via a linear combination of scores. Information Retrieval, 1(3):151–173.
Voorhees, E. M., Gupta, N. K., and Johnson-Laird, B. (1995). Learning collection fusion strategies. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 172–179. Seattle, Washington.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5:241–259.
Xu, J., and Li, H. (2007). AdaRank: A boosting algorithm for information retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 391–398. Amsterdam, The Netherlands.
Yilmaz, E., and Robertson, S. (2010). On the choice of effectiveness measures for learning to rank. Information Retrieval.

Chapter 12: Measuring Effectiveness

Agrawal, R., Gollapudi, S., Halverson, A., and Ieong, S. (2009). Diversifying search results. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, pages 5–14. Barcelona, Spain.
Ahlgren, P., and Grönqvist, L. (2008). Evaluation of retrieval effectiveness with incomplete relevance data: Theoretical and experimental comparison of three measures. Information Processing & Management, 44(1):212–225.
Al-Maskari, A., Sanderson, M., and Clough, P. (2007). The relationship between IR effectiveness measures and user satisfaction. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 773–774. Amsterdam, The Netherlands.
Amitay, E., Carmel, D., Lempel, R., and Soffer, A. (2004). Scaling IR-system evaluation using term relevance sets. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 10–17. Sheffield, England.
Aslam, J. A., Pavlu, V., and Yilmaz, E. (2006). A statistical method for system evaluation using incomplete judgments. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 541–548. Seattle, Washington.
Aslam, J. A., Yilmaz, E., and Pavlu, V. (2005). The maximum entropy method for analyzing retrieval measures. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 27–34. Salvador, Brazil.
Bernstein, Y., and Zobel, J. (2005). Redundant documents and search effectiveness. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 736–743. Bremen, Germany.
Boyce, B. (1982). Beyond topicality: A two stage view of relevance and the retrieval process. Information Processing & Management, 18(3):105–109.
Buckley, C., and Voorhees, E. (2005). Retrieval system evaluation. In Voorhees, E. M., and Harman, D. K., editors, TREC — Experiment and Evaluation in Information Retrieval, chapter 3, pages 53–75. Cambridge, Massachusetts: MIT Press.
Buckley, C., and Voorhees, E. M. (2004). Retrieval evaluation with incomplete information. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 25–32. Sheffield, England.
Burges, C. J. C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. (2005). Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, pages 89–96. Bonn, Germany.
Büttcher, S., Clarke, C. L. A., Yeung, P. C. K., and Soboroff, I. (2007). Reliable information retrieval evaluation with incomplete and biased judgements. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 63–70. Amsterdam, The Netherlands.
Carbonell, J., and Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335–336. Melbourne, Australia.
Carterette, B. (2007). Robust test collections for retrieval evaluation. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 55–62. Amsterdam, The Netherlands.
Carterette, B. (2009a). An analysis of NP-completeness in novelty and diversity ranking. In Proceedings of the 2nd International Conference on the Theory of Information Retrieval, pages 200–211. Cambridge, England.
Carterette, B. (2009b). On rank correlation and the distance between rankings. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 436–443. Boston, Massachusetts.
Carterette, B., Allan, J., and Sitaraman, R. (2006). Minimal test collections for retrieval evaluation. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 268–275. Seattle, Washington.
Chapelle, O., Metzler, D., Zhang, Y., and Grinspan, P. (2009). Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 621–630. Hong Kong, China.
Chen, H., and Karger, D. R. (2006). Less is more: Probabilistic models for retrieving fewer relevant documents. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 429–436. Seattle, Washington.
Clarke, C. L. A., Kolla, M., Cormack, G. V., Vechtomova, O., Ashkan, A., Büttcher, S., and MacKinnon, I. (2008). Novelty and diversity in information retrieval evaluation. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–666. Singapore.
Clarke, C. L. A., Kolla, M., and Vechtomova, O. (2009). An effectiveness measure for ambiguous and underspecified queries. In Proceedings of the 2nd International Conference on the Theory of Information Retrieval, pages 188–199. Cambridge, England.
Cleverdon, C. W. (1967). The Cranfield tests on index language devices. Aslib Proceedings, 19(6):173–193. Reprinted as Cleverdon (1997).
Cleverdon, C. W. (1997). The Cranfield tests on index language devices. In Readings in Information Retrieval, pages 47–59. San Francisco, California: Morgan Kaufmann.
Cormack, G. V., and Lynam, T. R. (2006). Statistical precision of information retrieval evaluation. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 533–540. Seattle, Washington.
Cormack, G. V., and Lynam, T. R. (2007). Power and bias of subset pooling strategies. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 837–838. Amsterdam, The Netherlands.
Cormack, G. V., Palmer, C. R., and Clarke, C. L. A. (1998). Efficient construction of large test collections. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 282–289. Melbourne, Australia.
Custis, T., and Al-Kofahi, K. (2007). A new approach for evaluating query expansion: Query-document term mismatch. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 575–582. Amsterdam, The Netherlands.
De Angelis, C., Drazen, J., Frizelle, F., Haug, C., Hoey, J., Horton, R., Kotzin, S., Laine, C., Marusic, A., Overbeke, A., et al. (2004). Clinical trial registration: A statement from the International Committee of Medical Journal Editors. Journal of the American Medical Association, 292(11):1363–1364.
Efron, B., and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Boca Raton, Florida: Chapman & Hall/CRC.
Fisher, R. A. (1925). Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society, 22:700–725.
Gardner, M. J., and Altman, D. G. (1986). Confidence intervals rather than p values: Estimation rather than hypothesis testing. British Medical Journal, 292(6522):746–750.
Goffman, W. (1964). A searching procedure for information retrieval. Information Storage and Retrieval, 2:73–78.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65–70.
Järvelin, K., and Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446.
Kelly, D., Fu, X., and Shah, C. (2007). Effects of Rank and Precision of Search Results on Users' Evaluations of System Performance. Technical Report 2007-02. University of North Carolina, Chapel Hill.
Lenhard, J. (2006). Models and statistical inference: The controversy between Fisher and Neyman-Pearson. British Journal for the Philosophy of Science, 57(1).
Moffat, A., Webber, W., and Zobel, J. (2007). Strategic system comparisons via targeted relevance judgments. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 375–382. Amsterdam, The Netherlands.
Moffat, A., and Zobel, J. (2008). Rank-biased precision for measurement of retrieval effectiveness. ACM Transactions on Information Systems, 27(1):1–27.
Najork, M. A., Zaragoza, H., and Taylor, M. J. (2007). HITS on the Web: How does it compare? In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 471–478. Amsterdam, The Netherlands.
Robertson, S. (2006). On GMAP – and other transformations. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pages 78–83. Arlington, Virginia.
Sakai, T., and Kando, N. (2008). On information retrieval metrics designed for evaluation with incomplete relevance assessments. Information Retrieval, 11(5):447–470.
Sanderson, M., and Joho, H. (2004). Forming test collections with no system pooling. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 33–40. Sheffield, England.
Sanderson, M., and Zobel, J. (2005). Information retrieval system evaluation: Effort, sensitivity, and reliability. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 162–169. Salvador, Brazil.
Savoy, J. (1997). Statistical inference in retrieval effectiveness evaluation. Information Processing & Management, 33(4):495–512.
Shah, C., and Croft, W. B. (2004). Evaluating high accuracy retrieval techniques. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2–9. Sheffield, England.
Smucker, M., Allan, J., and Carterette, B. (2007). A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, pages 623–632. Lisbon, Portugal.
Soboroff, I., Nicholas, C., and Cahan, P. (2001). Ranking retrieval systems without relevance judgments. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 66–73. New Orleans, Louisiana.
Spärck Jones, K., Robertson, S. E., and Sanderson, M. (2007). Ambiguous requests: Implications for retrieval tests. ACM SIGIR Forum, 41(2):8–17.
Thomas, L. (1997). Retrospective power analysis. Conservation Biology, 11(1):276–280.
Turpin, A., and Scholer, F. (2006). User performance versus precision measures for simple search tasks. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 11–18. Seattle, Washington.
van Zwol, R., Murdock, V., Garcia Pueyo, L., and Ramirez, G. (2008). Diversifying image search with user generated content. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, pages 67–74. Vancouver, Canada.
Vee, E., Srivastava, U., Shanmugasundaram, J., Bhat, P., and Amer-Yahia, S. (2008). Efficient computation of diverse query results. In Proceedings of the 24th IEEE International Conference on Data Engineering, pages 228–236. Cancun, Mexico.
Voorhees, E., and Harman, D. (1999). Overview of the eighth text retrieval conference. In Proceedings of the 8th Text REtrieval Conference, pages 1–24. Gaithersburg, Maryland.
Voorhees, E. M. (2000). Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing & Management, 36(5):697–716.
Voorhees, E. M. (2004). Overview of the TREC 2004 Robust Track. In Proceedings of the 13th Text REtrieval Conference. Gaithersburg, Maryland.
Voorhees, E. M., and Dang, H. T. (2005). Overview of the TREC 2005 Question Answering Track. In Proceedings of the 14th Text REtrieval Conference. Gaithersburg, Maryland.
Voorhees, E. M., and Harman, D. K. (2005). The Text REtrieval Conference. In Voorhees, E. M., and Harman, D. K., editors, TREC: Experiment and Evaluation in Information Retrieval, chapter 1, pages 3–20. Cambridge, Massachusetts: MIT Press.
Webber, W., Moffat, A., and Zobel, J. (2008a). Score standardization for inter-collection comparison of retrieval systems. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 51–58. Singapore.
Webber, W., Moffat, A., and Zobel, J. (2008b). Statistical power in retrieval experimentation. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 571–580. Napa, California.
Yilmaz, E., and Aslam, J. A. (2008). Estimating average precision when judgments are incomplete. International Journal of Knowledge and Information Systems, 16(2):173–211.
Yilmaz, E., Kanoulas, E., and Aslam, J. A. (2008). A simple and efficient sampling method for estimating AP and NDCG. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 603–610. Singapore.
Zhai, C., Cohen, W. W., and Lafferty, J. (2003). Beyond independent relevance: Methods and evaluation metrics for subtopic retrieval. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 10–17. Toronto, Canada.
Zobel, J. (1998). How reliable are the results of large-scale information retrieval experiments? In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 307–314. Melbourne, Australia.

Chapter 13: Measuring Efficiency

Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Plachouras, V., and Silvestri, F. (2007). The impact of caching on search engines. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 183–190. Amsterdam, The Netherlands.
Beitzel, S. M., Jensen, E. C., Chowdhury, A., Grossman, D., and Frieder, O. (2004). Hourly analysis of a very large topically categorized Web query log. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 321–328. Sheffield, England.
Büttcher, S., Clarke, C. L. A., and Soboroff, I. (2006). The TREC 2006 Terabyte Track. In Proceedings of the 15th Text REtrieval Conference. Gaithersburg, Maryland.
Cao, P., and Irani, S. (1997). Cost-aware WWW proxy caching algorithms. In Proceedings of the 1997 USENIX Symposium on Internet Technologies and Systems, pages 193–206. Monterey, California.
Clarke, C. L. A., Scholer, F., and Soboroff, I. (2005). The TREC 2005 Terabyte Track. In Proceedings of the 14th Text REtrieval Conference. Gaithersburg, Maryland.
Fagni, T., Perego, R., Silvestri, F., and Orlando, S. (2006). Boosting the performance of web search engines: Caching and prefetching query results by exploiting historical usage data. ACM Transactions on Information Systems, 24(1):51–78.
Garcia, S. (2007). Search Engine Optimisation Using Past Queries. Ph.D. thesis, RMIT University, Melbourne, Australia.
Gross, D., Shortle, J., Thompson, J., and Harris, C. (2008). Fundamentals of Queueing Theory (4th ed.). New York: Wiley-Interscience.
Harrison, P. G. (1993). Response time distributions in queueing network models. In Performance Evaluation of Computer and Communication Systems, Joint Tutorial Papers of Performance '93 and Sigmetrics '93, pages 147–164. Santa Clara, California.
Hawking, D., Craswell, N., and Thistlewaite, P. (1998). Overview of TREC-7 very large collection track. In Proceedings of the 7th Text REtrieval Conference. Gaithersburg, Maryland.
Hawking, D., and Thistlewaite, P. (1997). Overview of TREC-6 very large collection track. In Proceedings of the 6th Text REtrieval Conference, pages 93–106. Gaithersburg, Maryland.
Kleinrock, L. (1975). Queueing Systems. Volume 1: Theory. New York: Wiley-Interscience.
Lilja, D. J. (2000). Measuring Computer Performance: A Practitioner's Guide. New York: Cambridge University Press.
Little, J. D. C. (1961). A proof for the queueing formula L=λW. Operations Research, 9(3):383–387.
Long, X., and Suel, T. (2005). Three-level caching for efficient query processing in large web search engines. In Proceedings of the 14th International Conference on World Wide Web, pages 257–266. Chiba, Japan.
Saraiva, P. C., de Moura, E. S., Ziviani, N., Meira, W., Fonseca, R., and Ribeiro-Neto, B. (2001). Rank-preserving two-level caching for scalable search engines. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 51–58. New Orleans, Louisiana.
Turpin, A., Tsegay, Y., Hawking, D., and Williams, H. E. (2007). Fast generation of result snippets in web search. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 127–134. Amsterdam, The Netherlands.
Zhang, J., Long, X., and Suel, T. (2008). Performance of compressed inverted list caching in search engines. In Proceedings of the 17th International Conference on World Wide Web, pages 387–396. Beijing, China.

Chapter 14: Parallel Information Retrieval

Abusukhon, A., Talib, M., and Oakes, M. P. (2008). An investigation into improving the load balance for term-based partitioning. In Proceedings of the 2nd International United Information Systems Conference, pages 380–392. Klagenfurt, Austria.
Barroso, L. A., Dean, J., and Hölzle, U. (2003). Web search for a planet: The Google cluster architecture. IEEE Micro, 23(2):22–28.
Carbonell, J. G., and Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335–336. Melbourne, Australia.
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. (2008). Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems, 26(2):1–26.
Clarke, C. L. A., and Terra, E. L. (2004). Approximating the top-m passages in a parallel question answering system. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management, pages 454–462. Washington, D.C.
Dean, J., and Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating System Design and Implementation, pages 137–150. San Francisco, California.
Dean, J., and Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113.
Ding, S., He, J., Yan, H., and Suel, T. (2009). Using graphics processors for high performance IR query processing. In Proceedings of the 18th International Conference on World Wide Web, pages 421–430. Madrid, Spain.
Ghemawat, S., Gobioff, H., and Leung, S. T. (2003). The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, pages 29–43. Bolton Landing, New York.
Govindaraju, N., Gray, J., Kumar, R., and Manocha, D. (2006). GPUTeraSort: High performance graphics co-processor sorting for large database management. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 325–336. Chicago, Illinois.
Marín, M., and Gil-Costa, V. (2007). High-performance distributed inverted files. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, pages 935–938. Lisbon, Portugal.
Marín, M., and Navarro, G. (2003). Distributed query processing using suffix arrays. In Proceedings of the 10th International Symposium on String Processing and Information Retrieval, pages 311–325. Manaus, Brazil.
Moffat, A., Webber, W., and Zobel, J. (2006). Load balancing for term-distributed parallel retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 348–355. Seattle, Washington.
Moffat, A., Webber, W., Zobel, J., and Baeza-Yates, R. (2007). A pipelined architecture for distributed text query evaluation. Information Retrieval, 10(3):205–231.
Puppin, D., Silvestri, F., and Laforenza, D. (2006). Query-driven document partitioning and collection selection. In Proceedings of the 1st International Conference on Scalable Information Systems. Hong Kong, China.
Sintorn, E., and Assarsson, U. (2008). Fast parallel GPU-sorting using a hybrid algorithm. Journal of Parallel and Distributed Computing, 68(10):1381–1388.
Xi, W., Sornil, O., Luo, M., and Fox, E. A. (2002). Hybrid partition inverted files: Experimental validation. In Proceedings of the 6th European Conference on Research and Advanced Technology for Digital Libraries, pages 422–431. Rome, Italy.

Chapter 15: Web Search

Agichtein, E., Brill, E., and Dumais, S. (2006a). Improving Web search ranking by incorporating user behavior information. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 19–26. Seattle, Washington.
Agichtein, E., Brill, E., Dumais, S., and Ragno, R. (2006b). Learning user interaction models for predicting Web search result preferences. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3–10. Seattle, Washington.
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006). Generalizing PageRank: Damping functions for link-based ranking algorithms. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 308–315. Seattle, Washington.
Bernstein, Y., and Zobel, J. (2005). Redundant documents and search effectiveness. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 736–743. Bremen, Germany.
Bharat, K., and Broder, A. (1998). A technique for measuring the relative size and overlap of public Web search engines. In Proceedings of the 7th International World Wide Web Conference, pages 379–388. Brisbane, Australia.
Bharat, K., and Henzinger, M. R. (1998). Improved algorithms for topic distillation in a hyperlinked environment. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104–111. Melbourne, Australia.
Bianchini, M., Gori, M., and Scarselli, F. (2005). Inside PageRank. ACM Transactions on Internet Technology, 5(1):92–128.
Borodin, A., Roberts, G. O., Rosenthal, J. S., and Tsaparas, P. (2001). Finding authorities and hubs from link structures on the World Wide Web. In Proceedings of the 10th International World Wide Web Conference, pages 415–429. Hong Kong, China.
Brin, S., Motwani, R., Page, L., and Winograd, T. (1998). What can you do with a Web in your pocket? Data Engineering Bulletin, 21(2):37–47.
Brin, S., and Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th International World Wide Web Conference, pages 107–117. Brisbane, Australia.
Broder, A. (2002). A taxonomy of Web search. ACM SIGIR Forum, 36(2):3–10.
Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G. (1997). Syntactic clustering of the Web. In Proceedings of the 6th International World Wide Web Conference, pages 1157–1166. Santa Clara, California.
Büttcher, S., Clarke, C. L. A., and Soboroff, I. (2006). The TREC 2006 Terabyte Track. In Proceedings of the 15th Text REtrieval Conference. Gaithersburg, Maryland.
Carrière, J., and Kazman, R. (1997). WebQuery: Searching and visualizing the Web through connectivity. In Proceedings of the 6th International World Wide Web Conference, pages 1257–1267.
Carterette, B., and Jones, R. (2007). Evaluating search engines by modeling the relationship between relevance and clicks. In Proceedings of the 21st Annual Conference on Neural Information Processing Systems. Vancouver, Canada.
Chakrabarti, S. (2007). Dynamic personalized PageRank in entity-relation graphs. In Proceedings of the 16th International World Wide Web Conference. Banff, Canada.
Chakrabarti, S., Dom, B., Raghavan, P., Rajagopalan, S., Gibson, D., and Kleinberg, J. (1998). Automatic resource list compilation by analyzing hyperlink structure and associated text. In Proceedings of the 7th International World Wide Web Conference. Brisbane, Australia.
Chakrabarti, S., van den Burg, M., and Dom, B. (1999). Focused crawling: A new approach to topic-specific Web resource discovery. In Proceedings of the 8th International World Wide Web Conference, pages 545–562. Toronto, Canada.
Charikar, M. S. (2002). Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing, pages 380–388. Montreal, Canada.
Cho, J., and Garcia-Molina, H. (2000). The evolution of the Web and implications for an incremental crawler. In Proceedings of the 26th International Conference on Very Large Data Bases, pages 200–209.
Cho, J., and Garcia-Molina, H. (2003). Effective page refresh policies for Web crawlers. ACM Transactions on Database Systems, 28(4):390–426.
Chung, C., and Clarke, C. L. A. (2002). Topic-oriented collaborative crawling. In Proceedings of the 11th International Conference on Information and Knowledge Management, pages 34–42. McLean, Virginia.
Clarke, C. L. A., Agichtein, E., Dumais, S., and White, R. W. (2007). The influence of caption features on clickthrough patterns in Web search. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 135–142. Amsterdam, The Netherlands.
Clarke, C. L. A., Scholer, F., and Soboroff, I. (2005). The TREC 2005 Terabyte Track. In Proceedings of the 14th Text REtrieval Conference. Gaithersburg, Maryland.
Cohn, D., and Chang, H. (2000). Learning to probabilistically identify authoritative documents. In Proceedings of the 17th International Conference on Machine Learning, pages 167–174.
Craswell, N., and Hawking, D. (2004). Overview of the TREC 2004 Web Track. In Proceedings of the 13th Text REtrieval Conference. Gaithersburg, Maryland.
Craswell, N., Hawking, D., and Robertson, S. (2001). Effective site finding using link anchor information. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 250–257. New Orleans, Louisiana.
Craswell, N., Robertson, S., Zaragoza, H., and Taylor, M. (2005). Relevance weighting for query independent evidence. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 416–423. Salvador, Brazil.
Cucerzan, S., and Brill, E. (2004). Spelling correction as an iterative process that exploits the collective knowledge of Web users. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 293–300.
Dasgupta, A., Ghosh, A., Kumar, R., Olston, C., Pandey, S., and Tomkins, A. (2007). The discoverability of the Web. In Proceedings of the 16th International World Wide Web Conference. Banff, Canada.
Davison, B. D. (2000). Recognizing nepotistic links on the Web. In Proceedings of the AAAI-2000 Workshop on Artificial Intelligence for Web Search, pages 23–28.
Dupret, G., Murdock, V., and Piwowarski, B. (2007). Web search engine evaluation using clickthrough data and a user model. In Proceedings of the 16th International World Wide Web Conference Workshop on Query Log Analysis: Social and Technological Challenges. Banff, Canada.
Edwards, J., McCurley, K., and Tomlin, J. (2001). An adaptive model for optimizing performance of an incremental Web crawler. In Proceedings of the 10th International World Wide Web Conference, pages 106–113. Hong Kong, China.
Golub, G. H., and Van Loan, C. F. (1996). Matrix Computations (3rd ed.). Baltimore, Maryland: Johns Hopkins University Press.
Gulli, A., and Signorini, A. (2005). The indexable Web is more than 11.5 billion pages. In Proceedings of the 14th International World Wide Web Conference. Chiba, Japan.
Gyöngyi, Z., and Garcia-Molina, H. (2005). Spam: It's not just for inboxes anymore. Computer, 38(10):28–34.
Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases, pages 576–584.
Haveliwala, T., and Kamvar, S. (2003). The Second Eigenvalue of the Google Matrix. Technical Report 2003-20. Stanford University.
Haveliwala, T. H. (2002). Topic-sensitive PageRank. In Proceedings of the 11th International World Wide Web Conference. Honolulu, Hawaii.
Hawking, D., and Craswell, N. (2001). Overview of the TREC-2001 Web Track. In Proceedings of the 10th Text REtrieval Conference. Gaithersburg, Maryland.
Hawking, D., Upstill, T., and Craswell, N. (2004). Toward better weighting of anchors. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 512–513. Sheffield, England.
Henzinger, M. (2006). Finding near-duplicate Web pages: A large-scale evaluation of algorithms. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 284–291. Seattle, Washington.
Heydon, A., and Najork, M. (1999). Mercator: A scalable, extensible web crawler. World Wide Web, 2(4):219–229.
Ivory, M. Y., and Hearst, M. A. (2002). Statistical profiles of highly-rated Web sites. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 367–374. Minneapolis, Minnesota.
Jansen, B. J., Booth, D., and Spink, A. (2007). Determining the user intent of Web search engine queries. In Proceedings of the 16th International World Wide Web Conference, pages 1149–1150. Banff, Canada.
Jeh, G., and Widom, J. (2003). Scaling personalized Web search. In Proceedings of the 12th International World Wide Web Conference, pages 271–279. Budapest, Hungary.
Joachims, T., Granka, L., Pan, B., Hembrooke, H., and Gay, G. (2005). Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 154–161. Salvador, Brazil.
Joachims, T., and Radlinski, F. (2007). Search engines that learn from implicit feedback. IEEE Computer, 40(8):34–40.
Jones, R., Rey, B., Madani, O., and Greiner, W. (2006). Generating query substitutions. In Proceedings of the 15th International World Wide Web Conference, pages 387–396. Edinburgh, Scotland.
Kamvar, S. D., Haveliwala, T. H., Manning, C. D., and Golub, G. H. (2003). Extrapolation methods for accelerating PageRank computations. In Proceedings of the 12th International World Wide Web Conference, pages 261–270. Budapest, Hungary.
Kellar, M., Watters, C., and Shepherd, M. (2007). A field study characterizing web-based information-seeking tasks. Journal of the American Society for Information Science and Technology, 58(7):999–1018.
Kelly, D., and Teevan, J. (2003). Implicit feedback for inferring user preference: A bibliography. ACM SIGIR Forum, 37(2):18–28.
Kleinberg, J. M. (1998). Authoritative sources in a hyperlinked environment. In Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 668–677. San Francisco, California.
Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632.
Langville, A. N., and Meyer, C. D. (2005). A survey of eigenvector methods of Web information retrieval. SIAM Review, 47(1):135–161.
Langville, A. N., and Meyer, C. D. (2006). Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton, New Jersey: Princeton University Press.
Lawrence, S., and Giles, C. L. (1998). Searching the World Wide Web. Science, 280:98–100.
Lawrence, S., and Giles, C. L. (1999). Accessibility of information on the Web. Nature, 400:107–109.
Lee, U., Liu, Z., and Cho, J. (2005). Automatic identification of user goals in Web search. In Proceedings of the 14th International World Wide Web Conference, pages 391–400. Chiba, Japan.
Lempel, R., and Moran, S. (2000). The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks, 33(1-6):387–401.
Liu, Y., Fu, Y., Zhang, M., Ma, S., and Ru, L. (2007). Automatic search engine performance evaluation with click-through data analysis. In Proceedings of the 16th International World Wide Web Conference Workshop on Query Log Analysis: Social and Technological Challenges, pages 1133–1134. Banff, Canada.
Marchiori, M. (1997). The quest for correct information on the Web: Hyper search engines. In Proceedings of the 6th International World Wide Web Conference. Santa Clara, California.
Metzler, D., Strohman, T., and Croft, W. (2006). Indri TREC notebook 2006: Lessons learned from three Terabyte Tracks. In Proceedings of the 15th Text REtrieval Conference. Gaithersburg, Maryland.
Najork, M., and Wiener, J. L. (2001). Breadth-first search crawling yields high-quality pages. In Proceedings of the 10th International World Wide Web Conference. Hong Kong, China.
Najork, M. A. (2007). Comparing the effectiveness of HITS and SALSA. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, pages 157–164. Lisbon, Portugal.
Najork, M. A., Zaragoza, H., and Taylor, M. J. (2007). HITS on the Web: How does it compare? In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 471–478. Amsterdam, The Netherlands.
Ng, A. Y., Zheng, A. X., and Jordan, M. I. (2001a). Link analysis, eigenvectors and stability. In Proceedings of the 17th International Joint Conference on Artificial Intelligence, pages 903–910. Seattle, Washington.
Ng, A. Y., Zheng, A. X., and Jordan, M. I. (2001b). Stable algorithms for link analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 258–266. New Orleans, Louisiana.
Ntoulas, A., Cho, J., and Olston, C. (2004). What's new on the Web?: The evolution of the web from a search engine perspective. In Proceedings of the 13th International World Wide Web Conference, pages 1–12.
Olston, C., and Najork, M. (2010). Web crawling. Foundations and Trends in Information Retrieval.
Olston, C., and Pandey, S. (2008). Recrawl scheduling based on information longevity. In Proceedings of the 17th International World Wide Web Conference, pages 437–446. Beijing, China.
Page, L., Brin, S., Motwani, R., and Winograd, T. (1999). The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66. Stanford InfoLab.
Pandey, S., and Olston, C. (2008). Crawl ordering by search impact. In Proceedings of the 1st ACM International Conference on Web Search and Data Mining. Palo Alto, California.
Qiu, F., and Cho, J. (2006). Automatic identification of user interest for personalized search. In Proceedings of the 15th International World Wide Web Conference, pages 727–736. Edinburgh, Scotland.
Rafiei, D., and Mendelzon, A. O. (2000). What is this page known for? Computing Web page reputations. In Proceedings of the 9th International World Wide Web Conference, pages 823–835. Amsterdam, The Netherlands.
Richardson, M., and Domingos, P. (2002). The intelligent surfer: Probabilistic combination of link and content information in PageRank. In Advances in Neural Information Processing Systems 14, pages 1441–1448.
Richardson, M., Prakash, A., and Brill, E. (2006). Beyond PageRank: Machine learning for static ranking. In Proceedings of the 15th International World Wide Web Conference, pages 707–715. Edinburgh, Scotland.
Rivest, R. (1992). The MD5 Message-Digest Algorithm. Technical Report 1321. Internet RFC.
Robertson, S., Zaragoza, H., and Taylor, M. (2004). Simple BM25 extension to multiple weighted fields. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management, pages 42–49. Washington, D.C.
Rose, D. E., and Levinson, D. (2004). Understanding user goals in web search. In Proceedings of the 13th International World Wide Web Conference, pages 13–19. New York.
Spink, A., and Jansen, B. J. (2004). A study of Web search trends. Webology, 1(2).
Upstill, T., Craswell, N., and Hawking, D. (2003). Query-independent evidence in home page finding. ACM Transactions on Information Systems, 21(3):286–313.
Wolf, J. L., Squillante, M. S., Yu, P. S., Sethuraman, J., and Ozsen, L. (2002). Optimal crawling strategies for Web search engines. In Proceedings of the 11th International World Wide Web Conference, pages 136–147. Honolulu, Hawaii.
Yi, K., Yu, H., Yang, J., Xia, G., and Chen, Y. (2003). Efficient maintenance of materialized top-k views. In Proceedings of the 19th International Conference on Data Engineering, pages 189–200.

Chapter 16: XML Retrieval

Abiteboul, S., Quass, D., McHugh, J., Widom, J., and Wiener, J. (1997). The Lorel query language for semistructured data. International Journal on Digital Libraries, 1(1):68–88.
Al-Khalifa, S., Jagadish, H. V., Patel, J. M., Wu, Y., Koudas, N., and Srivastava, D. (2002). Structural joins: A primitive for efficient XML query pattern matching. In Proceedings of the 18th IEEE International Conference on Data Engineering, pages 141–152.
Ali, M. S., Consens, M. P., Kazai, G., and Lalmas, M. (2008). Structural relevance: A common basis for the evaluation of structured document retrieval. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 1153–1162. Napa, California.
Amer-Yahia, S., Koudas, N., Marian, A., Srivastava, D., and Toman, D. (2005). Structure and content scoring for XML. In Proceedings of the 31st International Conference on Very Large Data Bases, pages 361–372. Trondheim, Norway.
Amer-Yahia, S., and Lalmas, M. (2006). XML search: Languages, INEX and scoring. SIGMOD Record, 35(4):16–23.
Bruno, N., Koudas, N., and Srivastava, D. (2002). Holistic twig joins: Optimal XML pattern matching. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 310–321. Madison, Wisconsin.
Chamberlin, D., Robie, J., and Florescu, D. (2000). Quilt: An XML query language for heterogeneous data sources. In Proceedings of WebDB 2000 Conference, pages 53–62.
Chu-Carroll, J., Prager, J., Czuba, K., Ferrucci, D., and Duboue, P. (2006). Semantic search via XML fragments: A high-precision approach to IR. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 445–452. Seattle, Washington.
Clarke, C. L. A. (2005). Controlling overlap in content-oriented XML retrieval. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 314–321. Salvador, Brazil.
Cluet, S., Siméoni, J., and De Voluceau, D. (2000). YATL: A functional and declarative language for XML. Bell Labs, Murray Hill, New Jersey.
Denoyer, L., and Gallinari, P. (2006). The Wikipedia XML corpus. ACM SIGIR Forum, 40(1):64–69.
Evjen, B., Sharkey, K., Thangarathinam, T., Kay, M., Vernet, A., and Ferguson, S. (2007). Professional XML (Programmer to Programmer). Indianapolis, Indiana: Wiley.
Fuhr, N., and Großjohann, K. (2001). XIRQL: A query language for information retrieval in XML documents. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 172–180. New Orleans, Louisiana.
Fuhr, N., Kamps, J., Lalmas, M., and Trotman, A., editors (2008). Focused Access to XML Documents: Proceedings of the 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, volume 4862 of Lecture Notes in Computer Science. Berlin, Germany: Springer.
Fuhr, N., Lalmas, M., Malik, S., and Szlávik, Z., editors (2005). Advances in XML Retrieval: Proceedings of the 3rd International Workshop of the Initiative for the Evaluation of XML Retrieval, volume 3493 of Lecture Notes in Computer Science. Berlin, Germany: Springer.
Fuhr, N., Lalmas, M., and Trotman, A., editors (2007). Proceedings of the 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, volume 4518 of Lecture Notes in Computer Science.
Gottlob, G., Koch, C., and Pichler, R. (2005). Efficient algorithms for processing XPath queries. ACM Transactions on Database Systems, 30(2):444–491.
Hockey, S. (2004). The reality of electronic editions. In Modiano, R., Searle, L., and Shillingsburg, P. L., editors, Voice, Text, Hypertext: Emerging Practices in Textual Studies, pages 361–377. Seattle, Washington: University of Washington Press.
Jiang, H., Wang, W., Lu, H., and Yu, J. X. (2003). Holistic twig joins on indexed XML documents. In Proceedings of the 29th International Conference on Very Large Data Bases, pages 273–284. Berlin, Germany.
Kamps, J., de Rijke, M., and Sigurbjörnsson, B. (2004). Length normalization in XML retrieval. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 80–87. Sheffield, England.
Kamps, J., Marx, M., de Rijke, M., and Sigurbjörnsson, B. (2006). Articulating information needs in XML query languages. ACM Transactions on Information Systems, 24(4):407–436.
Kaushik, R., Krishnamurthy, R., Naughton, J. F., and Ramakrishnan, R. (2004). On the integration of structure indexes and inverted lists. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 779–790. Paris, France.
Kazai, G., and Lalmas, M. (2006). eXtended cumulated gain measures for the evaluation of content-oriented XML retrieval. ACM Transactions on Information Systems, 24(4):503–542.
Kazai, G., Lalmas, M., and de Vries, A. P. (2004). The overlap problem in content-oriented XML retrieval evaluation. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 72–79. Sheffield, England.
Kekäläinen, J., Junkkari, M., Arvola, P., and Aalto, T. (2004). TRIX 2004 — Struggling with the overlap. In Proceedings of INEX 2004, pages 127–139. Dagstuhl, Germany. Published in LNCS 3493, see Fuhr et al. (2005).
Koolen, M., Kazai, G., and Craswell, N. (2009). Wikipedia pages as entry points for book search. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining.
Lehtonen, M. (2006). Preparing heterogeneous XML for full-text search. ACM Transactions on Information Systems, 24(4):455–474.
Lu, J., Ling, T. W., Chan, C. Y., and Chen, T. (2005). From region encoding to extended Dewey: On efficient processing of XML twig pattern matching. In Proceedings of the 31st International Conference on Very Large Data Bases, pages 193–204. Trondheim, Norway.
Mass, Y., and Mandelbrod, M. (2003). Retrieving the most relevant XML components. In Advances in XML Retrieval: Proceedings of the 3rd International Workshop of the Initiative for the Evaluation of XML Retrieval, number 3493 in Lecture Notes in Computer Science, pages 53–58. Berlin, Germany: Springer.
Mass, Y., and Mandelbrod, M. (2004). Component ranking and automatic query refinement for XML retrieval. In Proceedings of INEX 2004, pages 53–58. Dagstuhl, Germany. Published in LNCS 3493, see Fuhr et al. (2005).
Melton, J., and Buxton, S. (2006). Querying XML. San Francisco, California: Morgan Kaufmann.
Pehcevski, J., Thom, J. A., and Vercoustre, A. (2004). Hybrid XML retrieval re-visited. In Proceedings of INEX 2004, pages 153–167. Dagstuhl, Germany. Published in LNCS 3493, see Fuhr et al. (2005).
Piwowarski, B., and Lalmas, M. (2004). Providing consistent and exhaustive relevance assessments for XML retrieval evaluation. In Proceedings of the 13th ACM Conference on Information and Knowledge Management, pages 361–370. Washington, D.C.
Piwowarski, B., Trotman, A., and Lalmas, M. (2008). Sound and complete relevance assessment for XML retrieval. ACM Transactions on Information Systems, 27(1):1–37.
Theobald, M., Bast, H., Majumdar, D., Schenkel, R., and Weikum, G. (2008). TopX: Efficient and versatile top-k query processing for semistructured data. The VLDB Journal, 17(1):81–115.
Tittel, E., and Dykes, L. (2005). XML for Dummies (4th ed.). New York: Wiley.
Trotman, A. (2004). Searching structured documents. Information Processing & Management, 40(4):619–632.
Trotman, A., and Sigurbjörnsson, B. (2004). Narrowed Extended XPath I (NEXI). In Proceedings of INEX 2004. Dagstuhl, Germany. Published in LNCS 3493, see Fuhr et al. (2005).
van der Vlist, E. (2002). XML Schema: The W3C's Object-Oriented Descriptions for XML. Sebastopol, California: O'Reilly.
Vittaut, J., Piwowarski, B., and Gallinari, P. (2004). An algebra for structured queries in Bayesian networks. In Proceedings of INEX 2004, pages 100–112. Dagstuhl, Germany. Published in LNCS 3493, see Fuhr et al. (2005).
Wu, H., Kazai, G., and Taylor, M. (2008). Book search experiments: Investigating IR methods for the indexing and retrieval of books. In Proceedings of the 30th European Conference on Information Retrieval Research, pages 234–245.
Zhang, C., Naughton, J., DeWitt, D., Luo, Q., and Lohman, G. (2001). On supporting containment queries in relational database management systems. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 425–436. Santa Barbara, California.
Zhang, N., Özsu, M. T., Ilyas, I. F., and Aboulnaga, A. (2006). FIX: Feature-based indexing technique for XML documents. In Proceedings of the 32nd International Conference on Very Large Data Bases, pages 259–270. Seoul, South Korea.