Information Retrieval: Implementing and Evaluating Search Engines        

Authors & Publisher

Stefan Büttcher, Google Inc.
Charles L. A. Clarke, Univ. of Waterloo
Gordon V. Cormack, Univ. of Waterloo
 
MIT Press, 2010

About the Book

Information retrieval is the foundation for modern search engines. This textbook offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation. The emphasis is on implementation and experimentation; each chapter includes exercises and suggestions for student projects. Wumpus, a multi-user open-source information retrieval system developed by one of the authors and available online, provides model implementations and a basis for student work.
The modular structure of the book allows instructors to use it in a variety of graduate-level courses, including courses taught from a database systems implementation perspective, traditional information retrieval courses with a focus on IR theory, and courses covering the basics of Web retrieval. Additionally, professionals in computer science, computer engineering, and software engineering will find Information Retrieval a valuable reference.
After an introduction to the basics of information retrieval, the text covers three major topic areas — indexing, retrieval, and evaluation — in self-contained parts. The final part of the book draws on and extends the general material in the earlier parts, treating specific application areas, including parallel search engines, link analysis, crawling, and information retrieval over collections of XML documents. End-of-chapter references point to further reading; end-of-chapter exercises range from pencil-and-paper problems to substantial programming projects.

Table of Contents

—   I  Foundations   —
1  Introduction
1.1  What Is Information Retrieval?
1.1.1  Web Search
1.1.2  Other Search Applications
1.1.3  Other IR Applications
1.2  Information Retrieval Systems
1.2.1  Basic IR System Architecture
1.2.2  Documents and Update
1.2.3  Performance Evaluation
1.3  Working with Electronic Text
1.3.1  Text Formats
1.3.2  A Simple Tokenization of English Text
1.3.3  Term Distributions
1.3.4  Language Modeling
1.4  Test Collections
1.4.1  TREC Tasks
1.5  Open-Source IR Systems
1.5.1  Lucene
1.5.2  Indri
1.5.3  Wumpus
1.6  Further Reading
1.7  Exercises
1.8  Bibliography
2  Basic Techniques
2.1  Inverted Indices
2.1.1  Extended Example: Phrase Search
2.1.2  Implementing Inverted Indices
2.1.3  Documents and Other Elements
2.2  Retrieval and Ranking
2.2.1  The Vector Space Model
2.2.2  Proximity Ranking
2.2.3  Boolean Retrieval
2.3  Evaluation
2.3.1  Recall and Precision
2.3.2  Effectiveness Measures for Ranked Retrieval
2.3.3  Building a Test Collection
2.3.4  Efficiency Measures
2.4  Summary
2.5  Further Reading
2.6  Exercises
2.7  Bibliography
3  Tokens and Terms
3.1  English
3.1.1  Punctuation and Capitalization
3.1.2  Stemming
3.1.3  Stopping
3.2  Characters
3.3  Character N-Grams
3.4  European Languages
3.5  CJK Languages
3.6  Further Reading
3.7  Exercises
3.8  Bibliography
—   II  Indexing   —
4  Static Inverted Indices
4.1  Index Components and Index Life Cycle
4.2  The Dictionary
4.3  Postings Lists
4.4  Interleaving Dictionary and Postings Lists
4.5  Index Construction
4.5.1  In-Memory Index Construction
4.5.2  Sort-Based Index Construction
4.5.3  Merge-Based Index Construction
4.6  Other Types of Indices
4.7  Summary
4.8  Further Reading
4.9  Exercises
4.10  Bibliography
5  Query Processing
5.1  Query Processing for Ranked Retrieval
5.1.1  Document-at-a-Time Query Processing
5.1.2  Term-at-a-Time Query Processing
5.1.3  Precomputing Score Contributions
5.1.4  Impact Ordering
5.1.5  Static Index Pruning
5.2  Lightweight Structure
5.2.1  Generalized Concordance Lists
5.2.2  Operators
5.2.3  Examples
5.2.4  Implementation
5.3  Further Reading
5.4  Exercises
5.5  Bibliography
6  Index Compression
6.1  General-Purpose Data Compression
6.2  Symbolwise Data Compression
6.2.1  Modeling and Coding
6.2.2  Huffman Coding
6.2.3  Arithmetic Coding
6.2.4  Symbolwise Text Compression
6.3  Compressing Postings Lists
6.3.1  Nonparametric Gap Compression
6.3.2  Parametric Gap Compression
6.3.3  Context-Aware Compression Methods
6.3.4  Index Compression for High Query Performance
6.3.5  Compression Effectiveness
6.3.6  Decoding Performance
6.3.7  Document Reordering
6.4  Compressing the Dictionary
6.5  Summary
6.6  Further Reading
6.7  Exercises
6.8  Bibliography
7  Dynamic Inverted Indices
7.1  Batch Updates
7.2  Incremental Index Updates
7.2.1  Contiguous Inverted Lists
7.2.2  Noncontiguous Inverted Lists
7.3  Document Deletions
7.3.1  Invalidation List
7.3.2  Garbage Collection
7.4  Document Modifications
7.5  Discussion and Further Reading
7.6  Exercises
7.7  Bibliography
—   III  Retrieval and Ranking   —
8  Probabilistic Retrieval
8.1  Modeling Relevance
8.2  The Binary Independence Model
8.3  The Robertson/Spärck Jones Weighting Formula
8.4  Term Frequency
8.4.1  Bookstein's Two-Poisson Model
8.4.2  Approximating the Two-Poisson Model
8.4.3  Query Term Frequency
8.5  Document Length: BM25
8.6  Relevance Feedback
8.6.1  Term Selection
8.6.2  Pseudo-Relevance Feedback
8.7  Field Weights: BM25F
8.8  Experimental Comparison
8.9  Further Reading
8.10  Exercises
8.11  Bibliography
9  Language Modeling and Related Methods
9.1  Generating Queries From Documents
9.2  Language Models and Smoothing
9.3  Ranking with Language Models
9.4  Kullback-Leibler Divergence
9.5  Divergence from Randomness
9.5.1  A Model of Randomness
9.5.2  Eliteness
9.5.3  Document Length Normalization
9.6  Passage Retrieval and Ranking
9.6.1  Passage Scoring
9.6.2  Implementation
9.7  Experimental Comparison
9.8  Further Reading
9.9  Exercises
9.10  Bibliography
10  Categorization and Filtering
10.1  Detailed Examples
10.1.1  Topic-Oriented Batch Filtering
10.1.2  On-Line Filtering
10.1.3  Learning from Historical Examples
10.1.4  Language Categorization
10.1.5  On-Line Adaptive Spam Filtering
10.1.6  Threshold Choice for Binary Categorization
10.2  Classification
10.2.1  Odds and Odds Ratios
10.2.2  Building Classifiers
10.2.3  Learning Modes
10.2.4  Feature Engineering
10.3  Probabilistic Classifiers
10.3.1  Probability Estimates
10.3.2  Combining Probability Estimates
10.3.3  Practical Considerations
10.4  Linear Classifiers
10.4.1  Perceptron Algorithm
10.4.2  Support Vector Machines
10.5  Similarity-Based Classifiers
10.5.1  Rocchio's Method
10.5.2  Memory-Based Methods
10.6  Generalized Linear Models
10.6.1  Kernel Methods
10.7  Information-Theoretic Models
10.7.1  Comparing Models
10.7.2  Sequential Compression Models
10.7.3  Decision Trees and Stumps
10.8  Experimental Comparison
10.8.1  Topic-Oriented On-Line Filtering
10.8.2  On-Line Adaptive Spam Filtering
10.9  Further Reading
10.10  Exercises
10.11  Bibliography
11  Fusion and Metalearning
11.1  Search-Result Fusion
11.1.1  Fixed-Cutoff Aggregation
11.1.2  Rank and Score Aggregation
11.2  Stacking Adaptive Filters
11.3  Stacking Batch Classifiers
11.3.1  Holdout Validation
11.3.2  Cross-Validation
11.4  Bagging
11.5  Boosting
11.6  Multicategory Ranking and Classification
11.6.1  Document Versus Category Scores
11.6.2  Document Versus Category Rank Fusion
11.6.3  Multicategory Methods
11.7  Learning to Rank
11.7.1  What Is Learning to Rank?
11.7.2  Learning-to-Rank Methods
11.7.3  What to Optimize?
11.7.4  Learning to Rank for Categorization
11.7.5  Learning for Ranked IR
11.7.6  The LETOR Data Set
11.8  Further Reading
11.9  Exercises
11.10  Bibliography
—   IV  Evaluation   —
12  Measuring Effectiveness
12.1  Traditional Effectiveness Measures
12.1.1  Recall and Precision
12.1.2  Precision at k Documents (P@k)
12.1.3  Average Precision (AP)
12.1.4  Reciprocal Rank (RR)
12.1.5  Arithmetic Versus Geometric Mean
12.1.6  User Satisfaction
12.2  The Text REtrieval Conference
12.3  Using Statistics in Evaluation
12.3.1  Foundations and Terminology
12.3.2  Confidence Intervals
12.3.3  Comparative Evaluation
12.3.4  Hypothesis Tests Considered Harmful
12.3.5  Paired and Unpaired Differences
12.3.6  Significance Tests
12.3.7  Validity and Power of Statistical Tests
12.3.8  Reporting the Precision of Measurement
12.3.9  Meta-Analysis
12.4  Minimizing Adjudication Effort
12.4.1  Selecting Documents for Adjudication
12.4.2  Sampling the Pool
12.5  Nontraditional Effectiveness Measures
12.5.1  Graded Relevance
12.5.2  Incomplete and Biased Judgments
12.5.3  Novelty and Diversity
12.6  Further Reading
12.7  Exercises
12.8  Bibliography
13  Measuring Efficiency
13.1  Efficiency Criteria
13.1.1  Throughput and Latency
13.1.2  Aggregate Statistics and User Satisfaction
13.2  Queueing Theory
13.2.1  Kendall's Notation
13.2.2  The M/M/1 Queueing Model
13.2.3  Latency Quantiles and Average Utilization
13.3  Query Scheduling
13.4  Caching
13.4.1  Three-Level Caching
13.4.2  Cache Policies
13.4.3  Prefetching Search Results
13.5  Further Reading
13.6  Exercises
13.7  Bibliography
—   V  Applications and Extensions   —
14  Parallel Information Retrieval
14.1  Parallel Query Processing
14.1.1  Document Partitioning
14.1.2  Term Partitioning
14.1.3  Hybrid Schemes
14.1.4  Redundancy and Fault Tolerance
14.2  MapReduce
14.2.1  The Basic Framework
14.2.2  Combiners
14.2.3  Secondary Keys
14.2.4  Machine Failures
14.3  Further Reading
14.4  Exercises
14.5  Bibliography
15  Web Search
15.1  The Structure of the Web
15.1.1  The Web Graph
15.1.2  Static and Dynamic Pages
15.1.3  The Hidden Web
15.1.4  The Size of the Web
15.2  Queries and Users
15.2.1  User Intent
15.2.2  Clickthrough Curves
15.3  Static Ranking
15.3.1  Basic PageRank
15.3.2  Extended PageRank
15.3.3  Properties of PageRank
15.3.4  Other Link Analysis Methods: HITS and SALSA
15.3.5  Other Static Ranking Methods
15.4  Dynamic Ranking
15.4.1  Anchor Text
15.4.2  Novelty
15.5  Evaluating Web Search
15.5.1  Named Page Finding
15.5.2  Implicit User Feedback
15.6  Web Crawlers
15.6.1  Components of a Crawler
15.6.2  Crawl Order
15.6.3  Duplicates and Near-Duplicates
15.7  Summary
15.8  Further Reading
15.8.1  Link Analysis
15.8.2  Anchor Text
15.8.3  Implicit Feedback
15.8.4  Web Crawlers
15.9  Exercises
15.10  Bibliography
16  XML Retrieval
16.1  The Essence of XML
16.1.1  Document Type Definitions
16.1.2  XML Schema
16.2  Paths, Trees, and FLWORs
16.2.1  XPath
16.2.2  NEXI
16.2.3  XQuery
16.3  Indexing and Query Processing
16.4  Ranked Retrieval
16.4.1  Ranking Elements
16.4.2  Overlapping Elements
16.4.3  Retrievable Elements
16.5  Evaluation
16.5.1  Test Collections
16.5.2  Effectiveness Measures
16.6  Further Reading
16.7  Exercises
16.8  Bibliography
—   VI  Appendix   —
A  Computer Performance
A.1  Sequential Versus Random Access on Disk
A.2  Sequential Versus Random Access in RAM
A.3  Pipelined Execution and Branch Prediction

Additional Resources