search-engine information-retrieval

Proximity Search in Search Engines


Could someone explain why search engines do not exploit term proximity when ranking pages? What limitations prevent them from using proximity explicitly?


Solution

  • To use proximity information directly, an index must store the position of every occurrence of each term within a document as part of that term's postings list. A positional index is typically 4x-5x the size of a standard (non-positional) index. This not only consumes extra I/O, but can also slow retrieval, since scoring now has to take the position of each match (query term against document term) into account as well.

    But a search engine can't simply ignore term proximity, because it plays an important role in capturing latent semantic concepts, especially for multi-word expressions. A standard and efficient solution is therefore to compile a list of the most common phrases in a collection and index these phrases as a whole (i.e. treat each one as a separate term in the inverted index). For example, a search engine might have separate postings lists for the terms "German" and "Shepherd" and for the phrase "German Shepherd". This way, documents that contain the phrase "German Shepherd" can be ranked higher than those that match only "German" or only "Shepherd".
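
To make the trade-off concrete, here is a minimal Python sketch of both approaches on a toy collection. It is only an illustration under simplifying assumptions (whitespace tokenization, two-word phrases only, everything in memory); the function names and the sample documents are made up for this example and do not come from any particular search engine.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real analyzer.
    return text.lower().split()


def build_positional_index(docs):
    """Positional index: term -> {doc_id: [positions of each occurrence]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def adjacent_match(index, first, second):
    """Doc ids where `second` occurs immediately after `first`.

    Every candidate document needs its position lists compared,
    which is the extra scoring work a positional index implies.
    """
    hits = set()
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.add(doc_id)
    return hits


def build_phrase_index(docs, phrases):
    """Non-positional index: term or whole phrase -> set of doc ids.

    `phrases` is a precompiled set of common two-word phrases that are
    indexed as if they were single terms.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        for term in tokens:
            index[term].add(doc_id)
        bigrams = {" ".join(pair) for pair in zip(tokens, tokens[1:])}
        for phrase in phrases & bigrams:
            index[phrase].add(doc_id)
    return index


# Hypothetical toy collection.
docs = {
    1: "the german shepherd is a working dog",
    2: "a german farmer hired a shepherd",
    3: "shepherd dogs guard sheep",
}

positional = build_positional_index(docs)
phrase_index = build_phrase_index(docs, {"german shepherd"})

print(adjacent_match(positional, "german", "shepherd"))  # docs with the exact phrase
print(phrase_index["german shepherd"])                   # same answer, one lookup
print(phrase_index["german"], phrase_index["shepherd"])  # single-term matches
```

Both lookups identify document 1 as the only one containing the phrase, but the positional version stores one entry per term occurrence and has to compare position lists at query time, while the phrase index answers with a single postings lookup. The cost of the second approach is that it only helps for phrases that were chosen in advance.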