Search code examples
searchindexinglucenesearch-engineinverted-index

Efficient low-cardinality ANDs in a search engine


How do search engines such as Lucene, etc. perform AND queries where a term is common to many documents in the dataset? For example, in an inverted index of:

term    | document_id
---------------------
program | 1, 2, 3, 5...
python  | 1, 4
code    | 4
c++     | 4, 5

the term program is present in several documents meaning a query of program AND code would require performing an intersection upon a very large set of documents.

Is there a way to perform AND queries without having to take the intersection of terms contained by potentially billions of documents?


Solution

  • the term program is present in several documents meaning a query of program AND code would require performing an intersection upon a very large set of documents.

    Yes. Say you have the following query:

    term1 AND term2 AND term3

    You first need to compute the document frequency of each positive term. You pick the word with the lowest count.

    You retrieve the documents that contains the least common term from the query. Those are the candidates. Then you filter and score those candidates with the query with a finite state machine.

    So the database has several subspace:

    1. A mapping from lemma or stem or term to document frequency (as in tfidf)
    2. The actual inverted-index that allows to retrieve the documents that contains a given lemma
    3. A mapping between the document id and the full text representation of the document or just bag of words depending on how advanced is your query logic.

    Then the filter + score step can happen in parallel