
Boolean vs non-boolean search


I'm having some issues trying to understand what a non-boolean search (if that makes sense) is. As far as I've figured out, a boolean search allows the use of AND, OR and NOT in the query. But what about non-boolean search? I've read somewhere that it means you can search substrings, not just complete words, but I want to make sure I fully understand what each one means. Also, maybe an example would help: is Google boolean or non-boolean? What about Oracle Text or Apache Solr?


Solution

  • Non-boolean search covers retrieval approaches that are not purely boolean-model techniques1.

    Probably the most common example of such a method is the vector-space model.
    In this model, each document is a vector whose entries correspond to the terms (words, bi-grams, ...) it contains. The dimension of each document vector is the number of terms in the vocabulary.

    Similarity in this model is computed by treating the query as a 'fake' document and comparing that fake document to every other document in the corpus. The more similar a document is to the query, the better a result it is.

    A common similarity measure is cosine similarity.
    This model goes well with tf-idf weighting (tf-idf determines the value in each entry of each vector).

    Note that this is NOT the boolean model. You do not operate on 'sets'; you compare the similarity of vectors, which is an entirely different model.
    In addition, this method has an important advantage: it returns a score for each document, not just a boolean answer of "relevant" or "not relevant".

    As-is, the vector-space model does not support AND/OR operations, but this is easily solved with a two-phase search: first use a boolean model to get candidates, then use the vector-space model to score each candidate.
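A minimal sketch of this tf-idf + cosine-similarity ranking in plain Python (the toy corpus and query here are made up for illustration, not part of the answer):

```python
import math
from collections import Counter

# Toy corpus and query (illustrative assumptions).
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
query = "cat dog"

docs = [doc.split() for doc in corpus]
vocab = sorted({term for doc in docs for term in doc})
N = len(docs)

def tf_idf_vector(terms):
    """Build a tf-idf vector over the shared vocabulary."""
    counts = Counter(terms)
    vec = []
    for term in vocab:
        tf = counts[term]
        df = sum(1 for d in docs if term in d)
        idf = math.log(N / df) if df else 0.0
        vec.append(tf * idf)
    return vec

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# The query is treated as a 'fake' document and compared to every real one.
query_vec = tf_idf_vector(query.split())
scores = [(cosine(query_vec, tf_idf_vector(d)), text)
          for d, text in zip(docs, corpus)]
for score, text in sorted(scores, reverse=True):
    print(f"{score:.3f}  {text}")
```

Note that every document gets a graded score rather than a yes/no answer, which is exactly the advantage over the boolean model described above.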


    Other models build a language model out of each document. A language model is described as P(word|M): the probability that the model M generates the word.

    A common language model is P(word|document) = #occurrences(word,document)/|document|
    To avoid zero probabilities we usually add a smoothing technique. A common one is:

    P(word|document) = alpha*#occurrences(word,document)/|document| + (1-alpha)*#occurrences(word,corpus)/|corpus|
    

    Now, when we have a query of more than one term, q = t1 t2 ... tn, we calculate:

    P(q|d) = P(t1|d)*P(t2|d)*...*P(tn|d)
    

    Note that this model actually allows AND semantics by setting alpha=1 (a document missing any query term gets probability zero, so the whole product is zero), and OR semantics with alpha!=1 (the corpus term keeps every factor positive).
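The smoothed query-likelihood formulas above can be sketched as follows (the corpus, query, and the choice alpha=0.7 are illustrative assumptions):

```python
from collections import Counter

# Toy corpus (illustrative assumption).
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]
docs = [d.split() for d in corpus]
collection = [t for d in docs for t in d]
coll_counts = Counter(collection)

def p_word(word, doc, alpha):
    """alpha*P(word|document) + (1-alpha)*P(word|corpus), as in the formula above."""
    doc_counts = Counter(doc)
    p_doc = doc_counts[word] / len(doc)
    p_coll = coll_counts[word] / len(collection)
    return alpha * p_doc + (1 - alpha) * p_coll

def query_likelihood(query, doc, alpha):
    """P(q|d) = product of P(ti|d) over the query terms."""
    p = 1.0
    for term in query.split():
        p *= p_word(term, doc, alpha)
    return p

# With alpha=1, a document missing any query term scores 0 (AND semantics);
# with alpha<1, the corpus term keeps the score positive (OR-like semantics).
print(query_likelihood("cat dog", docs[0], alpha=1.0))  # docs[0] has no "dog"
print(query_likelihood("cat dog", docs[0], alpha=0.7))
```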


    (1) Boolean search is basically set theory:

    Each term is associated with the set of all documents that contain it. You then simply perform set operations on those sets: AND is intersection, OR is union, and NOT is set difference.
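For comparison, the boolean model maps directly onto set operations. A sketch with a hypothetical inverted index (the terms and document ids are made up):

```python
# Inverted index: term -> set of ids of documents containing that term.
index = {
    "cat": {0, 1},
    "dog": {1, 2},
    "mat": {0},
}

# AND is intersection, OR is union, NOT is set difference.
cat_and_dog = index["cat"] & index["dog"]   # docs with both terms
cat_or_dog  = index["cat"] | index["dog"]   # docs with either term
cat_not_mat = index["cat"] - index["mat"]   # docs with "cat" but not "mat"

print(cat_and_dog, cat_or_dog, cat_not_mat)
```

Each document is either in the result set or not; unlike the vector-space model, there is no ranking score.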