algorithm search full-text-search search-engine information-retrieval

Boolean vs non-boolean search

I'm having some issues trying to understand what a non-boolean search (if that makes sense) is. As far as I've figured out, a boolean search allows the use of AND, OR and NOT in the query. But what about the non-boolean search? I've read somewhere that that means you can search substrings, not just complete words, but I wanted to make sure I fully understand what each of one means. Also, maybe an example would help, like, is Google boolean or non-boolean? What about Oracle Text or Apahe Solr?

Solution

Non boolean search includes approaches that are not purely boolean model techniques¹.

Probably the most common example of such a method is the vector-space model.
In this model, each document is a vector, represented by the words (or bi-grams,...) it contains. The dimension of each document is the number of terms in the vocabulary.

The similarity in this model is done by creating a 'fake' document - which is the query, and comparing this fake document to any other document in the corpus. The more similar the document is to the query - the better the result is.

A common similarity measure is cosine-similarity.
This model goes well with the tf-idf model (The td-idf determines what is the value in each entry of each vector).

Note that this is NOT boolean model. You do not oporate on 'sets', you compare similarity of vectors - this is entirely different model.
In addition, this method has an important advantage - it returns a score associated to each document, and not only a boolean answer "relevant" or "not relevant".

As-is vector-space does not allow AND,OR oporations, however this is easily solveable by doing 2-phase search. The first is using a boolean model to get candidates, and the 2nd is using vector-space to get a score for each document.

Other models are building a language model out of a document - a language model is described as P(word|M) = the probability of the model M to generate the word.

A common language model is P(word|document) = #occurances(word,document)/|document|
To avoid having probability zero - we usually add smoothing technique. a common technique is:

P(word|document) =  alpha*#occurances(word,document)/|document| + (1-alpha)*#occurances(word,corpus)/|corpus|

Now - when we have a query of more than one term: q=t1 t2 ... tn, we calculate:

P(q|d) = P(t1|d)*P(t2|d)*...*P(tn|d)

Note that this model actually allows AND semantics by setting alpha=1, and OR semantics by alpha!=1.

(1) Boolean search is basically set terminology:

Each term is associated with a set that contains all the documents that have this term in them. Now, you simply do set oporations on bunch of sets. AND is intersection, OR is union.