I'm having some issues trying to understand what a non-boolean search (if that makes sense) is. As far as I've figured out, a boolean search allows the use of AND, OR and NOT in the query. But what about the non-boolean search? I've read somewhere that that means you can search substrings, not just complete words, but I wanted to make sure I fully understand what each of one means. Also, maybe an example would help, like, is Google boolean or non-boolean? What about Oracle Text or Apahe Solr?
Non boolean search includes approaches that are not purely boolean model techniques1.
Probably the most common example of such a method is the vector-space model.
In this model, each document is a vector, represented by the words (or bi-grams,...) it contains. The dimension of each document is the number of terms in the vocabulary.
The similarity in this model is done by creating a 'fake' document - which is the query, and comparing this fake document to any other document in the corpus. The more similar the document is to the query - the better the result is.
A common similarity measure is cosine-similarity.
This model goes well with the tf-idf model (The td-idf determines what is the value in each entry of each vector).
Note that this is NOT boolean model. You do not oporate on 'sets', you compare similarity of vectors - this is entirely different model.
In addition, this method has an important advantage - it returns a score associated to each document, and not only a boolean answer "relevant" or "not relevant".
As-is vector-space does not allow AND,OR oporations, however this is easily solveable by doing 2-phase search. The first is using a boolean model to get candidates, and the 2nd is using vector-space to get a score for each document.
Other models are building a language model out of a document - a language model is described as P(word|M) = the probability of the model M to generate the word
.
A common language model is P(word|document) = #occurances(word,document)/|document|
To avoid having probability zero - we usually add smoothing technique. a common technique is:
P(word|document) = alpha*#occurances(word,document)/|document| + (1-alpha)*#occurances(word,corpus)/|corpus|
Now - when we have a query of more than one term: q=t1 t2 ... tn
, we calculate:
P(q|d) = P(t1|d)*P(t2|d)*...*P(tn|d)
Note that this model actually allows AND semantics by setting alpha=1
, and OR semantics by alpha!=1
.
(1) Boolean search is basically set terminology:
Each term is associated with a set that contains all the documents that have this term in them. Now, you simply do set oporations on bunch of sets. AND is intersection, OR is union.