Search code examples
search-enginelucene.net

Lucene partial word matching


Lucene does not support it out of the box, so I need some help building my query.

Lets say I have the document with a field value "Develop"

I would like this document to be returned for the searches "Dev" and "lop".

Maybe creating two queries?

"*keyword" 

and

"keyword*" 

and

"keyword"

?

How would you go about doing this with multiple words? Would you split the sentence/search into a words list and do the previous example for each word?


Solution

  • What you're asking is if I understand you correctly not feasible on any large scale search engine.
    Lucene creates an index over keywords using term-document matrix and inverted-file techniques (see links at the bottom). A fully fledged string matching might be very nice to have, but it does not scale: you will never be able to query a decently sized index (say more than a couple of dozen/hundreds of documents) in an acceptable time.

    Still, here are two ideas that might help...

    Syllable tokenization
    To come back to your example with 'Develop'. As long as you are happy with letting users search for syllables I guess you can do something. You would have to create use tokenizer that splits up words in your indexed according to their syllables and create a database index over the syllables. (I am not sure there are built in tokenizers for the English language that can do that and writing one on your own might be tricky...)

    An important thing to note:
    If you would index the full words AND the seperate syllables the size of your index will be much larger than if you only index one of the two.

    However I would not suggest to index only syllables. If you want to also allow your users to search for the full word 'Develop' (which I guess you want) this would result in two queries with a logical and between them, namely <'dev' AND 'lop'>. Although Lucene supports such logical constructs in queries they are very expensive. I have personally had some trouble in the past using logical queries in Lucene.

    Stemming
    Another way to somehow arrive at what you're trying could be to use a brutal form of word stemming (http://en.wikipedia.org/wiki/Stemming) that stems words to their first syllable. (This would allow to search for 'dev' but not for 'lop'...)
    Again, I don't think such a word stem feature is already in Lucene. Writing one for yourself will be a pain and involve working with/importing huge dictionaries.

    Links
    These might be looking into if you don't know about search engine internals:
    http://en.wikipedia.org/wiki/Index_%28search_engine%29
    http://en.wikipedia.org/wiki/Vector_space_model
    http://en.wikipedia.org/wiki/Inverted_file
    http://en.wikipedia.org/wiki/Term-document_matrix
    http://en.wikipedia.org/wiki/Tf-idf