Lucene: best way to do "starts-with" queries


I want to be able to do the following types of queries:

The data to index consists of, let's say, music videos, where only the title is of interest. I simply want to index the titles and then query them so that, whatever word or words the user types, the documents whose title begins with those words, in that order, are returned first, followed (in no particular order) by documents whose title contains at least one of the searched words in any position. All of this should be case-insensitive.

Example:

For documents:

  • Video1Title = Sea is blue
  • Video2Title = Wild sea
  • Video3Title = Wild sea Whatever
  • Video4Title = Seaside Whatever

If I search "sea" I want to get

  • "Video1Title = Sea is blue"

first followed by all the other documents that contain "sea" in title, but not at the beginning.

If I search "Wild sea" I want to get

  • Video2Title = Wild sea
  • Video3Title = Wild sea Whatever

first followed by all the other documents that have "Wild" or "Sea" in their title but don't have "Wild Sea" as title prefix.

If I search "Seasi" I don't wanna get anything (I don't care for Keyword Tokenization and prefix queries).
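The behaviour described above can be pinned down with a small reference implementation. This is only a plain-Java sketch of the desired ordering, not Lucene code: `StartsWithRanking`, `tokens`, `startsWith` and `rank` are hypothetical names, and lowercasing plus whitespace splitting stands in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class StartsWithRanking {

    // Lowercase and split on whitespace; a stand-in for a real analyzer.
    static List<String> tokens(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // True if the first tokens of the title equal the query tokens, in order.
    static boolean startsWith(List<String> title, List<String> query) {
        return !query.isEmpty()
                && title.size() >= query.size()
                && title.subList(0, query.size()).equals(query);
    }

    // Titles that start with the query come first, then titles that
    // contain at least one whole query word; everything else is dropped.
    static List<String> rank(List<String> titles, String query) {
        List<String> q = tokens(query);
        List<String> prefixHits = new ArrayList<>();
        List<String> otherHits = new ArrayList<>();
        for (String title : titles) {
            List<String> t = tokens(title);
            if (startsWith(t, q)) {
                prefixHits.add(title);
            } else if (t.stream().anyMatch(q::contains)) {
                otherHits.add(title);
            }
        }
        prefixHits.addAll(otherHits);
        return prefixHits;
    }
}
```

Because matching is done on whole tokens, a search for "Seasi" returns nothing here, which is the behaviour asked for above.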

Now, AFAICS, there's no direct way to tell Lucene "find me documents where word1, word2, etc. are at positions 1, 2, etc."

There are "workarounds" to simulate that behaviour:

  • Index the field twice. In field1 the words are tokenized (using, say, StandardAnalyzer), and in field2 the whole title is kept as a single token (using KeywordAnalyzer). Then search with something like:

    +field1:(word1 word2 word3) field2:word1\ word2\ word3*

This effectively tells Lucene: "documents must contain word1 or word2 or word3 in the title, and those whose title starts with >word1 word2 word3< are better (get a higher score)".

  • Add a "lucene_start_token" to the beginning of the field at index time, so that Video2Title = Wild sea is indexed as "title:lucene_start_token Wild sea", and likewise for the other documents.

Then run a query such as:

+(title:sea) (title:"lucene_start_token sea")

This has Lucene return all documents that contain my search word(s) in the title, and gives a higher score to those that match "lucene_start_token" followed by the search words.
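For multi-word input, the start-token query string has to be assembled from the user's words. Below is a minimal plain-Java sketch of that assembly, assuming the classic query parser syntax; `StartTokenQuery`, `indexedTitle` and `build` are hypothetical helper names, and stripping everything except letters and digits is a simplification rather than proper query escaping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class StartTokenQuery {

    // Marker prepended to every title at index time (name taken from the question).
    static final String START_TOKEN = "lucene_start_token";

    // Value to index in the title field instead of the raw title.
    static String indexedTitle(String title) {
        return START_TOKEN + " " + title.trim();
    }

    // Builds: +field:(w1 w2 ...) field:"lucene_start_token w1 w2 ..."
    // The first clause is required (any of the words), the second is an
    // optional phrase clause that only raises the score of prefix matches.
    static String build(String field, String userInput) {
        List<String> words = new ArrayList<>();
        // Lowercasing also keeps AND/OR/NOT from being read as operators.
        for (String w : userInput.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        if (words.isEmpty()) {
            throw new IllegalArgumentException("no searchable words in input");
        }
        String joined = String.join(" ", words);
        return "+" + field + ":(" + joined + ") "
                + field + ":\"" + START_TOKEN + " " + joined + "\"";
    }
}
```

For example, `StartTokenQuery.build("title", "Wild sea")` yields `+title:(wild sea) title:"lucene_start_token wild sea"`. This only works if the analyzer keeps the marker as a single token and the marker never occurs in real titles.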

My question is then: are there indeed better ways to do this (maybe using PhraseQuery and term positions)? If not, which of the above is better performance-wise?


Solution

  • You can use Lucene payloads for that. They let you attach a custom boost to every term of the field value.

    So, when you index your titles, you can start with a boost factor of 3.0 (for example) for the first term and lower it for each following term:

    title: wild|3.0 creatures|2.5 blue|2.0 sea|1.5

    title: sea|3.0 creatures|2.5

    Indexed this way, terms closer to the start of the title get a higher boost.

    The main problem with this approach is that you have to tokenize the text yourself and add all the boost information "manually", because the Analyzer needs the text structured that way (term1|1.1 term2|3.0 term3).
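The "manual" step above, turning a title into `term|boost` text, can be sketched in plain Java. This is only an illustration: `PositionBoost` and `annotate` are hypothetical names, and the starting boost, step and floor are assumed parameters chosen to reproduce the example values. A delimiter-based payload filter (Lucene ships `DelimitedPayloadTokenFilter`, whose default delimiter is `|`) would then have to parse the output.

```java
import java.util.Locale;
import java.util.StringJoiner;

class PositionBoost {

    // Turns "Wild creatures blue sea" into "wild|3.0 creatures|2.5 blue|2.0 sea|1.5".
    // firstBoost applies to the first term; each later term gets `step` less,
    // but never less than minBoost.
    static String annotate(String title, float firstBoost, float step, float minBoost) {
        StringJoiner out = new StringJoiner(" ");
        float boost = firstBoost;
        for (String term : title.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue; // empty title
            }
            // Locale.ROOT keeps the decimal separator a dot on every machine.
            out.add(term + "|" + String.format(Locale.ROOT, "%.1f", boost));
            boost = Math.max(minBoost, boost - step);
        }
        return out.toString();
    }
}
```

Calling `PositionBoost.annotate("Sea creatures", 3.0f, 0.5f, 1.0f)` gives `sea|3.0 creatures|2.5`. The sketch does not handle terms that themselves contain `|`.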