Apache Solr word level ngram


I have to configure Solr for word-level n-grams (unigrams, bigrams and trigrams). For example, if the input (at index or query time) is:

"Welcome to Apache Solr"

it should be tokenized as:

Unigram: "Welcome", "to", "Apache", "Solr"
Bigram: "Welcome to", "to Apache", "Apache Solr"
Trigram: "Welcome to Apache", "to Apache Solr"

How can I get this from Solr? I have consulted the default Solr guide, but I could not find a word-level n-gram tokenizer.


Solution

  • You can use the Shingle Filter here.

    This filter constructs shingles, which are token n-grams, from the token stream. It combines runs of tokens into a single token.

    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.ShingleFilterFactory"/>
    </analyzer>
    

    In: "To be, or what?"

    Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)

    Out: "To"(1), "To be"(1), "be"(2), "be or"(2), "or"(3), "or what"(3), "what"(4)

    You can also set the following property:

    maxShingleSize : (integer, must be >= minShingleSize, default 2) The maximum number of tokens per shingle.
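
    As a minimal sketch, assuming you want to stop at trigrams, the analyzer below sets both size limits explicitly. minShingleSize, maxShingleSize and outputUnigrams are attributes of the Shingle Filter; the values are chosen only to illustrate the question's uni/bi/trigram case.

    ```xml
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- outputUnigrams="true" keeps the single words; shingles span 2 to 3 tokens -->
      <filter class="solr.ShingleFilterFactory"
              minShingleSize="2"
              maxShingleSize="3"
              outputUnigrams="true"/>
    </analyzer>
    ```

    With these limits, no shingle longer than three words should be produced.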

    I tried this with the text from your question.

    Here is the field type I applied:

    <fieldType name="text_tokens" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true"/>
      </analyzer>
    </fieldType>
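
    A field type only takes effect once a field refers to it. Here is a minimal sketch of such a declaration in the schema, assuming a hypothetical field name title_shingles (the name and the indexed/stored settings are illustrative, not part of the setup I tested):

    ```xml
    <!-- hypothetical field; "text_tokens" is the fieldType defined above -->
    <field name="title_shingles" type="text_tokens" indexed="true" stored="true"/>
    ```

    Because the analyzer element has no type attribute, the same shingling is applied at both index and query time.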
    

    The expected output is:

    Unigram: "Welcome", "to", "Apache", "Solr"
    Bigram: "Welcome to", "to Apache", "Apache Solr"
    Trigram: "Welcome to Apache", "to Apache Solr"
    

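    The output produced by the above field type can be seen on the Solr Analysis page.

    It covers all the expected tokens listed below. Because maxShingleSize is 4, it also emits the four-word shingle "Welcome to Apache Solr"; set maxShingleSize="3" to stop at trigrams.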

    unigram: Welcome, to, Apache, Solr
    bigram: Welcome to, to Apache, Apache Solr
    trigram: Welcome to Apache, to Apache Solr
    

    For more details, please refer to the link below: Shingle Filter Example