
Solr - How to tokenize words in a string in a compounding "word-1, word-1 + word-2, word-1 + word-2 ... word-n" manner?


I want to tokenize a string such as "Best Beat Makers" to generate tokens per word in an almost NGram-like fashion, for example:

IN:  "Best Beat Makers"
OUT: ["Best", "Beat", "Makers", "Best Beat", "Best Beat Makers"]
                                     ^               ^
                                     |               |
                              How can I generate these tokens?

The result should not include "Beat Makers" because I only want to tokenize words in a compounding fashion (e.g. word1, word1 + word2, word1 + word2 + word3, etc.) and not in combination (e.g. word1, word1 + word2, word2 + word3, etc.).
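
To make the target output concrete, here is a minimal sketch in plain Python (not Solr code). The function name `compounding_tokens` is made up for illustration, and it assumes simple whitespace splitting rather than any Solr tokenizer.

```python
def compounding_tokens(text):
    """Return each word, followed by every growing prefix of two or more words."""
    # Assumption: plain whitespace splitting stands in for a real tokenizer.
    words = text.split()
    # Prefixes of length 2..n: word1 + word2, word1 + word2 + word3, ...
    prefixes = [" ".join(words[:n]) for n in range(2, len(words) + 1)]
    return words + prefixes


print(compounding_tokens("Best Beat Makers"))
```

Every multi-word token starts at the first word, which is why "Beat Makers" never appears.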

Currently, I am only able to generate the first three tokens by using StandardTokenizerFactory or ClassicTokenizerFactory, and the traditional NGramTokenizerFactory only works on the characters within a word (and is a bit expensive at index time).

One option I've considered is using StandardTokenizerFactory to get the first three tokens and then creating a copyField to another field that uses a PatternTokenizerFactory with a regex defined to get the last two tokens, but I would prefer to get the tokens I need using only one field if possible.

If you are more familiar with Elasticsearch, I would still like to hear your thoughts, since the tokenizers in Solr and ES are more or less similar and your suggestions might push me in the right direction. Thanks!


Solution

  • Shingle Filter : This filter constructs shingles, which are token n-grams, from the token stream. It combines runs of tokens into a single token.

    You can also set the following property.

    maxShingleSize : (integer, must be >= minShingleSize, default 2) The maximum number of tokens per shingle.

    Here is the field type with the filter applied.

    <fieldType name="text_tokens" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
        </analyzer>
    </fieldType>
    

    Input: "Welcome to Apache Solr"

    The expected output is:

    Unigram: "Welcome", "to", "Apache", "Solr"
    Bigram: "Welcome to", "to Apache", "Apache Solr"
    Trigram: "Welcome to Apache", "to Apache Solr"
    

    Below is the analysis for the text you shared.

    Input: Best Beat Makers

    [image: analysis output for the input above]
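
The shingling described in this answer can be approximated outside Solr with a short Python sketch. This is only an illustration: `shingles` is a hypothetical helper, it splits on whitespace instead of using StandardTokenizerFactory, and it groups tokens by shingle size for readability rather than reproducing the order of Solr's token stream.

```python
def shingles(text, max_size=3, output_unigrams=True):
    """Approximate word shingles: every run of adjacent words up to max_size long."""
    # Assumption: plain whitespace splitting stands in for the Solr tokenizer.
    words = text.split()
    smallest = 1 if output_unigrams else 2
    result = []
    for size in range(smallest, max_size + 1):
        # Slide a window of `size` words across the input.
        for start in range(len(words) - size + 1):
            result.append(" ".join(words[start:start + size]))
    return result


print(shingles("Welcome to Apache Solr"))
print(shingles("Best Beat Makers"))
```

Under this sketch, "Best Beat Makers" yields "Best", "Beat", "Makers", "Best Beat", "Beat Makers" and "Best Beat Makers". The shingle output is therefore a superset of the tokens requested in the question, since it also contains "Beat Makers".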