Solr - How to tokenize words in a string in a compounding "word-1, word-1 + word-2, word-1 + word-2 ... word-n" manner?

I want to tokenize a string such as Best Beat Makers to generate tokens per word in an almost NGram-like fashion, for example:

IN:  "Best Beat Makers"
OUT: ["Best", "Beat", "Makers", "Best Beat", "Best Beat Makers"]
                                     ^               ^
                                     |               |
                              How can I generate these tokens?

The result should not include "Beat Makers" because I only want to tokenize words in a compounding fashion (e.g. word1, word1 + word2, word1 + word2 + word3, etc.) and not in every combination (e.g. word1, word1 + word2, word2 + word3, etc.).
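To make the requirement concrete, here is a minimal sketch of the desired output in plain Python. It is not Solr code, and the function name compounding_tokens is hypothetical; it only illustrates the "word1, word1 + word2, ..." rule, with a whitespace split standing in for a real tokenizer.

```python
def compounding_tokens(text):
    """Return the single words, then every leading run of two or more words."""
    # Assumption: a plain whitespace split stands in for a real tokenizer.
    words = text.split()
    # Every multi-word token is anchored at the first word (lengths 2..n).
    prefixes = [" ".join(words[:n]) for n in range(2, len(words) + 1)]
    return words + prefixes


print(compounding_tokens("Best Beat Makers"))
# ['Best', 'Beat', 'Makers', 'Best Beat', 'Best Beat Makers']
```

Because every multi-word token starts at the first word, "Beat Makers" never appears.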

Currently, I am only able to generate the first three tokens by using StandardTokenizerFactory or ClassicTokenizerFactory, and the traditional NGramTokenizerFactory only works for characters of a word (and is a bit expensive on indexing).

One option I've considered is using StandardTokenizerFactory to get the first three tokens and then creating a copyField to another field that uses a PatternTokenizerFactory with a regex defined to get the last two tokens, but I would prefer to get the tokens I need using only one field if possible.

If you are more familiar with ElasticSearch, I would still like to hear your thoughts since the tokenizers between Solr and ES are more or less similar and might push me in the right direction. Thanks!

Solution

Shingle Filter: This filter constructs shingles, which are token n-grams, from the token stream. It combines runs of tokens into a single token.

You can also set the following property.

maxShingleSize : (integer, must be >= minShingleSize, default 2) The maximum number of tokens per shingle.

Here is the field type I applied.

<fieldType name="text_tokens" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
    </analyzer>
</fieldType>

Input: "Welcome to Apache Solr"

The expected output is:

Unigram: "Welcome", "to", "Apache", "Solr"
Bigram: "Welcome to", "to Apache", "Apache Solr"
Trigram: "Welcome to Apache", "to Apache Solr"
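The grouping above can be reproduced with a small plain-Python simulation of word-level shingling. This is only a sketch of the idea, not the Solr implementation: the function name shingles and its parameters are hypothetical stand-ins for maxShingleSize and outputUnigrams, and a whitespace split replaces StandardTokenizerFactory.

```python
def shingles(text, max_shingle_size=2, output_unigrams=True):
    """Group word n-grams by size, roughly mimicking a shingle filter."""
    # Assumption: a whitespace split stands in for the real tokenizer.
    words = text.split()
    smallest = 1 if output_unigrams else 2
    grouped = {}
    for size in range(smallest, max_shingle_size + 1):
        # Slide a window of `size` words over every start position.
        grouped[size] = [
            " ".join(words[i:i + size])
            for i in range(len(words) - size + 1)
        ]
    return grouped


for size, tokens in shingles("Welcome to Apache Solr", max_shingle_size=3).items():
    print(size, tokens)
# 1 ['Welcome', 'to', 'Apache', 'Solr']
# 2 ['Welcome to', 'to Apache', 'Apache Solr']
# 3 ['Welcome to Apache', 'to Apache Solr']
```

Note that the window starts at every position, not only at the first word, so in this sketch an input of "Best Beat Makers" also yields "Beat Makers".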

Below is the analysis for the text you shared.

Input: "Best Beat Makers"