Tags: elasticsearch, elastic-stack, elasticsearch-5

Creating a whitespace character filter


I want to use a custom analyzer with a pattern tokenizer and a custom token filter. But before that step, I want to split the text into tokens on each whitespace. I know I can use the whitespace analyzer, but I also want to use my custom analyzer.

Basically, I want to generate a token on each special character and whitespace in a string.

For example, I have a string "Google's url is https://www.google.com/."

My tokens should be like "Google", "Google'", "Google's", "url", "is", "https", "https:", "https:/", "://", "//www","/www."... and so on.

Basically, I want my tokens to be like n-grams, but only a limited set like the examples above, which breaks only on special characters (one way to approximate this is sketched below).
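For context, one way to get close to this output with built-in components is a whitespace tokenizer combined with an ngram token filter. The following is only a sketch, and the url_ngram_analyzer and url_ngram names are made up:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "url_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["url_ngram"]
                }
            },
            "filter": {
                "url_ngram": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 10
                }
            }
        }
    }
}

With this, each whitespace-separated word such as "Google's" is expanded into substrings like "Go", "Google'", and "Google's".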

My TokenizerFactory file looks like this:

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.PatternTokenizer;
import org.elasticsearch.common.regex.Regex;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;

import java.util.regex.Pattern;

public class UrlTokenizerFactory extends AbstractTokenizerFactory {

    private final Pattern pattern;
    private final int group;

    public UrlTokenizerFactory(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
        super(indexSettings, name, settings);

        // The default pattern splits on any character that is not a letter or a digit.
        String sPattern = settings.get("pattern", "[^\\p{L}\\p{N}]");
        if (sPattern == null) {
            // Unreachable with the default above, but kept as a guard if the default is removed.
            throw new IllegalArgumentException("pattern is missing for [" + name + "] tokenizer of type 'pattern'");
        }

        this.pattern = Regex.compile(sPattern, settings.get("flags"));
        // group -1 splits on each match; a non-negative group emits that capture group as the token.
        this.group = settings.getAsInt("group", -1);
    }

    @Override
    public Tokenizer create() {
        return new PatternTokenizer(pattern, group);
    }
}
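
For reference, a factory like this also has to be registered with Elasticsearch through an analysis plugin before it can be used. A minimal sketch, assuming ES 5.x and a made-up tokenizer name "url_tokenizer":

import java.util.Map;

import org.elasticsearch.index.analysis.TokenizerFactory;
import org.elasticsearch.indices.analysis.AnalysisModule.AnalysisProvider;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;

import static java.util.Collections.singletonMap;

public class UrlAnalysisPlugin extends Plugin implements AnalysisPlugin {

    @Override
    public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
        // "url_tokenizer" is the name to use in index settings; UrlTokenizerFactory::new
        // matches AnalysisProvider's (indexSettings, environment, name, settings) signature.
        return singletonMap("url_tokenizer", UrlTokenizerFactory::new);
    }
}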

My TokenFilterFactory file is currently empty.
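A minimal skeleton for that file could look like the following (a sketch assuming ES 5.x / Lucene 6; the LowerCaseFilter is only a placeholder for whatever filtering logic ends up there):

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;

public class UrlTokenFilterFactory extends AbstractTokenFilterFactory {

    public UrlTokenFilterFactory(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
        super(indexSettings, name, settings);
    }

    @Override
    public TokenStream create(TokenStream tokenStream) {
        // Placeholder: wrap the incoming stream with the real filter implementation.
        return new LowerCaseFilter(tokenStream);
    }
}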


Solution

  • You can simply use the whitespace tokenizer in your custom analyzer definition. Below is an example of a custom analyzer named my_custom_analyzer that uses it; note the "tokenizer": "whitespace" line in the analyzer definition and the title field that references the analyzer.

    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_custom_analyzer": { --> name of custom analyzer
                        "type": "custom",
                        "tokenizer": "whitespace", --> note this
                        "filter": [
                            "lowercase"
                        ]
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "my_custom_analyzer" --> note this
                }
            }
        }
    }
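
    To verify the analyzer's output, you can run the example string through the _analyze API (assuming the index above is named my_index):

    POST my_index/_analyze
    {
        "analyzer": "my_custom_analyzer",
        "text": "Google's url is https://www.google.com/."
    }

    This returns the tokens google's, url, is, and https://www.google.com/., because the whitespace tokenizer splits only on whitespace and the lowercase filter lowercases each token.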