Tags: elasticsearch, elastic-stack, elasticsearch-5

Creating a whitespace character filter


I want to use a custom analyzer with a pattern tokenizer and a custom token filter. But before that step, I want to split the text into tokens on each whitespace. I know I can use the whitespace analyzer, but I also want to use my custom analyzer.

Basically, I want to generate a token on each special character and whitespace in a string.

For example, I have a string "Google's url is https://www.google.com/."

My tokens should be like "Google", "Google'", "Google's", "url", "is", "https", "https:", "https:/", "://", "//www","/www."... and so on.

Basically, I want my tokens to be like n-grams, but only a limited set like the examples above, which breaks only on special characters (one way to approximate this is sketched below).
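For context, one way to get close to this output with built-in components is a whitespace tokenizer combined with an ngram token filter. The following is only a sketch, and the url_ngram_analyzer and url_ngram names are made up:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "url_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["url_ngram"]
                }
            },
            "filter": {
                "url_ngram": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 10
                }
            }
        }
    }
}

With this, each whitespace-separated word such as "Google's" is expanded into substrings like "Go", "Google'", and "Google's".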

My TokenizerFactory file looks like this:

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.PatternTokenizer;
import org.elasticsearch.common.regex.Regex;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;

import java.util.regex.Pattern;

public class UrlTokenizerFactory extends AbstractTokenizerFactory {

    private final Pattern pattern;
    private final int group;

    public UrlTokenizerFactory(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
        super(indexSettings, name, settings);

        // The default pattern splits on any character that is not a letter or a digit.
        String sPattern = settings.get("pattern", "[^\\p{L}\\p{N}]");
        if (sPattern == null) {
            // Unreachable with the default above, but kept as a guard if the default is removed.
            throw new IllegalArgumentException("pattern is missing for [" + name + "] tokenizer of type 'pattern'");
        }

        this.pattern = Regex.compile(sPattern, settings.get("flags"));
        // group -1 splits on each match; a non-negative group emits that capture group as the token.
        this.group = settings.getAsInt("group", -1);
    }

    @Override
    public Tokenizer create() {
        return new PatternTokenizer(pattern, group);
    }
}
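
For reference, a factory like this also has to be registered with Elasticsearch through an analysis plugin before it can be used. A minimal sketch, assuming ES 5.x and a made-up tokenizer name "url_tokenizer":

import java.util.Map;

import org.elasticsearch.index.analysis.TokenizerFactory;
import org.elasticsearch.indices.analysis.AnalysisModule.AnalysisProvider;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;

import static java.util.Collections.singletonMap;

public class UrlAnalysisPlugin extends Plugin implements AnalysisPlugin {

    @Override
    public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
        // "url_tokenizer" is the name to use in index settings; UrlTokenizerFactory::new
        // matches AnalysisProvider's (indexSettings, environment, name, settings) signature.
        return singletonMap("url_tokenizer", UrlTokenizerFactory::new);
    }
}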

My TokenFilterFactory file is currently empty.
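A minimal skeleton for that file could look like the following (a sketch assuming ES 5.x / Lucene 6; the LowerCaseFilter is only a placeholder for whatever filtering logic ends up there):

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;

public class UrlTokenFilterFactory extends AbstractTokenFilterFactory {

    public UrlTokenFilterFactory(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
        super(indexSettings, name, settings);
    }

    @Override
    public TokenStream create(TokenStream tokenStream) {
        // Placeholder: wrap the incoming stream with the real filter implementation.
        return new LowerCaseFilter(tokenStream);
    }
}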


Solution

  • You can simply use the whitespace tokenizer in your custom analyzer definition. Below is an example of a custom analyzer named my_custom_analyzer that uses it; note the "tokenizer": "whitespace" line in the analyzer definition and the title field that references the analyzer.

    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_custom_analyzer": { --> name of custom analyzer
                        "type": "custom",
                        "tokenizer": "whitespace", --> note this
                        "filter": [
                            "lowercase"
                        ]
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "my_custom_analyzer" --> note this
                }
            }
        }
    }
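
    To verify the analyzer's output, you can run the example string through the _analyze API (assuming the index above is named my_index):

    POST my_index/_analyze
    {
        "analyzer": "my_custom_analyzer",
        "text": "Google's url is https://www.google.com/."
    }

    This returns the tokens google's, url, is, and https://www.google.com/., because the whitespace tokenizer splits only on whitespace and the lowercase filter lowercases each token.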