I want to use a custom analyzer with a pattern tokenizer and a custom token filter. But before that step, I want to generate tokens on each whitespace. I know I can use the whitespace analyzer, but I also want to use my custom analyzer.
Basically, I want to generate a token on each special character and whitespace in a string.
For example, I have a string "Google's url is https://www.google.com/."
My tokens should be like "Google", "Google'", "Google's", "url", "is", "https", "https:", "https:/", "://", "//www", "/www."... and so on.
Basically, I want my tokens to be like n-grams, but only a limited set like the ones above, which break only on special characters.
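To give a concrete sketch of the effect I'm after: as far as I understand, it could be approximated by chaining the built-in ngram token filter behind a whitespace tokenizer. All names here (my_index, limited_ngrams, my_ngram_analyzer) and the gram sizes are only illustrative, not what I actually use:

PUT my_index
{
    "settings": {
        "index.max_ngram_diff": 2,
        "analysis": {
            "filter": {
                "limited_ngrams": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 5
                }
            },
            "analyzer": {
                "my_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["limited_ngrams"]
                }
            }
        }
    }
}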
My TokenizerFactory file looks like this:
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.PatternTokenizer;
import org.elasticsearch.common.regex.Regex;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;

import java.util.regex.Pattern;

public class UrlTokenizerFactory extends AbstractTokenizerFactory {

    private final Pattern pattern;
    private final int group;

    public UrlTokenizerFactory(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
        super(indexSettings, name, settings);
        // default pattern: split on any character that is not a letter or a digit
        String sPattern = settings.get("pattern", "[^\\p{L}\\p{N}]");
        if (sPattern == null) {
            throw new IllegalArgumentException("pattern is missing for [" + name + "] tokenizer of type 'pattern'");
        }
        this.pattern = Regex.compile(sPattern, settings.get("flags"));
        // group -1 means "split on matches" rather than "emit the matched group"
        this.group = settings.getAsInt("group", -1);
    }

    @Override
    public Tokenizer create() {
        return new PatternTokenizer(pattern, group);
    }
}
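For completeness, this is how I wire the factory into Elasticsearch through an AnalysisPlugin; a minimal sketch, where the plugin class name UrlAnalysisPlugin and the tokenizer name url_tokenizer are mine, not anything standard:

import java.util.Map;
import static java.util.Collections.singletonMap;
import org.elasticsearch.index.analysis.TokenizerFactory;
import org.elasticsearch.indices.analysis.AnalysisModule.AnalysisProvider;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;

public class UrlAnalysisPlugin extends Plugin implements AnalysisPlugin {

    @Override
    public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
        // register the factory under the tokenizer name "url_tokenizer",
        // so it can be referenced from index analysis settings
        return singletonMap("url_tokenizer", UrlTokenizerFactory::new);
    }
}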
My TokenFilterFactory file is currently empty.
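For now I only have a pass-through skeleton in mind for it, something like the following (the class name UrlTokenFilterFactory is just a placeholder, and it assumes the same constructor shape as the tokenizer factory above):

import org.apache.lucene.analysis.TokenStream;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;

public class UrlTokenFilterFactory extends AbstractTokenFilterFactory {

    public UrlTokenFilterFactory(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
        super(indexSettings, name, settings);
    }

    @Override
    public TokenStream create(TokenStream tokenStream) {
        // pass-through for now; the custom TokenFilter would wrap tokenStream here
        return tokenStream;
    }
}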
You can simply use the whitespace tokenizer in your custom analyzer definition. Below is an example of a custom_analyzer that uses it.
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {        --> name of custom analyzer
                    "type": "custom",
                    "tokenizer": "whitespace", --> note this
                    "filter": [
                        "lowercase"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "my_custom_analyzer" --> note this
            }
        }
    }
}
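You can verify the behavior with the _analyze API. For your example string, this analyzer should emit the tokens google's, url, is, and https://www.google.com/. (split only on whitespace, then lowercased); the index name my_index is assumed:

POST my_index/_analyze
{
    "analyzer": "my_custom_analyzer",
    "text": "Google's url is https://www.google.com/."
}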