I am trying to make an Elasticsearch filter, analyzer and tokenizer to be able to normalize searches like:
"henry&william book"
-> "henrywilliam book"
"henry & william book"
-> "henrywilliam book"
"henry and william book"
-> "henrywilliam book"
"henry william book"
-> "henry william book"
In other words, I would like to normalize my "and" and "&" queries, but also concatenate the words on either side of them.
I'm thinking of making a tokenizer that breaks "henry & william book" into the tokens ["henry & william", "book"], and then making a character filter that applies the following replacements:

" & "   -> ""
" and " -> ""
"&"     -> ""
However, this feels a bit hackish. Is there a better way to do it?
The reason I can't just do this entirely in the analyzer/filter phase is that it runs too late. In my attempts, Elasticsearch has already broken "henry & william" into just ["henry", "william"] before my analyzer/filter runs.
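For example, running the stock standard analyzer over one of the queries (just to illustrate the problem, not my actual configuration) shows that the "&" is already gone by the time any token filter could act on it:

POST _analyze
{
  "analyzer": "standard",
  "text": [
    "henry & william book"
  ]
}

This produces the three separate tokens "henry", "william" and "book", so there is nothing left for a later filter to glue back together.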
You can use a clever mix of two character filters that kick in before the tokenizer. The first character filter would map "and" onto "&", and the second character filter would get rid of the "&" and glue the two neighboring tokens together. This mix would also allow you to introduce other replacements, such as "|" and "or", for instance.
PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "and": {
          "type": "mapping",
          "mappings": [
            "and => &"
          ]
        },
        "&": {
          "type": "pattern_replace",
          "pattern": """(\w+)(\s*&\s*)(\w+)""",
          "replacement": "$1$3"
        }
      },
      "analyzer": {
        "my-analyzer": {
          "type": "custom",
          "char_filter": [
            "and", "&"
          ],
          "tokenizer": "keyword"
        }
      }
    }
  }
}
This would yield the following results:
POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry&william book"
  ]
}

Results =>

{
  "tokens" : [
    {
      "token" : "henrywilliam book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry & william book"
  ]
}

Results =>

{
  "tokens" : [
    {
      "token" : "henrywilliam book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry and william book"
  ]
}

Results =>

{
  "tokens" : [
    {
      "token" : "henrywilliam book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry william book"
  ]
}

Results =>

{
  "tokens" : [
    {
      "token" : "henry william book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
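To use this for actual searches, you can attach the analyzer to a field in the mapping and query that field as usual. Something along these lines should work; the "title" field name is just an example, not part of the setup above:

PUT test/_mapping
{
  "properties": {
    // "title" is just an example field name
    "title": {
      "type": "text",
      "analyzer": "my-analyzer"
    }
  }
}

POST test/_search
{
  "query": {
    "match": {
      "title": "henry and william book"
    }
  }
}

Since the same analyzer is applied at index time and at search time by default, all the "&"/"and" variants normalize to the same token before matching.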