I have an Elasticsearch search engine and I'm adding synonym support to it. Everything works well for single-word synonyms, but things get messy as soon as multi-word synonyms are involved.
For example, I want the query "ice cream" to return every document that mentions "ice cream", "gelato", or "icecream".
My settings and mappings are as follows:
PUT stam_test_1
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "plural_stemmer": {
          "name": "minimal_english",
          "type": "stemmer"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        },
        "english_graph_synonyms": {
          "type": "synonym_graph",
          "tokenizer": "standard",
          "expand": true,
          "synonyms": [
            "ice cream, icecream, creamery, gelato",
            "dim sum, dim sim, dimsim",
            "ube, purple yam",
            "sf, san francisco"
          ]
        },
        "english_synonyms": {
          "type": "synonym",
          "expand": true,
          "tokenizer": "standard",
          "synonyms": [
            "burger, hamburger, slider",
            "chicken, pollo",
            "pork, pig, porc",
            "barbeque, bbq, barbecue",
            "sauce, dressing"
          ]
        }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "plural_stemmer",
            "english_stop",
            "english_stemmer",
            "asciifolding",
            "english_synonyms"
          ]
        },
        "english_search": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "plural_stemmer",
            "english_stop",
            "english_stemmer",
            "asciifolding",
            "english_graph_synonyms"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text_field": {
        "type": "text",
        "fields": {
          "post_text": {
            "type": "text",
            "analyzer": "english",
            "search_analyzer": "english_search"
          }
        }
      }
    }
  }
}
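As a side note, with a multi-field like this the analyzed path is text_field.post_text, not post_text. The _analyze API accepts a field parameter that resolves the analyzer from the mapping, so a quick sanity check against the index above shows which field path actually gets the custom chain:

GET stam_test_1/_analyze
{
  "field": "text_field.post_text",
  "text": "ice cream"
}

Running the same request with "field": "post_text" would fall back to the index's default analyzer, since that path is not mapped here; that turns out to matter below.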
I'm adding a few documents
POST _bulk
{ "index" : { "_index" : "stam_test_1", "_id" : "1" } }
{ "post_text" : "Love this ice cream so much!!!"}
{ "index" : { "_index" : "stam_test_1", "_id" : "2" } }
{ "post_text" : "Great gelato and a tasty burger"}
{ "index" : { "_index" : "stam_test_1", "_id" : "3" } }
{ "post_text" : "I bought coke but did not get any ice with it" }
{ "index" : { "_index" : "stam_test_1", "_id" : "4" } }
{ "post_text" : "ic cream" }
When I query for "ice cream"

GET /stam_test_1/_search
{
  "query": {
    "match": {
      "post_text": {
        "query": "ice cream",
        "analyzer": "english_search"
      }
    }
  }
}
I get the following results
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.6678555,
    "hits" : [
      {
        "_index" : "stam_test_1",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 2.6678555,
        "_source" : {
          "post_text" : "ic cream"
        }
      },
      {
        "_index" : "stam_test_1",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.6931472,
        "_source" : {
          "post_text" : "Great gelato and a tasty burger"
        }
      }
    ]
  }
}
You can see that the document I intentionally added in already-stemmed form, "ic cream", was returned as I suspected, while the first document, "Love this ice cream so much!!!", was not returned at all.
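To dig into why the first document was not matching, the explain API can be pointed at it directly (assuming the ES 7.x URL form, consistent with the _doc responses above):

GET /stam_test_1/_explain/1
{
  "query": {
    "match": {
      "post_text": {
        "query": "ice cream",
        "analyzer": "english_search"
      }
    }
  }
}

For this document it should come back with "matched" : false, confirming that none of the query's token paths exist in the indexed field.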
When I test the search analyzer directly on "ice cream"

GET stam_test_1/_analyze
{
  "analyzer": "english_search",
  "text": "ice cream"
}
It returns
{
  "tokens" : [
    {
      "token" : "icecream",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "softserv",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "icream",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "creameri",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "gelato",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "ic",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "cream",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
The single-word synonyms come back properly, but the multi-word synonyms are stemmed (each token separately), and it seems the actual documents were not stemmed at all (which is why the "ic cream" document matched).
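To confirm that the indexed field really wasn't going through the english analyzer, the _analyze API can be pointed at the field path the documents were actually written to (this assumes post_text got dynamically mapped when the bulk request ran):

GET stam_test_1/_analyze
{
  "field": "post_text",
  "text": "Love this ice cream so much!!!"
}

With a dynamically mapped post_text, this returns plain standard-analyzer tokens (love, this, ice, cream, so, much), with no stemming and no synonyms, which is consistent with the results above.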
I'm sure this is simply a setting definition gone wrong somewhere. I tried replacing the english_search analyzer's tokenizer with "keyword" instead of "standard", but had no luck with that either.
Any suggestions on how to deal with this problem? The synonym_graph filter has very little documentation and few search results to go on.
So, my mistake was in the mappings definition. I shouldn't have defined post_text as a sub-field under text_field: the bulk request indexed the documents into a top-level post_text field, which was dynamically mapped with the default standard analyzer (no stemming, no synonyms), while my custom analyzers applied only to text_field.post_text. All I had to do was use the following mappings, and everything works properly this way:
"mappings": {
"properties": {
"post_text": {
"type": "text",
"analyzer": "english",
"search_analyzer": "english_search"
}
}
}
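With this mapping in place, the original query works, and the explicit analyzer override in the match clause is no longer needed, since search_analyzer is applied automatically:

GET /stam_test_1/_search
{
  "query": {
    "match": {
      "post_text": "ice cream"
    }
  }
}

This should now return the "Love this ice cream so much!!!" document, the gelato one, and the pre-stemmed "ic cream" one.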