How to make an Elasticsearch fuzzy query match a phrase typed without a space


I can find the target document when the search text has a space between the words:

City Lab

The query:

curl -XGET http://localhost:9200/companies_company_data3/_search -H 'Content-Type: application/json' -d '{
                    "query": {
                      "bool": {
                        "must": {
                          "match": {
                            "name": {
                              "fuzziness": "AUTO",
                              "query": "City Lab"
                            }
                          }
                        }
                      }
                    },
                    "size": 5
                  }'

It gives the expected result:

{
  "took": 189,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 7463,
    "max_score": 16.600964,
    "hits": [
      {
        "_index": "companies_company_data3",
        "_type": "_doc",
        "_id": "3232333",
        "_score": 16.600964,
        "_source": {
          "sourceId": "22",
          "regionName": "US",
          "name": "City Lab",
          "id": "3232333"
        }
      },

But when I remove the space between the two words and search for CityLab, the document is not found. The full query:

curl -XGET http://localhost:9200/companies_company_data3/_search -H 'Content-Type: application/json' -d '{
                    "query": {
                      "bool": {
                        "must": {
                          "match": {
                            "name": {
                              "fuzziness": "AUTO",
                              "query": "CityLab"
                            }
                          }
                        }
                      }
                    },
                    "size": 5
                  }'

How can I modify the fuzzy query so that the company name "City Lab" is found when the user enters "CityLab"?

My index mapping:

curl -XGET http://localhost:9200/companies_company_data3/_mapping -H 'Content-Type: application/json'

returns

{
  "companies_company_data3": {
    "mappings": {
      "_doc": {
        "properties": {
          "id": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "regionName": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "sourceId": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

Solution

  • The short answer is: search the keyword subfield, not the text field.

    "match": {
      "name.keyword": {
        "fuzziness": "AUTO",
        "query": "CityLab"
      }
    }
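For reference, here is a minimal sketch of the full request body with only the field name changed, built in Python. The helper name build_fuzzy_keyword_query is hypothetical; the bool/must wrapper and the size of 5 are copied from the question.

```python
import json


def build_fuzzy_keyword_query(user_input: str, size: int = 5) -> dict:
    """Build the question's search body, targeting name.keyword instead of name."""
    return {
        "query": {
            "bool": {
                "must": {
                    "match": {
                        # Only change from the question: the keyword subfield.
                        "name.keyword": {
                            "fuzziness": "AUTO",
                            "query": user_input,
                        }
                    }
                }
            }
        },
        "size": size,
    }


# Print the body as JSON so it can be inspected or sent to the _search endpoint.
print(json.dumps(build_fuzzy_keyword_query("CityLab"), indent=2))
```

The printed JSON can be passed to curl with -d, exactly as in the question.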
    

    More details:

    1) Saving the document

    First, it helps to know how the value is indexed. When "City Lab" is indexed as a keyword, Elasticsearch stores it unchanged as a single term, so a search for the exact term "City Lab" finds it.

    When it is indexed as text, the standard analyzer splits and lowercases it, so Elasticsearch stores two terms: "city" and "lab".

    2) Searching for the document

    When you use a match query, the standard analyzer splits "City Lab" into "city" and "lab" and the query searches for these two terms. When you add fuzziness, it is applied to each term separately.

    When you search for "CityLab", the analyzer only lowercases it to "citylab", and the query searches for that single term.
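The difference between the two inputs can be illustrated with a rough Python stand-in for the standard analyzer. This is only a sketch: the function name approximate_standard_analyzer is made up for illustration, and the real analyzer follows Unicode word-boundary rules rather than this simple regex.

```python
import re


def approximate_standard_analyzer(text: str) -> list[str]:
    """Rough approximation: split on non-word characters, then lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


print(approximate_standard_analyzer("City Lab"))  # ['city', 'lab']
print(approximate_standard_analyzer("CityLab"))   # ['citylab']
```

To see the tokens Elasticsearch really produces, the _analyze API of your cluster is the authoritative source.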

    3) How the query works

    So when you write:

    "match": {
      "name": {
        "fuzziness": "AUTO",
        "query": "CityLab"
      }
    }
    

    and the mapping is:

    "name": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
    

    So the mapping has a name field of type text and a subfield called keyword of type keyword.


    First search case: when you search for "City Lab", the query text is analyzed into "city" and "lab", and these terms are looked up in the name field, which is of type text. You get results because the document has the terms "city" and "lab" in the name field.


    Second search case: when you search for "CityLab", the query text is analyzed into the single term "citylab", which is looked up in the name field, which is of type text. You get no results because the document only has the terms "city" and "lab" in that field, and neither is close enough: fuzziness AUTO allows at most 2 edits for a term longer than 5 characters.

    "citylab" -> "city" -> 3 changes
    "citylab" -> "lab" -> 4 changes

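To see why neither term matches, here is a minimal sketch of the limit that fuzziness AUTO applies, assuming the default thresholds (terms of 1-2 characters must match exactly, 3-5 characters allow one edit, longer terms allow two). The helper auto_max_edits is hypothetical, not an Elasticsearch API, and the required edit counts are the ones given above.

```python
def auto_max_edits(term: str) -> int:
    """Edits permitted by fuzziness AUTO, assuming the default 3,6 thresholds."""
    length = len(term)
    if length <= 2:
        return 0
    if length <= 5:
        return 1
    return 2


query_term = "citylab"
allowed = auto_max_edits(query_term)

# Edit counts quoted in the answer for each indexed term.
required_edits = {"city": 3, "lab": 4}

for indexed_term, edits in required_edits.items():
    verdict = "match" if edits <= allowed else "no match"
    print(f"{query_term} -> {indexed_term}: needs {edits}, allowed {allowed}: {verdict}")
```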

    4) Solution

    Search the keyword subfield instead of the text field, because the keyword subfield contains "City Lab" as one term.

    "match": {
      "name.keyword": {
        "fuzziness": "AUTO",
        "query": "CityLab"
      }
    }
    

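    When you search for "CityLab" here, the comparison is:

    "CityLab" -> "City Lab" -> 1 change

    Keep in mind that with this mapping the keyword subfield is stored without lowercasing, so differences in case also count as changes: "citylab" would need 3 changes to become "City Lab", which is more than AUTO allows.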


    Another solution is to change the analyzer, although I guess this is not what you are looking for. In general, it means replacing the standard analyzer with a custom one so that the text is indexed together with its spaces.


    A further solution is a wildcard query, where you search for "City*Lab", but I don't think that is what you are looking for either.


    Note that fuzziness counts changes as an edit distance, which is the number of one-character changes needed to turn one term into another. These changes can include:

    1. Changing a character (box → fox)
    2. Removing a character (black → lack)
    3. Inserting a character (sic → sick)
    4. Transposing two adjacent characters (act → cat)
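The four operations can be sketched in Python as an optimal string alignment distance. This is an illustration only, not Lucene's actual implementation, which may differ in corner cases; the function name edit_distance is an assumption made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Count substitutions, deletions, insertions and adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # removing a character
                d[i][j - 1] + 1,         # inserting a character
                d[i - 1][j - 1] + cost,  # changing a character (or keeping it)
            )
            # Transposing two adjacent characters counts as a single change.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


pairs = [("box", "fox"), ("black", "lack"), ("sic", "sick"), ("act", "cat"),
         ("CityLab", "City Lab"), ("citylab", "city"), ("citylab", "lab")]
for source, target in pairs:
    print(f"{source} -> {target}: {edit_distance(source, target)}")
```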