elasticsearch lucene elasticsearch-phonetic

Why is a phonetic search so much slower than a normal match query?

Summary: I'm trying to understand why two queries that seem very similar in complexity are vastly different in execution speed.

I'm using Elasticsearch 6.4 and have a name field that I would like to run phonetic queries on.

As an example, I profiled a phonetic query for the search term "Mario" and found that Lucene executes it in the background as a SynonymQuery:

        "type": "SynonymQuery",
        "description": "Synonym(person.firstName.phonetic:mYrio person.firstName.phonetic:mari person.firstName.phonetic:mario person.firstName.phonetic:mori person.firstName.phonetic:morio)",

and it takes around 200ms to do so on an index with ~15 million records.

Since it seemed to convert my single search term into 5 synonyms, I thought "well, what if I search for the same 5 terms without phonetic? Will it be similarly slow?" or, in other words, "is it not the phonetic part that makes it slow, but the fact that it has to search for several synonyms?"

But it turns out that if I query the field without phonetic for "mario mYrio mari mori morio", it results in a BooleanQuery (with one term query per synonym as children):

        "type": "BooleanQuery",
        "description": "person.firstName:mario person.firstName:mYrio person.firstName:mari person.firstName:mori person.firstName:morio",

that takes only 1/10th of the time. Please note: I know and understand that those two queries give different results. I'm not trying to simulate a phonetic search with the second query. I just wanted to see if it would be slow as well, because it seemed to be a query of similar complexity.

For someone like me, who only recently started using Elasticsearch, those two queries look very similar in complexity (search for 5 terms with an OR operator), and I can't understand why one is so much slower than the other.
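To compare the two profile outputs side by side, it can help to flatten the profile section of each response into rows. The following is a minimal Python sketch under some assumptions: it expects the usual profile layout (`profile.shards[].searches[].query[]`, each node carrying `type`, `description`, `time_in_nanos` and optional `children`), the helper names are my own, and the sample response with its timings is made up purely to show the shape.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_profiled_queries(response: Dict[str, Any]) -> Iterator[Tuple[int, str, str, int]]:
    """Yield (depth, type, description, time_in_nanos) for every profiled query node."""

    def walk(node: Dict[str, Any], depth: int):
        yield depth, node["type"], node["description"], node["time_in_nanos"]
        # Child queries (e.g. the TermQuery clauses of a BooleanQuery) are nested.
        for child in node.get("children", []):
            yield from walk(child, depth + 1)

    for shard in response["profile"]["shards"]:
        for search in shard["searches"]:
            for query in search["query"]:
                yield from walk(query, 0)


def total_query_nanos(response: Dict[str, Any]) -> int:
    """Sum the top-level query times over all shards (children are not added again)."""
    return sum(t for depth, _, _, t in iter_profiled_queries(response) if depth == 0)


# Hypothetical, hand-written response fragment; the numbers are NOT measurements.
sample = {
    "profile": {
        "shards": [
            {
                "searches": [
                    {
                        "query": [
                            {
                                "type": "BooleanQuery",
                                "description": "person.firstName:mario person.firstName:mari",
                                "time_in_nanos": 1200,
                                "children": [
                                    {
                                        "type": "TermQuery",
                                        "description": "person.firstName:mario",
                                        "time_in_nanos": 700,
                                    },
                                    {
                                        "type": "TermQuery",
                                        "description": "person.firstName:mari",
                                        "time_in_nanos": 400,
                                    },
                                ],
                            }
                        ]
                    }
                ]
            }
        ]
    }
}

for depth, qtype, description, nanos in iter_profiled_queries(sample):
    print("  " * depth + f"{qtype}: {nanos} ns  ({description})")
```

Only top-level nodes are summed because, as far as I understand the profile output, a parent's time already includes its children. The sum across shards is also not wall-clock time, since shards are searched in parallel.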

Any insight would be much appreciated!

Thanks in advance!

Regards, Mario

P.S.: I realised it will probably help if I include the two queries I used in this example:

First query (phonetic):

{
  "profile": true,
  "size": 1,
  "timeout": "10s",
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "person.firstName.phonetic": {
              "query": "mario",
              "operator": "OR",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": true,
              "boost": 1
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  }
}

Second query (non-phonetic):

{
  "profile": true,
  "size": 1,
  "timeout": "10s",
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "person.firstName": {
              "query": "mario myrio mari mori morio",
              "operator": "OR",
              "fuzziness": "0",
              "prefix_length": 3,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": true,
              "boost": 1
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  }
}

Solution

I would say the difference between the two is fairly clear: the rewrite process, that is, expanding the term mario into the synonyms that exist. That expansion has to go through SynonymGraphFilter, which I believe reads its synonym data from disk, and that makes things slower.

In the case of the boolean query, the match goes through a different analyzer chain (which I believe is the same as the phonetic one, but without synonyms).
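One way to check which analyzer chain each field really goes through is to ask the `_analyze` API what tokens it produces for "mario" on `person.firstName` versus `person.firstName.phonetic`. The following is a rough Python sketch: the base URL and the index name `people` are placeholders I made up, and the sample response is hand-built from the terms in the profile output in the question rather than copied from a real cluster. As far as I know, several tokens sharing one position is what makes Lucene's query builder emit a SynonymQuery instead of separate term clauses.

```python
import json
import urllib.request
from collections import defaultdict


def build_analyze_request(base_url, index, field, text):
    """Build (but do not send) a POST to <base_url>/<index>/_analyze for one field."""
    body = json.dumps({"field": field, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_by_position(analyze_response):
    """Group the tokens of an _analyze response by their position."""
    grouped = defaultdict(list)
    for tok in analyze_response["tokens"]:
        grouped[tok["position"]].append(tok["token"])
    return dict(grouped)


# Placeholder URL and index name; nothing is sent until urlopen() is called.
request = build_analyze_request(
    "http://localhost:9200", "people", "person.firstName.phonetic", "mario"
)
# To actually run it against a cluster:
# with urllib.request.urlopen(request) as resp:
#     print(tokens_by_position(json.load(resp)))

# Hand-built illustration of the expected response shape (not real output).
sample_response = {
    "tokens": [
        {"token": "mYrio", "position": 0},
        {"token": "mari", "position": 0},
        {"token": "mario", "position": 0},
        {"token": "mori", "position": 0},
        {"token": "morio", "position": 0},
    ]
}
print(tokens_by_position(sample_response))
```

If the phonetic field shows all five tokens at position 0 while the plain field shows a single token, that would account for the SynonymQuery versus BooleanQuery difference in the two profiles, though it does not by itself explain the timing gap.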