ElasticSearch: strange search behaviour when using snowball analyzer


So let's say I have an ElasticSearch index defined like this:

curl -XPUT 'http://localhost:9200/test' -d '{
  "mappings": {
    "example": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "snowball"
        }
      }
    }
  }
}'

curl -XPUT 'http://localhost:9200/test/example/1' -d '{
  "text": "foo bar organization"
}'

When I search for "foo organizations" with the snowball analyzer, both keywords match as expected:

curl -XGET http://localhost:9200/test/example/_search -d '{
  "query": {
    "text": {
      "_all": {
        "query": "foo organizations",
        "analyzer": "snowball"
      }
    }
  },
  "highlight": {
    "fields": {
      "text": {}
    }
  }
}'

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.015912745,
    "hits": [
      {
        "_index": "test",
        "_type": "example",
        "_id": "1",
        "_score": 0.015912745,
        "_source": {
          "text": "foo bar organization"
        },
        "highlight": {
          "text": [
            "<em>foo</em> bar <em>organization</em>"
          ]
        }
      }
    ]
  }
}

But when I search for only "organizations", I don't get any results at all, which is very weird:

curl -XGET http://localhost:9200/test/example/_search -d '{
  "query": {
    "text": {
      "_all": {
        "query": "organizations",
        "analyzer": "snowball"
      }
    }
  },
  "highlight": {
    "fields": {
      "text": {}
    }
  }
}'

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

However, if I search for "bars" it still hits:

curl -XGET http://localhost:9200/test/example/_search -d '{
  "query": {
    "text": {
      "_all": {
        "query": "bars",
        "analyzer": "snowball"
      }
    }
  },
  "highlight": {
    "fields": {
      "text": {}
    }
  }
}'

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.10848885,
    "hits": [
      {
        "_index": "test",
        "_type": "example",
        "_id": "1",
        "_score": 0.10848885,
        "_source": {
          "text": "foo bar organization"
        },
        "highlight": {
          "text": [
            "foo <em>bar</em> organization"
          ]
        }
      }
    ]
  }
}

I guess the difference between "bar" and "organization" is that "organization" is stemmed to "organ" while "bar" is stemmed to itself. But how do I get the proper behaviour so that the second search hits?


Solution

  • The text "foo bar organization" is indexed twice: in the field text and in the field _all. The field text uses the snowball analyzer, while the field _all uses the standard analyzer. So after analysis of the test record, the field _all contains the tokens "foo", "bar", and "organization". At search time, the snowball analyzer specified in the query converts "foo" into "foo", "bars" into "bar", and "organizations" into "organ". The terms "foo" and "bar" therefore match the test record in _all, while "organ" does not match the indexed token "organization". Highlighting is performed per field, independently of searching, which is why the word "organization" is highlighted in the first result.
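
One way to make the second search hit is to run the query against the text field instead of _all, so that index-time and search-time analysis both go through snowball. This is a minimal sketch that assumes the index, mapping and document are exactly as in the question. The QUERY variable and the fallback message are only conveniences introduced for this example:

```shell
# Query body that targets the "text" field (analyzed with snowball at index
# time) instead of "_all" (analyzed with the standard analyzer by default).
QUERY='{
  "query": {
    "text": {
      "text": {
        "query": "organizations",
        "analyzer": "snowball"
      }
    }
  },
  "highlight": {
    "fields": {
      "text": {}
    }
  }
}'

# Same endpoint as in the question; only the queried field has changed.
curl -s -XGET 'http://localhost:9200/test/example/_search' -d "$QUERY" \
  || echo "request failed (is Elasticsearch running on localhost:9200?)"
```

With this query, both the indexed token and the query term should be reduced to "organ", so the document should be returned. Since the mapping already assigns snowball to text, the explicit "analyzer" entry in the query is most likely redundant. If searching _all is a requirement, the alternative would be to give _all the same analyzer in the mapping and reindex. Check the mapping documentation for your Elasticsearch version before relying on that.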