
Path Hierarchy Tokenizer in Elasticsearch not working properly


For my project of analyzing access logs I need to make the Path Hierarchy Tokenizer work. The thing is that the analyzer itself seems to be working fine, just not with my indexed data. I have a feeling that something with the mapping might be wrong.

Note: I am working with Elasticsearch 5.6, and upgrading is not an option. I have previously made the mistake of using syntax that was not yet available in 5.6, so there may be something wrong with the syntax here as well. I have not been able to spot a mistake, though.

This is part of my custom template:

{
"template": "beam-*",
"order" : 20,
"settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "custom_path_tree": {
          "tokenizer": "custom_hierarchy"
        },
        "custom_path_tree_reversed": {
          "tokenizer": "custom_hierarchy_reversed"
        }
      },
      "tokenizer": {
        "custom_hierarchy": {
          "type": "path_hierarchy",
          "delimiter": "/"
        },
        "custom_hierarchy_reversed": {
          "type": "path_hierarchy",
          "delimiter": "/",
          "reverse": "true"
        }
      }
    }
  },

And this is the mapping. The object field contains paths. I want to be able to search object.tree and object.tree_reversed to identify the most visited categories in an online shop.

 "mappings": {
    "logs": {
    "properties": {
      "object": {
        "type": "text",
        "fields": {
          "tree": {
            "type": "text",
            "analyzer": "custom_path_tree"
          },
          "tree_reversed": {
            "type": "text",
            "analyzer": "custom_path_tree_reversed"
          }
        }
      },

When I try this

POST beam-2019.07.02/_analyze
{
  "analyzer": "custom_path_tree",
  "text": "/belletristik/science-fiction/postapokalypse"
}

I get this

{
  "tokens": [
    {
      "token": "/belletristik",
      "start_offset": 0,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "/belletristik/science-fiction",
      "start_offset": 0,
      "end_offset": 29,
      "type": "word",
      "position": 0
    },
    {
      "token": "/belletristik/science-fiction/postapokalypse",
      "start_offset": 0,
      "end_offset": 44,
      "type": "word",
      "position": 0
    }
  ]
}

The analyzer itself seems to be working perfectly fine and is doing what it is supposed to do.

Yet when I try to build a query

GET beam-2019.07.03/_search
{
  "query": {
    "term": {
      "object.tree": "/belletristik/"
    }
  }
}

I get no results, although there should be a few hundred.

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

Maybe my query is wrong. Or something with the mapping doesn't add up?


Solution

  • The term query does not analyze its input at query time, so it looks for the exact token /belletristik/. In the analyzer output, the generated token is /belletristik, with no trailing slash. The input term therefore matches none of the indexed tokens, and no documents are returned.

    Modify the query as below:

    GET beam-2019.07.03/_search
    {
      "query": {
        "term": {
          "object.tree": "/belletristik"
        }
      }
    }
    

    Alternatively, use a match query if you don't want to change the input string. A match query runs the field's analyzer on /belletristik/, which yields the token /belletristik (the same token that was indexed), so the documents match.

    GET beam-2019.07.03/_search
    {
      "query": {
        "match": {
          "object.tree": "/belletristik/"
        }
      }
    }
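The difference between the two queries can be illustrated with a small Python model. This is only a rough sketch, not Elasticsearch's implementation: `path_hierarchy_tokens`, `term_matches` and `match_matches` are hypothetical helpers, and the tokenizer stand-in simply emits every prefix that ends just before a delimiter, plus the full string. How the real tokenizer treats a trailing delimiter is an assumption here. For the path in the question, the stand-in reproduces the three tokens shown in the `_analyze` output.

```python
def path_hierarchy_tokens(text, delimiter="/"):
    """Simplified stand-in for the path_hierarchy tokenizer (assumption, not the real code)."""
    tokens = []
    for i, ch in enumerate(text):
        # Emit the prefix that ends just before each delimiter (skip a leading one).
        if ch == delimiter and i > 0:
            tokens.append(text[:i])
    if text:
        # The full input is always the last token.
        tokens.append(text)
    return tokens


def term_matches(query, indexed_tokens):
    # term query: the query string is compared to the indexed tokens as-is.
    return query in indexed_tokens


def match_matches(query, indexed_tokens):
    # match query: the query string is analyzed first; with the default OR
    # behavior, one shared token is enough for a hit.
    return any(tok in indexed_tokens for tok in path_hierarchy_tokens(query))


# Tokens stored for one document's object.tree field.
indexed = set(path_hierarchy_tokens("/belletristik/science-fiction/postapokalypse"))

print(sorted(indexed))
print(term_matches("/belletristik/", indexed))   # trailing slash: no such token
print(term_matches("/belletristik", indexed))    # exact token exists
print(match_matches("/belletristik/", indexed))  # analyzed query shares a token
```

In this model the term lookup with the trailing slash fails, while the term lookup without it and the match lookup both succeed, which mirrors the behavior described above.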