elasticsearch hierarchical-data denormalization

How to denormalize hierarchy in ElasticSearch?

I am new to ElasticSearch and I have a tree, which describes a path to a certain document (not real filesystem paths, just simple text fields categorizing articles, images, documents as one). Each path entry has a type, like.: Group Name, Assembly name or even Unknown. The types could be used in queries to skip certain entries in the path for example.

My source data is stored in SQL Server, the schema looks something like this:

Tree builds up by connecting the Tree.Id to Tree.ParentId, but each node must have a type. The Documents are connected to a leaf in the Tree.

I am not worried about querying the structure in SQL Server, however I should find an optimal approach to denormalize and search them in Elastic. If I flatten the paths and make a list of "descriptors" for a document, I can store each of the Document entries as an Elastic Document.:

{
  "path": "NodeNameRoot/NodeNameLevel_1/NodeNameLevel_2/NodeNameLevel_3/NodeNameLevel_4",
  "descriptors": [
    {
      "name": "NodeNameRoot",
      "type": "type1"
    },
    {
      "name": "NodeNameLevel_1",
      "type": "type1"
    },
    {
      "name": "NodeNameLevel_2",
      "type": "type2"
    },
    {
      "name": "NodeNameLevel_3",
      "type": "type2"
    },
    {
      "name": "NodeNameLevel_4",
      "type": "type3"
    }
  ],
  "document": {
    ...
  }
}

Can I query such a structure in ElasticSearch? Or Should I denormalize the paths in a different way?

My main questions:

Can query them based on type or text value (regex matching for example). For example: Give me all the type2->type3 paths (practically leave the type1 out), where the path contains X?

Is it possible to query based on levels? Like I would like the paths where there are 4 descriptors.

Can I do the searching with the built-in functionality or do I need to write an extension?

Edit Based on G Quintana 's anwser, I made an index like this.:

curl -X PUT \
  http://localhost:9200/test \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
  "mappings": {
    "path": {
      "properties": {
        "names": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            },
            "tokens": {
              "type": "text",
              "analyzer": "pathname_analyzer"
            },
            "depth": {
              "type": "token_count",
              "analyzer": "pathname_analyzer"
            }
          }
        },
        "types": {
          "type": "text",
          "fields": {
            "raw": {
                "type": "keyword"
            },
            "tokens": {
                "type": "text",
                "analyzer": "pathname_analyzer"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "pathname_analyzer": {
          "type": "pattern",
          "pattern": "#->>",
          "lowercase": true
        }
      }
    }
  }
}'

And could query the depth like this.:

curl -X POST \
  http://localhost:9200/test/path/_search \
  -H 'content-type: application/json' \
  -d '{
    "query": {
        "bool": {
            "should": [
                {"match": { "names.depth": 5 }}
            ]
        }
    }
}'

Which return correct results. I will test it a little more.

Solution

First of all you should identify all your query patterns to design how you will index your data.

From the example you gave, I would index documents of the form:

{
  "path": "NodeNameRoot/NodeNameLevel_1/NodeNameLevel_2/NodeNameLevel_3/NodeNameLevel_4",
  "types: "type1/type1/type2/type2/type3",
  "document": {
    ...
  }
}

Before indexing, you must configure mapping and analysis:

Field path:
- use type text + analyzer based on pattern analyzer to split at / characters
- use type token_count + same analyzer to compute path depth. Create a multi field (path.depth)
Field types
- use type text + analyzer based on pattern analyzer to split at / characters

Configure index mappings and analysis to split the path and types fields and the , use a or a

Give me all the type2->type3 paths use a match_phrase query on the types field
where the path contains X use match query on the path field
where there are 4 descriptors use term query on path.depth sub field

Your descriptors field is not interesting. The Path tokenizer might be interesting for some usecases. You can apply multiple analyzer on the same field using multi-fields and then query if sub fields.