Search code examples
elasticsearchhierarchical-datadenormalization

How to denormalize hierarchy in ElasticSearch?


I am new to ElasticSearch and I have a tree, which describes a path to a certain document (not real filesystem paths, just simple text fields categorizing articles, images, documents as one). Each path entry has a type, like.: Group Name, Assembly name or even Unknown. The types could be used in queries to skip certain entries in the path for example.

My source data is stored in SQL Server, the schema looks something like this: enter image description here

Tree builds up by connecting the Tree.Id to Tree.ParentId, but each node must have a type. The Documents are connected to a leaf in the Tree.

I am not worried about querying the structure in SQL Server, however I should find an optimal approach to denormalize and search them in Elastic. If I flatten the paths and make a list of "descriptors" for a document, I can store each of the Document entries as an Elastic Document.:

{
  "path": "NodeNameRoot/NodeNameLevel_1/NodeNameLevel_2/NodeNameLevel_3/NodeNameLevel_4",
  "descriptors": [
    {
      "name": "NodeNameRoot",
      "type": "type1"
    },
    {
      "name": "NodeNameLevel_1",
      "type": "type1"
    },
    {
      "name": "NodeNameLevel_2",
      "type": "type2"
    },
    {
      "name": "NodeNameLevel_3",
      "type": "type2"
    },
    {
      "name": "NodeNameLevel_4",
      "type": "type3"
    }
  ],
  "document": {
    ...
  }
}

Can I query such a structure in ElasticSearch? Or Should I denormalize the paths in a different way?

My main questions:

Can query them based on type or text value (regex matching for example). For example: Give me all the type2->type3 paths (practically leave the type1 out), where the path contains X?

Is it possible to query based on levels? Like I would like the paths where there are 4 descriptors.

Can I do the searching with the built-in functionality or do I need to write an extension?

Edit Based on G Quintana 's anwser, I made an index like this.:

curl -X PUT \
  http://localhost:9200/test \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
  "mappings": {
    "path": {
      "properties": {
        "names": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            },
            "tokens": {
              "type": "text",
              "analyzer": "pathname_analyzer"
            },
            "depth": {
              "type": "token_count",
              "analyzer": "pathname_analyzer"
            }
          }
        },
        "types": {
          "type": "text",
          "fields": {
            "raw": {
                "type": "keyword"
            },
            "tokens": {
                "type": "text",
                "analyzer": "pathname_analyzer"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "pathname_analyzer": {
          "type": "pattern",
          "pattern": "#->>",
          "lowercase": true
        }
      }
    }
  }
}'

And could query the depth like this.:

curl -X POST \
  http://localhost:9200/test/path/_search \
  -H 'content-type: application/json' \
  -d '{
    "query": {
        "bool": {
            "should": [
                {"match": { "names.depth": 5 }}
            ]
        }
    }
}'

Which return correct results. I will test it a little more.


Solution

  • First of all you should identify all your query patterns to design how you will index your data.

    From the example you gave, I would index documents of the form:

    {
      "path": "NodeNameRoot/NodeNameLevel_1/NodeNameLevel_2/NodeNameLevel_3/NodeNameLevel_4",
      "types: "type1/type1/type2/type2/type3",
      "document": {
        ...
      }
    }
    

    Before indexing, you must configure mapping and analysis:

    • Field path:
    • Field types

    Configure index mappings and analysis to split the path and types fields and the , use a or a

    1. Give me all the type2->type3 paths use a match_phrase query on the types field
    2. where the path contains X use match query on the path field
    3. where there are 4 descriptors use term query on path.depth sub field

    Your descriptors field is not interesting. The Path tokenizer might be interesting for some usecases. You can apply multiple analyzer on the same field using multi-fields and then query if sub fields.