I've recently decided to make an attempt to reindex an existing denormalized index to a new index with parent-chid relation. I've around 14M parent docs, each parent has up to 400 children.(total of around 270M docs)
This is a simplified version of my mapping ->
{
"mappings": {
"_doc": {
"properties": {
"product_type": {
"type": "keyword"
},
"relation_type": {
"type": "join",
"eager_global_ordinals": true,
"relations": {
"product_data": [
"kpi",
"customer"
]
}
},
"rootdomain": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"rootdomain_sku": {
"type": "keyword",
"eager_global_ordinals": true
},
"sales_1d": {
"type": "float"
},
"sku": {
"type": "keyword",
"eager_global_ordinals": true
},
"timestamp": {
"type": "date",
"format": "strict_date_optional_time_nanos"
}
}
}
}
}
As you can see I've used eager_global_ordinals for the join relation to speed up search performance
(per my understanding this causes some of the join relation computation in global ordinals to be done in indexing time and not while querying).
This migration process helped me reduce my index size from around 500GB to just 40GB. It has a huge benefit for my use case since I update a lot of data daily.
My current testing environment is using a single node, and the index has only 1 primary shard. Trying to run the following aggregation, seems like it runs forever -
{
"aggs": {
"skus_sales": {
"aggs": {
"sales1": {
"children": {
"type": "kpi"
},
"aggs": {
"sales2": {
"filter": {
"range": {
"timestamp": {
"format": "basic_date_time_no_millis",
"gte": "20220601T000000Z",
"lte": "20220605T235959Z"
}
}
},
"aggs": {
"sales3": {
"sum": {
"field": "sales_1d"
}
}
}
}
}
}
},
"terms": {
"field": "rootdomain_sku",
"size": 10
}
}
},
"query": {
"bool": {
"filter": [
{
"term": {
"rootdomain.keyword": "some_domain"
}
},
{
"term": {
"product_type": "Rugs"
}
}
]
}
}
}
I understand the cons of parent-child relations, but it seems like I'm doing something wrong. I would expect to get some result, even after 15 minutes, but it seems to run forever.
I would love to get some help here, Thanks.
Seems like the issue is using a single shard, by increasing the # of primary shards (1->4) i've managed to gain some performance boost, but it still runs for a very(!) long time.
Seems like parent-child relation query performance does not meet my requirements so I'm trying to use nested objects instead - by doing so updating/indexing time will increase but I'll gain search/aggregation performance boost.