Search code examples
elasticsearch

Elastic Search - Search using multiple indexes with documents that have different data structures


I'm working with two different indexes, and different data structures in each index. Both indexes have a matching property on the documents, this property is a unique identifier of these documents, so bellow you can see the structures of the documents and how a document should look like in case it is complete.

I need to create filters and metrics (metrics can be done in the code later on if necessary), that removes from query, the documents that don't fully match the conditions, but the properties are spread within this two indexes, so when I filter by status code 2, the pair document in the other index should not be included in the search.

https://elasticsearch.*.com.br/index_a,index_b/_search

index_a

{"matching_pair_key": "1", "status_code": 2},
{"matching_pair_key": "2", "status_code": 2},
{"matching_pair_key": "2", "status_code": 1},

index_b

{"matching_pair_key": "1", "age": "31"},
{"matching_pair_key": "2", "age": "33"},
{"matching_pair_key": "3", "age": "18"},
{"matching_pair_key": "4", "age": "52"},

Complete document

{"matching_pair_key": "1", "age": "31", "status_code": 2}

I need to create a query that "join" and retrieves these documents, as well as filtering them, such as:

{
    "must": [
        "match": {
            "status_code": 2 
        }
    ]
}

Should respond:

  • This whould be the ideal thing for me, but I understand that is no easy, or maybe even impossible.
{"matching_pair_key": "1", "status_code": 2, "age": "31"},
{"matching_pair_key": "2", "status_code": 2, "age": "33"},
{"matching_pair_key": "2", "status_code": 2, "age": "33"}

I tried doing an aggregation that join these documents in buckets:

"aggs": {
    "joined_docs": {
        "terms": {
            "field": "matching_pair_key.keyword"
        },
        "aggs": {
            "_top_hits_agg": {
                "top_hits": {
                    "_source": {
                        "includes": [
                            "matching_pair_key",
                            "status_code",
                            "age"
                        ]
                    },
                    "size": 10
                }
            }
        }
    }
}

But when I do that, I can't use filters on my must statement, because if I filter for a property that don't exists in the matching document, that document wont be included in the result.

I could fix this if I knew a way of including the doc on the search if it doesn't have that property that i'm trying to filter, but I don't know how. And I want to avoid using bucket aggregation if I can, because the database is quite large.

I'm open for suggestion on how to approach this issue, in my experience I only thought about using the aggregations to work with this data, but I'm afraid of how much this will cost in terms of processing and how long it takes.

Key points:

  • It is assured to be a matching property for every doc. (matching_pair_key)
  • Very large database
  • I can do using buckets, but the filters has been constraining so far (and avoiding bucket will make it easier on the code later on).
  • I'm open for any sort of suggestion

Solution

  • you're looking for a join, which Elasticsearch can't do

    your best option would be to merge these two indices together. this could be achieved using reindex via an ingest pipeline, that does an enrich - https://www.elastic.co/guide/en/elasticsearch/reference/current/enrich-processor.html