
Elasticsearch DSL query - Get all matching results


I am trying to search an index using a DSL query. Many documents match both the log criterion and the timestamp range.
I pass in the dates and convert them to epoch milliseconds.
However, the DSL query specifies a size parameter.
If I set it to 5000, I get back 5000 records from the time range, but the range contains more records than that.
How can I retrieve all the data matching the time range without having to specify the size?

My DSL query is as below.

GET localhost:9200/_search
{
  "query": {
    "bool": {
      "must": [
        {"match_phrase": {"log": "SOME_VALUE"}},
        {"range": {
          "@timestamp": {
            "gte": "'"${fromDate}"'",
            "lte": "'"${toDate}"'",
            "format": "epoch_millis"
          }
        }}
      ]
    }
  },
  "size": 5000
}

fromDate = 1519842600000
toDate = 1520533800000


Solution

  • I couldn't get the scan API or the scroll pattern working; neither returned the expected results.

    I eventually worked around it by first capturing the number of hits and then passing that number as the size parameter when extracting the data.

    # Ask the _count API how many documents match and save the response.
    curl -s -X GET "localhost:9200/_count" -H 'Content-Type: application/json' -d '
    {
      "query": {
        "bool": {
          "must": [
            {"match_phrase": {"log": "SOME_VALUE"}},
            {"range": {
              "@timestamp": {
                "gte": "'"${fromDate}"'",
                "lte": "'"${toDate}"'",
                "format": "epoch_millis"
              }
            }}
          ]
        }
      }
    }' > count_size.txt
    # The response starts with {"count":N,...}; keep only N.
    size_count=$(cut -d "," -f1 count_size.txt | cut -d ":" -f2)
    echo "Total hits matching this criteria is ${size_count}"
    

    This gives me the size_count value. If it is within 10000 (the default index.max_result_window limit on from + size), I extract the data; otherwise I narrow the time range before extracting.

    # Fetch the matching documents, using the hit count as the size.
    curl -s -X GET "localhost:9200/_search" -H 'Content-Type: application/json' -d '
    {
      "query": {
        "bool": {
          "must": [
            {"match_phrase": {"log": "SOME_VALUE"}},
            {"range": {
              "@timestamp": {
                "gte": "'"${fromDate}"'",
                "lte": "'"${toDate}"'",
                "format": "epoch_millis"
              }
            }}
          ]
        }
      },
      "size": '"${size_count}"'
    }'
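
    The size check described above could be scripted roughly as follows. This is a minimal sketch, not the original script: extract_count and fits_in_one_request are hypothetical helper names, and MAX_WINDOW assumes the index keeps Elasticsearch's default index.max_result_window of 10000.

    ```shell
    # Assumed default of index.max_result_window; change it if the index overrides it.
    MAX_WINDOW=10000

    # Print the numeric "count" field from a _count response body,
    # e.g. {"count":1234,"_shards":{...}} gives 1234.
    extract_count() {
        printf '%s\n' "$1" |
            sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
    }

    # Succeed (exit status 0) when the hit count fits in a single _search request.
    fits_in_one_request() {
        [ "$1" -le "$MAX_WINDOW" ]
    }
    ```

    In the script, the helper could be fed the saved response, for example size_count=$(extract_count "$(cat count_size.txt)"), before deciding whether to run the search or to shrink the time range.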
    

    If a large data set is required for an extended period, I run this with several different date ranges and combine the results into the overall report.
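
    One way to produce the smaller date ranges is to cut the overall interval into fixed-size windows and repeat the count-then-search steps once per window. The following is a sketch under that assumption; split_range and the one-day step are illustrative and do not come from the original script.

    ```shell
    # Print "from to" pairs (epoch milliseconds, both ends inclusive) that
    # cover [start, end] in consecutive windows of at most `step` milliseconds.
    split_range() {
        start=$1
        end=$2
        step=$3
        from=$start
        while [ "$from" -le "$end" ]; do
            to=$((from + step - 1))
            # Clamp the last window to the end of the requested range.
            if [ "$to" -gt "$end" ]; then
                to=$end
            fi
            echo "$from $to"
            from=$((to + 1))
        done
    }

    # Hypothetical usage with one-day windows (86400000 ms):
    #   split_range "$fromDate" "$toDate" 86400000 |
    #   while read -r winFrom winTo; do
    #       # run the _count and _search requests with winFrom / winTo here
    #       :
    #   done
    ```

    Because the windows do not overlap, the per-window outputs can be concatenated without producing duplicate hits, provided the index is not being written to while the script runs.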

    The whole thing is written as a shell script, which makes it much simpler to use.