ElasticSearch query with filters and occurrence number

I have an ES instance that I push logs into, and ES is then used to search those logs. This is not ideal and there are plans to change it, but it is what it is. Sorry for the long description, but bear with me: the question is simple.

For now the search goes like this:

I have an index with N log lines
user enters a phrase to search for
I construct the ES query with:
- this phrase in query
- size=1 (so I only find one line)
- track_total_hits=true
- from=0
- sort=<something>

This gives me the first occurrence of a line matching the query (because the results are sorted, e.g. by timestamp). I also get the total hits, so I can present the user with:

the found line
the occurrence number (with initial search it's always 1)
the total hits

So the user knows this is occurrence 1/300 and can prompt the UI to find the next one. The search is the same, but if the user wants the next occurrence, I just pass from=1, from=2, etc. The performance of this is pretty okay, since I only have to download one line from ES.
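To make the paging scheme above concrete, here is a minimal sketch of how such a request body could be built. The function name and the field names `content` and `line_number` are assumptions (borrowed from the example response further down), not part of any real setup.

```python
def build_occurrence_query(phrase: str, occurrence: int) -> dict:
    """Build a search body that fetches only the n-th matching line (1-based)."""
    if occurrence < 1:
        raise ValueError("occurrence is 1-based")
    return {
        # Assumed field names; adjust to the real mapping.
        "query": {"match": {"content": phrase}},
        "sort": [{"line_number": {"order": "asc"}}],
        "from": occurrence - 1,    # occurrence 1 -> from=0, occurrence 2 -> from=1, ...
        "size": 1,                 # only one line is downloaded
        "track_total_hits": True,  # exact total for the "1/300" display
    }
```

One thing to keep in mind with this scheme: from + size is capped by the index setting index.max_result_window (10,000 by default), so very deep occurrences cannot be reached this way without raising that limit.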

That's great. However, this is all on a website that shows the user the logs. When the user does the initial search (before going to the next/previous occurrence), I want to show them the first matching line "after their cursor position".

For example, the user sees:

58 foo
59 bar
60 baz
[...]

so I want to scroll them down to the first matching line after line 58, not before.

The problem is, I still want to display the <occurrence>/<total> counter. In this case the initial search could return, for example, the fifth occurrence, i.e. 5/300, and the user could then go to the previous/next ones.

So, the solution is to download all the matching lines (without from= and size= in the query), then loop over them, find the first line whose line number is higher than the one the user sees (e.g. 58), and return it. By doing that, I can also count which occurrence it is, so I know to display, for example, 5/300 in the UI.
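For reference, the client-side loop described above would look roughly like this. It is only a sketch: the function name and the `line_number` / `content` field names are assumptions, and it presumes all matching lines have already been fetched and sorted.

```python
from typing import Optional, Tuple


def find_after_cursor(matching_lines: list, cursor_line: int) -> Optional[Tuple[dict, int, int]]:
    """Return (line, occurrence, total) for the first match after cursor_line.

    matching_lines must be sorted by line_number ascending and must contain
    every hit, which is exactly what makes this approach expensive.
    """
    total = len(matching_lines)
    for occurrence, line in enumerate(matching_lines, start=1):
        if line["line_number"] > cursor_line:
            return line, occurrence, total
    return None  # no match after the cursor


# Hypothetical data mirroring the example: seven lines containing "content".
lines = [{"line_number": n, "content": "content"} for n in (54, 55, 56, 57, 61, 62, 63)]
result = find_after_cursor(lines, 58)
```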

The problem with that is: I have to download all the lines from ES to do that. For indexes with millions and millions of lines, that could be a huge performance hit. So what I want to know is: is there a way to tell Elastic to:

get all the matching lines (matching phrase)
apply another filter here (line number > something)
get this line, but also return the information on "which occurrence of a matching line is that" (in all the matching lines, without the "line number" filter)

so for lines like:

54 content
55 content
56 content
57 content
58 foo
59 bar
60 baz
61 content
[...]

With the phrase "content" and searching "from line 58", I'd get a response like:

{
  "line": {"line_number": 61, "content": "content"},
  "total_hits": 300,
  "occurrence": 5
}

Solution

There are several different methods of achieving this, all based on the same principle. You need to perform three searches:

- one without the line filter, to get the total number of occurrences
- one filtered to lines up to your current line, to count the occurrences that precede the current one
- one filtered to lines after your current line, to find the current occurrence
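As a rough illustration, the three searches can be sent in one round trip through the multi-search API (`_msearch`), which takes newline-delimited JSON with a header line before each search body. The sketch below only builds that request body. The function name is hypothetical, and the index name `test` and the fields `line_no` and `line` are assumptions matching the example mapping used in this answer.

```python
import json


def build_msearch_body(index: str, phrase: str, cursor_line: int) -> str:
    """Build an NDJSON body for POST /_msearch containing the three searches."""
    match = {"match": {"line": phrase}}

    def with_range(range_clause: dict) -> dict:
        # Phrase match plus a non-scoring range filter on the line number.
        return {"bool": {"must": [match],
                         "filter": [{"range": {"line_no": range_clause}}]}}

    searches = [
        # 1. total number of occurrences (no documents returned)
        {"size": 0, "track_total_hits": True, "query": match},
        # 2. occurrences up to and including the cursor line
        {"size": 0, "track_total_hits": True,
         "query": with_range({"lte": cursor_line})},
        # 3. first occurrence after the cursor line
        {"size": 1, "query": with_range({"gt": cursor_line}),
         "sort": [{"line_no": {"order": "asc"}}]},
    ]
    rows = []
    for body in searches:
        rows.append(json.dumps({"index": index}))  # header line
        rows.append(json.dumps(body))              # search body line
    return "\n".join(rows) + "\n"  # _msearch requires a trailing newline
```

In the response, `responses[0].hits.total.value` would be the total, `responses[1].hits.total.value` the number of preceding occurrences, and `responses[2].hits.hits[0]` the line to show.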

This can be done with multi-search, with filter + top_hits aggregations, or with filter + global aggregations. Here is an example of how to achieve that using filter + global aggregations:

DELETE test
PUT test
{
  "mappings": {
    "properties": {
      "line_no": {
        "type": "integer"
      },
      "line": {
        "type": "text"
      }
    }
  }
}

POST test/_bulk?refresh=true
{ "index": { "_id": "1" } }
{ "line_no": 54, "line": "content"}
{ "index": { "_id": "2" } }
{ "line_no": 55, "line": "content"}
{ "index": { "_id": "3" } }
{ "line_no": 56, "line": "content"}
{ "index": { "_id": "4" } }
{ "line_no": 57, "line": "content"}
{ "index": { "_id": "5" } }
{ "line_no": 58, "line": "foo"}
{ "index": { "_id": "6" } }
{ "line_no": 59, "line": "bar"}
{ "index": { "_id": "7" } }
{ "line_no": 60, "line": "baz"}
{ "index": { "_id": "8" } }
{ "line_no": 61, "line": "content"}
{ "index": { "_id": "9" } }
{ "line_no": 62, "line": "content"}
{ "index": { "_id": "10" } }
{ "line_no": 63, "line": "content"}



POST test/_search?filter_path=hits.hits,aggregations.all.all_occurrences.doc_count,aggregations.all.all_occurrences.previous_occurrences.doc_count
{
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "line_no": {
              "gt": 59
            }
          }
        },
        {
          "match": {
            "line": "content"
          }
        }
      ]
    }
  },
  "sort": [
    {
      "line_no": {
        "order": "asc"
      }
    }
  ],
  "aggs": {
    "all": {
      "global": {},
      "aggs": {
        "all_occurrences": {
          "filter": {
            "match": {
              "line": "content"
            }
          },
          "aggs": {
            "previous_occurrences": {
              "filter": {
                "range": {
                  "line_no": {
                    "lte": 59
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

The result of this query will be:

{
  "hits": {
    "hits": [
      {
        "_index": "test",
        "_id": "8",
        "_score": 1.3829923,
        "_source": {
          "line_no": 61,
          "line": "content"
        },
        "sort": [
          61
        ]
      }
    ]
  },
  "aggregations": {
    "all": {
      "all_occurrences": {
        "doc_count": 7,
        "previous_occurrences": {
          "doc_count": 4
        }
      }
    }
  }
}

In the result above, hits.hits[0] is the next line matching your query after line 59. aggregations.all.all_occurrences.doc_count is the number of lines that contain "content" (it was 300 in your theoretical example, but I reduced it to 7 to keep the example concise). Finally, aggregations.all.all_occurrences.previous_occurrences.doc_count is the number of occurrences that come before your current line. To get the current occurrence number, add 1 to it.
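Putting those three values together on the client side could look like the sketch below. The function name is hypothetical, and it assumes the response has already been parsed into a dict and that filter_path lets both doc_count values through.

```python
from typing import Optional


def summarize(response: dict) -> Optional[dict]:
    """Turn the search response into the shape asked for in the question."""
    hits = response["hits"]["hits"]
    if not hits:
        return None  # no occurrence after the cursor line

    occurrences = response["aggregations"]["all"]["all_occurrences"]
    previous = occurrences["previous_occurrences"]["doc_count"]
    return {
        "line": hits[0]["_source"],
        "total_hits": occurrences["doc_count"],
        "occurrence": previous + 1,  # 4 earlier matches -> this is the 5th
    }


# Response trimmed to the fields used above, with the values from the example.
example = {
    "hits": {"hits": [{"_id": "8", "_source": {"line_no": 61, "line": "content"}}]},
    "aggregations": {"all": {"all_occurrences": {
        "doc_count": 7,
        "previous_occurrences": {"doc_count": 4},
    }}},
}
summary = summarize(example)
```

For the example data this yields occurrence 5 of 7, pointing at line 61.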