I have an ES instance that I push logs into, and ES is then used to search those logs. This is not ideal and there are plans to change it, but it is what it is. Sorry for the long description, but bear with me; the question is simple.
For now the search goes like this:
size=1 (so I only find one line)
track_total_hits=true
from=0
sort=<something>
So, this gives me the first occurrence of a line matching a particular query (because the results are sorted, e.g. by timestamp). I also get the total hits, so I can present the user with a counter such as 1/300. The user then knows this is occurrence 1 of 300 and can prompt the UI to find the next one. The search stays the same, but when the user asks for the next occurrence I just pass from=1, from=2, etc. The performance of this is pretty okay, since I only have to download one line from ES.
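A minimal sketch of how such a request body could be built in Python. The field names `timestamp` and `content` are assumptions for illustration (the real sort field is only given as `<something>`), and `build_search_body` is a hypothetical helper, not part of any client library.

```python
import json

def build_search_body(phrase: str, occurrence_index: int) -> dict:
    """Body for fetching the N-th (0-based) matching line, one hit at a time."""
    return {
        "size": 1,                                   # download a single line
        "from": occurrence_index,                    # 0 = first occurrence, 1 = second, ...
        "track_total_hits": True,                    # needed for the "x/300" counter
        "sort": [{"timestamp": {"order": "asc"}}],   # assumed sort field
        "query": {"match": {"content": phrase}},     # assumed text field
    }

first = build_search_body("content", 0)
third = build_search_body("content", 2)
print(json.dumps(third, indent=2))
```

One thing to keep in mind with this approach: `from` + `size` is capped by the `index.max_result_window` index setting (10,000 by default), so paging this way stops working for very deep occurrences unless that setting is raised.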
That's great. However, this is all on a website that shows the user the logs. What I want is this: when the user does the initial search (before going to the next/previous occurrence), I want to show them the first matching line after their cursor position.
For example, the user sees:
58 foo
59 bar
60 baz
[...]
so I want to scroll them down to the first matching line after line 58, not before.
The problem is, I still want to display the 1/<something> occurrence counter. In this case the initial search could return, for example, the fifth occurrence, i.e. 5/300, and the user could then go to the previous/next ones.
So, the solution is to download all the matching lines (without from= and size= in the query), then loop over them, find the first line whose line number is higher than the one the user sees (i.e. 58), and return it. By doing that I can also count which occurrence it is, so I know to display, for example, 5/300 in the UI.
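A minimal sketch of that client-side loop, assuming the matching lines have already been downloaded and are sorted by line number. The function name `find_occurrence` and the sample data are hypothetical; the result shape mirrors the response I would like to get.

```python
from typing import Optional

def find_occurrence(matches: list, cursor_line: int) -> Optional[dict]:
    """Return the first match after cursor_line plus its 1-based position.

    `matches` must contain every matching line, sorted ascending by line_number.
    """
    total = len(matches)
    for index, line in enumerate(matches):
        if line["line_number"] > cursor_line:
            return {"line": line, "total_hits": total, "occurrence": index + 1}
    return None  # no match after the cursor

# Hypothetical sample: seven matching lines, four of them before line 58.
matches = [
    {"line_number": n, "content": "content"}
    for n in (54, 55, 56, 57, 61, 62, 63)
]
result = find_occurrence(matches, 58)
print(result)
```

The loop itself is cheap; the cost is in transferring every matching line just to count how many come before the cursor.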
The problem with that is that I have to download all the lines from ES. For indexes with millions and millions of lines, that could be a huge performance hit. So what I want to know is: is there a way to tell Elastic to return the first matching line after a given line number, together with the total number of hits and the position of that line among all hits?
So for lines like:
54 content
55 content
56 content
57 content
58 foo
59 bar
60 baz
61 content
[...]
and the search phrase content, searching "from line 58", I'd have a response like:
{
  "line": {"line_number": 61, "content": "content"},
  "total_hits": 300,
  "occurrence": 5
}
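There are several different methods of achieving this, all based on the same principle. You need to perform three searches: one that finds the first matching line after the current position, one that counts all matching lines, and one that counts the matching lines at or before the current position.
This can be done with multi-search, with filter + top_hits aggregation, or with filter + global aggregation. Here is an example of how to achieve that using filter + global aggregation: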
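For comparison, the multi-search variant could be sketched roughly like this in Python. It only builds the newline-delimited body for `_msearch`; the helper name `build_msearch_body` is hypothetical, and the field names `line` and `line_no` are assumed to match the test index used in the example that follows.

```python
import json

def build_msearch_body(index: str, phrase: str, cursor_line: int) -> str:
    """Build an NDJSON _msearch body with the three searches."""
    match = {"match": {"line": phrase}}
    after = {"range": {"line_no": {"gt": cursor_line}}}
    before = {"range": {"line_no": {"lte": cursor_line}}}
    searches = [
        # 1. First matching line after the cursor.
        {
            "size": 1,
            "query": {"bool": {"must": [match], "filter": [after]}},
            "sort": [{"line_no": {"order": "asc"}}],
        },
        # 2. Count of all matching lines (no documents returned).
        {"size": 0, "track_total_hits": True, "query": match},
        # 3. Count of matching lines at or before the cursor.
        {
            "size": 0,
            "track_total_hits": True,
            "query": {"bool": {"must": [match], "filter": [before]}},
        },
    ]
    lines = []
    for search in searches:
        lines.append(json.dumps({"index": index}))  # header line
        lines.append(json.dumps(search))            # body line
    return "\n".join(lines) + "\n"                  # body must end with a newline

body = build_msearch_body("test", "content", 59)
print(body)
```

The body would be sent as `POST _msearch` with the `application/x-ndjson` content type. The next line then comes from the first entry of `responses`, and the two counts from `hits.total.value` of the second and third entries.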
DELETE test
PUT test
{
  "mappings": {
    "properties": {
      "line_no": {
        "type": "integer"
      },
      "line": {
        "type": "text"
      }
    }
  }
}
POST test/_bulk?refresh=true
{ "index": { "_id": "1" } }
{ "line_no": 54, "line": "content"}
{ "index": { "_id": "2" } }
{ "line_no": 55, "line": "content"}
{ "index": { "_id": "3" } }
{ "line_no": 56, "line": "content"}
{ "index": { "_id": "4" } }
{ "line_no": 57, "line": "content"}
{ "index": { "_id": "5" } }
{ "line_no": 58, "line": "foo"}
{ "index": { "_id": "6" } }
{ "line_no": 59, "line": "bar"}
{ "index": { "_id": "7" } }
{ "line_no": 60, "line": "baz"}
{ "index": { "_id": "8" } }
{ "line_no": 61, "line": "content"}
{ "index": { "_id": "9" } }
{ "line_no": 62, "line": "content"}
{ "index": { "_id": "10" } }
{ "line_no": 63, "line": "content"}
POST test/_search?filter_path=hits.hits,aggregations.all.all_occurrences.doc_count,aggregations.all.all_occurrences.previous_occurrences.doc_count
{
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "line_no": {
              "gt": 59
            }
          }
        },
        {
          "match": {
            "line": "content"
          }
        }
      ]
    }
  },
  "sort": [
    {
      "line_no": {
        "order": "asc"
      }
    }
  ],
  "aggs": {
    "all": {
      "global": {},
      "aggs": {
        "all_occurrences": {
          "filter": {
            "match": {
              "line": "content"
            }
          },
          "aggs": {
            "previous_occurrences": {
              "filter": {
                "range": {
                  "line_no": {
                    "lte": 59
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
The result of this query will be:
{
  "hits": {
    "hits": [
      {
        "_index": "test",
        "_id": "8",
        "_score": 1.3829923,
        "_source": {
          "line_no": 61,
          "line": "content"
        },
        "sort": [
          61
        ]
      }
    ]
  },
  "aggregations": {
    "all": {
      "all_occurrences": {
        "doc_count": 7,
        "previous_occurrences": {
          "doc_count": 4
        }
      }
    }
  }
}
In the result above, hits.hits[0] is the next line matching your query after line 59. aggregations.all.all_occurrences.doc_count is the number of lines that contain "content" (it was 300 in your theoretical example, but I reduced it to 7 to keep the example concise). Finally, aggregations.all.all_occurrences.previous_occurrences.doc_count is the number of occurrences at or before your current line. To get the current occurrence number, add 1 to it.
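As a rough illustration, the response could be turned into the shape asked for in the question with something like the Python sketch below. The function name `summarize` is hypothetical, and it assumes the response contains both `doc_count` values described above; the sample dictionary is a trimmed copy of the example result.

```python
from typing import Optional

def summarize(response: dict) -> Optional[dict]:
    """Combine the single hit and the two aggregation counts."""
    # With filter_path, "hits" can be missing entirely when nothing matched.
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return None  # no matching line after the cursor
    all_occurrences = response["aggregations"]["all"]["all_occurrences"]
    previous = all_occurrences["previous_occurrences"]["doc_count"]
    return {
        "line": hits[0]["_source"],
        "total_hits": all_occurrences["doc_count"],
        "occurrence": previous + 1,  # matches at or before the cursor, plus this one
    }

# Trimmed sample response for the ten-document test index.
sample = {
    "hits": {"hits": [{"_id": "8", "_source": {"line_no": 61, "line": "content"}}]},
    "aggregations": {
        "all": {
            "all_occurrences": {
                "doc_count": 7,
                "previous_occurrences": {"doc_count": 4},
            }
        }
    },
}
summary = summarize(sample)
print(summary)
```

For the sample this gives occurrence 5 of 7, pointing at line 61.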