I am new to Elasticsearch. I tried to get result from ES using CData Elasticsearch ODBC driver. Is it possible to get sum of score field?
My code:
OdbcConnection connection = new OdbcConnection("Driver={CData ODBC Driver for Elasticsearch};server=localhost");
connection.Open();
string query = "select sum(_score) from ordersdetails";
OdbcCommand odbcCommand = new OdbcCommand(query, connection);
OdbcDataReader dataReader = odbcCommand.ExecuteReader();
DataTable dataTable = new DataTable();
dataTable.Load(dataReader);
connection.Close();
I have faced the below exception
System.Data.Odbc.OdbcException: 'ERROR [HY000] The '_score' column is not applicable to the sum function.'
But the below query returns result:
"select _id, sum(_score) from ordersdetails group by _id"
Anybody know, Why I got exception when tried to get a result for a single column?
If you know the solution, please share with me.
After doing a couple of experiments with pyodbc
and ElasticSearch I come to these conclusions:
_score
and does not let the user to do so_score
is most likely a bug and is performed not by ElasticSearch but by the driver.In short, don't use _score
for any GROUP BY
, it's a special feature of ElasticSearch dedicated to relevance sorting.
As I already mentioned in the comments to the question, _score
in ElasticSearch is a measure of how document is relevant to a given query (see docs):
The relevance score of each document is represented by a positive floating-point number called the _score. The higher the _score, the more relevant the document.
This field is not a part of the document and is computed for every query and every document. In ElasticSearch _score
is used for sorting. However, _score
is not always computed, for instance when sorting on an existing field is required:
The _score is not calculated, because it is not being used for sorting.
Since this field is computed on-the-fly, it is not possible to create efficient aggregation, hence ElasticSearch does not allow this directly. However, this still can be achieved by using scripts in the aggregations.
CData ODBC Driver is aware of _score
field:
When the _score column is selected, scoring will be requested by issuing a query context request, which scores the quality of the search results. By default, results are returned in descending order based on the calculated _score. An ORDER BY clause can be specified to change the order of the returned results.
When the _score column is not selected, a filter context will be sent, in which case Elasticsearch will not compute scores. The results for these queries will be returned in arbitrary order unless an ORDER BY clause is explicitly specified.
Basically, this means that by explicitly mentioning _score
in your query will make ODBC return such field (which might to be there by default).
I installed pyodbc and set up ElasticSearch 5.4 at my localhost. I tuned ES to log all the queries it receives.
At first I reproduced the first case:
cursor.execute("SELECT sum(_score) FROM my_index.my_type")
And received this exception:
[HY000] The '_score' column is not applicable to the sum function.
In the log of ES I catched this query:
{"from":0,"size":100}
Next I took the second query:
cursor.execute("SELECT _id, sum(_score) FROM my_index.my_type GROUP BY _id")
Which executed without exceptions, but resulted in this ES query:
{"from":0,"size":10000,"_source":{"includes":["_id","_score"],"excludes":[]}}
Then I tried to mock the library with non existing fields:
cursor.execute("SELECT sum(score42) FROM simple_index.simple_type")
In this case exception was different:
[HY000] 'score42' is not a valid column.
Although the query sent to ES was the same as in the first case.
Then I tried to find out how does the library send aggregate requests:
cursor.execute("SELECT sum(likes) FROM simple_index.simple_type GROUP BY likes")
In fact, it did use ES aggregations:
{
"from": 0,
"size": 0,
"aggregations": {
"likes": {
"terms": {
"field": "likes",
"size": 2147483647,
"min_doc_count": 1,
"shard_min_doc_count": 0,
"show_term_doc_count_error": false,
"order": [
{
"_count": "desc"
},
{
"_term": "asc"
}
]
},
"aggregations": {
"sum_likes": {
"sum": {
"field": "likes"
}
}
}
}
}
}
The fact that the library is able to recognize _score
as a special keyword, and also because it did not attempt to generate ES aggregations when asked for sum(_score)
, I assume that it does not allow in general to do aggregations on _score
and the "working" case here is likely a bug.