Search code examples
elasticsearchodbcnestcdatacdata-drivers

How to get sum of score column from ElasticSearch


I am new to Elasticsearch. I tried to get result from ES using CData Elasticsearch ODBC driver. Is it possible to get sum of score field?

My code:

OdbcConnection connection = new OdbcConnection("Driver={CData ODBC Driver for Elasticsearch};server=localhost");
        connection.Open();
        string query = "select sum(_score) from ordersdetails";
        OdbcCommand odbcCommand = new OdbcCommand(query, connection);
        OdbcDataReader dataReader = odbcCommand.ExecuteReader();
        DataTable dataTable = new DataTable();
        dataTable.Load(dataReader);
        connection.Close();

I have faced the below exception

System.Data.Odbc.OdbcException: 'ERROR [HY000] The '_score' column is not applicable to the sum function.'

But the below query returns result:

"select _id, sum(_score) from ordersdetails group by _id"

Anybody know, Why I got exception when tried to get a result for a single column?

If you know the solution, please share with me.


Solution

  • After doing a couple of experiments with pyodbc and ElasticSearch I come to these conclusions:

    1. CData ODBC driver knows that it is not possible to do aggregation on _score and does not let the user to do so
    2. Behavior where it actually computes aggregation by _score is most likely a bug and is performed not by ElasticSearch but by the driver.

    In short, don't use _score for any GROUP BY, it's a special feature of ElasticSearch dedicated to relevance sorting.

    A bit of introduction

    As I already mentioned in the comments to the question, _score in ElasticSearch is a measure of how document is relevant to a given query (see docs):

    The relevance score of each document is represented by a positive floating-point number called the _score. The higher the _score, the more relevant the document.

    This field is not a part of the document and is computed for every query and every document. In ElasticSearch _score is used for sorting. However, _score is not always computed, for instance when sorting on an existing field is required:

    The _score is not calculated, because it is not being used for sorting.

    Since this field is computed on-the-fly, it is not possible to create efficient aggregation, hence ElasticSearch does not allow this directly. However, this still can be achieved by using scripts in the aggregations.

    CData ODBC Driver is aware of _score field

    CData ODBC Driver is aware of _score field:

    When the _score column is selected, scoring will be requested by issuing a query context request, which scores the quality of the search results. By default, results are returned in descending order based on the calculated _score. An ORDER BY clause can be specified to change the order of the returned results.

    When the _score column is not selected, a filter context will be sent, in which case Elasticsearch will not compute scores. The results for these queries will be returned in arbitrary order unless an ORDER BY clause is explicitly specified.

    Basically, this means that by explicitly mentioning _score in your query will make ODBC return such field (which might to be there by default).

    The experiments

    I installed pyodbc and set up ElasticSearch 5.4 at my localhost. I tuned ES to log all the queries it receives.

    1.

    At first I reproduced the first case:

    cursor.execute("SELECT sum(_score) FROM my_index.my_type")
    

    And received this exception:

    [HY000] The '_score' column is not applicable to the sum function.
    

    In the log of ES I catched this query:

    {"from":0,"size":100}
    

    2.

    Next I took the second query:

    cursor.execute("SELECT _id, sum(_score) FROM my_index.my_type GROUP BY _id")
    

    Which executed without exceptions, but resulted in this ES query:

    {"from":0,"size":10000,"_source":{"includes":["_id","_score"],"excludes":[]}}
    

    3.

    Then I tried to mock the library with non existing fields:

    cursor.execute("SELECT sum(score42) FROM simple_index.simple_type")
    

    In this case exception was different:

    [HY000] 'score42' is not a valid column.
    

    Although the query sent to ES was the same as in the first case.

    4.

    Then I tried to find out how does the library send aggregate requests:

    cursor.execute("SELECT sum(likes) FROM simple_index.simple_type GROUP BY likes")
    

    In fact, it did use ES aggregations:

    {
      "from": 0,
      "size": 0,
      "aggregations": {
        "likes": {
          "terms": {
            "field": "likes",
            "size": 2147483647,
            "min_doc_count": 1,
            "shard_min_doc_count": 0,
            "show_term_doc_count_error": false,
            "order": [
              {
                "_count": "desc"
              },
              {
                "_term": "asc"
              }
            ]
          },
          "aggregations": {
            "sum_likes": {
              "sum": {
                "field": "likes"
              }
            }
          }
        }
      }
    }
    

    Conclusions

    The fact that the library is able to recognize _score as a special keyword, and also because it did not attempt to generate ES aggregations when asked for sum(_score), I assume that it does not allow in general to do aggregations on _score and the "working" case here is likely a bug.