
How to get Quantile/median values in pydruid


My goal is to query the median value of the column height in my Druid datasource. Other aggregations, such as count and distinct count, already work. Here's my query so far:

group = query.groupby(
    datasource=datasource,
    granularity='all',
    intervals='2020-01-01T00:00:00+00:00/2101-01-01T00:00:00+00:00',
    dimensions=[
        "category_a"
    ],
    filter=(Dimension("country") == country_id),
    aggregations={
        'count': longsum('count'),
        'count_distinct_city': aggregators.thetasketch('city'),
    }
)

pydruid has a Quantile class in postaggregator.py, so I tried using it:

class Quantile(Postaggregator):
    def __init__(self, name, probability):
        Postaggregator.__init__(self, None, None, name)
        self.post_aggregator = {
            "type": "quantile",
            "fieldName": name,
            "probability": probability,
        }

Here's my attempt at getting the median:

post_aggregations={
    'median_value': postaggregator.Quantile(
        'height', 50
    )
}

The error I'm getting says that Druid could not resolve the type id 'quantile' as a subtype of PostAggregator. The full message:

Druid Error: {'error': 'Unknown exception', 'errorMessage': 'Could not resolve type id \'quantile\' as a subtype of [simple type, class io.druid.query.aggregation.PostAggregator]: known type ids = [arithmetic, constant, doubleGreatest, doubleLeast, expression, fieldAccess, finalizingFieldAccess, hyperUniqueCardinality, javascript, longGreatest, longLeast, quantilesDoublesSketchToHistogram, quantilesDoublesSketchToQuantile, quantilesDoublesSketchToQuantiles, quantilesDoublesSketchToString, sketchEstimate, sketchSetOper, thetaSketchEstimate, thetaSketchSetOp] (for POJO property \'postAggregations\')\n at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 1, column: 856] (through reference chain: io.druid.query.groupby.GroupByQuery["postAggregations"]->java.util.ArrayList[0])', 'errorClass': 'com.fasterxml.jackson.databind.exc.InvalidTypeIdException', 'host': None}

Solution

  • I modified the pydruid code to get this working on our end, adding a new aggregator and a new post-aggregator under /pydruid/utils.

    aggregator.py

    def quantilesDoublesSketch(raw_column, k=128):
        return {"type": "quantilesDoublesSketch", "fieldName": raw_column, "k": k}
    

    postaggregator.py

    class QuantilesDoublesSketchToQuantile(Postaggregator):
        def __init__(self, name: str, field_name: str, fraction: float):
            self.post_aggregator = {
                "type": "quantilesDoublesSketchToQuantile",
                "name": name,
                "fraction": fraction,
                "field": {
                    "fieldName": field_name,
                    "name": field_name,
                    "type": "fieldAccess",
                },
            }
    

    This is my first time creating a PR! Hopefully it gets accepted and published officially.

    https://github.com/druid-io/pydruid/pull/287
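
To show how the two pieces fit together for the median, here is a minimal sketch that builds the same JSON fragments with plain dictionaries, so it runs without a patched pydruid. The helper names (quantiles_doubles_sketch, sketch_to_quantile, with_names) and the output name height_sketch are made up for illustration. It assumes the sketch extension is loaded on the cluster, which the known type ids in the error message above suggest. Note that the post-aggregator's field refers to the aggregator's output name, not the raw column, and that the fraction for a median is 0.5 rather than 50.

```python
import json


def quantiles_doubles_sketch(raw_column, k=128):
    """Aggregator spec: build a quantiles sketch over a numeric column."""
    return {"type": "quantilesDoublesSketch", "fieldName": raw_column, "k": k}


def sketch_to_quantile(name, field_name, fraction):
    """Post-aggregator spec: read one quantile out of a sketch aggregator."""
    return {
        "type": "quantilesDoublesSketchToQuantile",
        "name": name,
        "fraction": fraction,
        "field": {
            "type": "fieldAccess",
            "fieldName": field_name,  # output name of the sketch aggregator
            "name": field_name,
        },
    }


def with_names(specs):
    """Turn a {output_name: spec} mapping into a list of named specs."""
    return [dict(spec, name=name) for name, spec in specs.items()]


# Hypothetical output names; "height" is the column from the question.
aggregations = {"height_sketch": quantiles_doubles_sketch("height")}
post_aggregations = {
    "median_value": sketch_to_quantile("median_value", "height_sketch", 0.5)
}

query_fragment = {
    "aggregations": with_names(aggregations),
    "postAggregations": with_names(post_aggregations),
}

print(json.dumps(query_fragment, indent=2))
```

With the patched pydruid from the answer, the equivalent would presumably be to pass quantilesDoublesSketch('height') in aggregations and QuantilesDoublesSketchToQuantile('median_value', 'height_sketch', 0.5) in post_aggregations of the groupby call.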