Search code examples
pipelinedb

PipelineDB, get counts for top K items


How to calculate frequencies of top K values in the stream?

Let's say we have a stream

CREATE STREAM stream (
  value number
);

And we inserted ten rows

INSERT INTO stream (value) VALUES (1)
INSERT INTO stream (value) VALUES (1)
INSERT INTO stream (value) VALUES (1)
INSERT INTO stream (value) VALUES (2)
INSERT INTO stream (value) VALUES (2)
INSERT INTO stream (value) VALUES (3)
INSERT INTO stream (value) VALUES (4)
INSERT INTO stream (value) VALUES (5)
INSERT INTO stream (value) VALUES (6)
INSERT INTO stream (value) VALUES (7)

How can I get back the top 2 items and their frequencies?

value | frequency
-----------------
1     | 0.3
2     | 0.2

I suppose it should somehow use both Top K and the Count-min Sketch together?


Solution

  • You can use fss_agg for that:

    CREATE CONTINUOUS VIEW v AS
      SELECT fss_agg(x, 10) AS top_10_x FROM some_stream
    

    This will keep track of the top 10 most frequently occurring values of x. The weight given to each value can also be explicitly given:

    CREATE CONTINUOUS VIEW v AS
      SELECT fss_agg_weighted(x, 10, y) AS top_10_x FROM some_stream
    

    The first version implicitly uses a weight of 1.

    There are various functions you can use to read the top-K values and their associated frequencies. For example, the following will return tuples of the form: (value, frequency):

    SELECT fss_topk(top_10_x) FROM v