Search code examples
postgresqldbtpercentile

How do I get cume_dist (or percentile) of a value from one column in another?


I'm using postgresql in DBT, so either a postgresql or a DBT solution is viable.

I have a table with fundamentally 4 columns, target, question, value, current

I can get the percentile (cume_dist) of the value with:

(cume_dist() over (partition by question order by value)) * 100 as percentile

However, targets where the current is zero aren't in the comparison. I need the percentile of them, but they aren't in the "dataset" (they are historical data from the same targets, so they aren't current, but I need the percentile as if they were current)

However, I effectively need the percentile of every value, within the subset of values where current is 1.

ETA: I see in the documentation that cume_dist can take args, but I haven't found any documentation on what those args represent or how to use them...


Solution

  • Re-implement cume_dist()

    As far as I know, there is no way to do this with the existing cume_dist() function, but it's pretty easy to implement the same functionality without cume_dist().

    The postgres docs say cume_dist()

    Returns the cumulative distribution, that is (number of partition rows preceding or peers with current row) / (total partition rows). The value thus ranges from 1/N to 1.

    Here is how I set up my demo data:

    CREATE TABLE my_table (
      target VARCHAR,
      question VARCHAR,
      value INT,
      current BOOL
    );
    
    INSERT INTO my_table VALUES
    ('alice',   'number of hats', 7, true),
    ('alice',   'number of hats', 6, false),
    ('alice',   'number of hats', 3, false),
    ('bob',     'number of hats', 4, true),
    ('charles', 'number of hats', 9, true),
    ('david',   'number of hats', 0, true),
    ('erin',    'number of hats', 4, true),
    ('alice',   'number of fingers', 10, true),
    ('bob',     'number of fingers', 10, true),
    ('charles', 'number of fingers',  9, true),
    ('charles', 'number of fingers', 10, false),
    ('david',   'number of fingers', 10, true),
    ('erin',    'number of fingers', 10, true);
    

    and here is how to do that same thing (rows preceding) / (total rows) without cume_dist() and with some added logic to exclude rows where that aren't current:

    SELECT
      target,
      question,
      value,
      current,
      (
        -- this is the count of partition rows preceding this row
        SELECT COUNT(1)
        FROM my_table AS inner_table
        WHERE
          inner_table.question = outer_table.question AND
          current = true AND
          inner_table.value <= outer_table.value
      )::real / (
        -- this is the total number of rows for this question
        SELECT COUNT(1)
        FROM my_table AS inner_table
        WHERE
          inner_table.question = outer_table.question AND
          current = true
      )::real * 100.0 AS percentile
    FROM my_table AS outer_table;
    

    Here is a fiddle link with this query