Search code examples
sqlpostgresqlmetabase

How to select cumulative counts by group over time across in postgres


I have cumulative counts for two groups over time in this format:

Date Group Cumulative Count
1/1/2020 A 1
1/2/2020 A 3
1/2/2020 B 1
1/3/2020 B 2

And I'd like to reshape this data into this format:

Date Group Cumulative Count
1/1/2020 A 1
1/1/2020 B 0
1/2/2020 A 3
1/2/2020 B 1
1/3/2020 A 3
1/3/2020 B 2

So that I can get it to display accurately in a stacked area chart in metabase - any advice?


Solution

  • You may generate all possible pairs of dates and groups with a cross join before using a left join to get the combined dataset eg

    Since your dataset already has the cumulative counts, the missing values identified by null have been replaced using the most recent cumulative count with MAX and COALESCE.

    SELECT
        d."Date"::text,
        d."Group",
        COALESCE(m."CumulativeCount",COALESCE(MAX(m."CumulativeCount") OVER (
            PARTITION BY d."Group"
            ORDER BY d."Date"
        ),0)) as CumulativeCount
    FROM (
        SELECT "Date", "Group" FROM (
            SELECT DISTINCT
                "Date"
            FROM
                my_data
        ) t1
        CROSS JOIN (
            SELECT DISTINCT
                "Group"
            FROM
                my_data
        ) t2
    ) d 
    LEFT JOIN my_data m ON m."Date"=d."Date" AND
                           m."Group" = d."Group"
    ORDER BY 1,2;
    
    Date Group cumulativecount
    2020-01-01 A 1
    2020-01-01 B 0
    2020-01-02 A 3
    2020-01-02 B 1
    2020-01-03 A 3
    2020-01-03 B 2

    View working demo on DB Fiddle

    Update 1

    If it is that you would like to generate values for dates between missing dates eg you had the next date to be 1/7/2020 and you wanted to fill the gaps for 1/3/2020 you could use generate_series to generate the possible dates and MAX to get the most recent value. I have included a fiddle below with additional sample data eg

    Schema (PostgreSQL v13)

    CREATE TABLE my_data (
      "Date" DATE,
      "Group" VARCHAR(1),
      "CumulativeCount" INTEGER
    );
    
    INSERT INTO my_data
      ("Date", "Group", "CumulativeCount")
    VALUES
      ('1/1/2020', 'A', '1'),
      ('1/2/2020', 'A', '3'),
      ('1/2/2020', 'B', '1'),
      ('1/3/2020', 'B', '2'),
      ('1/1/2020', 'C', '2'),
      ('1/7/2020', 'C', '3');
    

    Query #1

    SELECT
        d."Date"::text,
        d."Group",
        COALESCE(
            m."CumulativeCount",
            COALESCE(MAX(m."CumulativeCount") OVER (
                PARTITION BY d."Group"
                ORDER BY d."Date"
            ),0)
        ) as CumulativeCount
    FROM (
        SELECT "Date", "Group" FROM (
            SELECT
                GENERATE_SERIES(
                    MIN("Date"),
                    MAX("Date"),
                    INTERVAL '1' DAY
                ) as "Date"
            FROM
                my_data
        ) t1
        CROSS JOIN (
            SELECT DISTINCT
                "Group"
            FROM
                my_data
        ) t2
    ) d 
    LEFT JOIN my_data m ON m."Date"=d."Date" AND
                           m."Group" = d."Group"
    ORDER BY 1,2;
    
    Date Group cumulativecount
    2020-01-01 00:00:00+00 A 1
    2020-01-01 00:00:00+00 B 0
    2020-01-01 00:00:00+00 C 2
    2020-01-02 00:00:00+00 A 3
    2020-01-02 00:00:00+00 B 1
    2020-01-02 00:00:00+00 C 2
    2020-01-03 00:00:00+00 A 3
    2020-01-03 00:00:00+00 B 2
    2020-01-03 00:00:00+00 C 2
    2020-01-04 00:00:00+00 A 3
    2020-01-04 00:00:00+00 B 2
    2020-01-04 00:00:00+00 C 2
    2020-01-05 00:00:00+00 A 3
    2020-01-05 00:00:00+00 B 2
    2020-01-05 00:00:00+00 C 2
    2020-01-06 00:00:00+00 A 3
    2020-01-06 00:00:00+00 B 2
    2020-01-06 00:00:00+00 C 2
    2020-01-07 00:00:00+00 A 3
    2020-01-07 00:00:00+00 B 2
    2020-01-07 00:00:00+00 C 3

    View working demo on DB Fiddle