Search code examples
sqlgoogle-bigquerydata-warehouse

bigquery aggregate for daily basis


I have a table in big-query (datawarehouse):

enter image description here

and I would like to have the result of:

enter image description here

Here is the explanation on how the calculation should be:

  1. 2017-10-01 = $100 is obvious, because the data is only one
  2. 2017-10-02 = $400 is a sum of the first row and third row. Why? Because second row and third row have the same invoice. So we only use the latest update.
  3. 2017-10-04 = $800 is a sum of row 1,3, and 4. Why? It is because we only take one invoice only per day. row 1 (T001), row 3(T002), row 4(T003)
  4. 2017-10-05 = $100 is a sum of row 1,5, and 6. Why? It is because we only take one invoice only per day. row 1 (T001), row 5(T002), row 6(T003)

I honestly have completely lost how to do that. I have tried multiple times to group by and etc. But none of them work as expected. This is my latest effort so far for today:

SELECT 
  amount,
  updatedDateOnly,
  invNo
FROM 
(
  SELECT 
    invNo,
    UpdatedDate,
    amount,
    DATE(updatedDate) as updatedDateOnly,
    row_number() OVER (PARTITION BY  invNo ORDER BY UpdatedDate DESC) AS rownum
  FROM [project:dataset.test] 
)
WHERE
  rownum = 1

only returns the last date. Now, I have no idea how to query for daily basis.

Appreciate for anyone who is expert and willing to help in querying. Thank you.

UPDATE: Data in json, in case you want to try in bigquery or other SQL servers:

{"UpdatedDate":"2017-10-01 01:00:00","InvNo":"T001","amount":100}
{"UpdatedDate":"2017-10-02 01:00:00","InvNo":"T002","amount":200}
{"UpdatedDate":"2017-10-02 02:00:00","InvNo":"T002","amount":300}
{"UpdatedDate":"2017-10-04 01:00:00","InvNo":"T003","amount":400}
{"UpdatedDate":"2017-10-05 01:00:00","InvNo":"T002","amount":500}
{"UpdatedDate":"2017-10-05 02:00:00","InvNo":"T003","amount":500}

Solution

  • Below is for BigQuery Standard SQL

    #standardSQL
    WITH dates AS (
      SELECT DISTINCT DATE(UpdatedDate) UpdatedDay
      FROM `project.dataset.test`
    ),
    qualified AS (
      SELECT DATE(UpdatedDate) UpdatedDay, InvNo, ARRAY_AGG(amount ORDER BY UpdatedDate DESC LIMIT 1)[SAFE_OFFSET(0)] amount
      FROM `project.dataset.test`
      GROUP BY UpdatedDay, InvNo
    )
    SELECT UpdatedDay, SUM(amount) amount
    FROM (
      SELECT d.UpdatedDay UpdatedDay, InvNo, ARRAY_AGG(amount ORDER BY q.UpdatedDay DESC LIMIT 1)[SAFE_OFFSET(0)] amount
      FROM dates d
      JOIN qualified q
      ON q.UpdatedDay <= d.UpdatedDay
      GROUP BY UpdatedDay, InvNo
    )
    GROUP BY UpdatedDay
    -- ORDER BY UpdatedDay
    

    You can test / play with this with below dummy data from your question

    #standardSQL
    WITH `project.dataset.test` AS (
      SELECT TIMESTAMP '2017-10-01 01:00:00' UpdatedDate, 'T001' InvNo, 100 amount UNION ALL
      SELECT TIMESTAMP '2017-10-02 01:00:00', 'T002', 200 UNION ALL
      SELECT TIMESTAMP '2017-10-02 02:00:00', 'T002', 300 UNION ALL
      SELECT TIMESTAMP '2017-10-04 01:00:00', 'T003', 400 UNION ALL
      SELECT TIMESTAMP '2017-10-05 01:00:00', 'T002', 500 UNION ALL
      SELECT TIMESTAMP '2017-10-05 02:00:00', 'T003', 500 
    ),
    dates AS (
      SELECT DISTINCT DATE(UpdatedDate) UpdatedDay
      FROM `project.dataset.test`
    ),
    qualified AS (
      SELECT DATE(UpdatedDate) UpdatedDay, InvNo, ARRAY_AGG(amount ORDER BY UpdatedDate DESC LIMIT 1)[SAFE_OFFSET(0)] amount
      FROM `project.dataset.test`
      GROUP BY UpdatedDay, InvNo
    )
    SELECT UpdatedDay, SUM(amount) amount
    FROM (
      SELECT d.UpdatedDay UpdatedDay, InvNo, ARRAY_AGG(amount ORDER BY q.UpdatedDay DESC LIMIT 1)[SAFE_OFFSET(0)] amount
      FROM dates d
      JOIN qualified q
      ON q.UpdatedDay <= d.UpdatedDay
      GROUP BY UpdatedDay, InvNo
    )
    GROUP BY UpdatedDay
    ORDER BY UpdatedDay
    

    Result is as expected

    UpdatedDay  amount   
    2017-10-01   100     
    2017-10-02   400     
    2017-10-04   800     
    2017-10-05  1100