Search code examples
sqlgoogle-bigquerycountgaps-and-islandsgaps-in-data

Count number of null rows along with the first non null row and reset the count after that


HOUR Account_id media_id impressions
2022-11-04 04:00:00 UTC 256789 35 null
2022-11-04 05:00:00 UTC 256789 35 null
2022-11-04 06:00:00 UTC 256789 35 null
2022-11-04 07:00:00 UTC 256789 35 null
2022-11-04 08:00:00 UTC 256789 35 40
2022-11-04 09:00:00 UTC 256789 35 7
2022-11-04 10:00:00 UTC 256789 35 null
2022-11-04 11:00:00 UTC 256789 35 10
2022-11-04 12:00:00 UTC 256789 35 12

What we are trying to do is that when the impressions is count is null for an hour, then we take the value from the impressions where it is not null and then split the number evenly across the previously consecutive null rows and the first non null row.

If we take the row where the impressions count is 40 in the above 4 rows the impressions is null so including the row where the impressions is 40 makes the count as 5, then we divide 40 by 5 hence each hour gets 8 impressions.

The same above logic can be applied to the row where the impressions count is 10. It is distributed between 2 rows evenly hence in the output it is 5 impressions for each hour.

Here HOUR column is an increment of one hour with no gaps in between.

The query looks like this:

select *,
      case when impressions is null then row_number() over(partition by media_id,ACCOUNT_ID ORDER BY HOUR) else 0 end as rn1,
from table_name order by 1 ;

How I take it from there?

Expected Output:

HOUR Account_id media_id impressions distributed_impressions
2022-11-04 04:00:00 UTC 256789 35 null 8
2022-11-04 05:00:00 UTC 256789 35 null 8
2022-11-04 06:00:00 UTC 256789 35 null 8
2022-11-04 07:00:00 UTC 256789 35 null 8
2022-11-04 08:00:00 UTC 256789 35 40 8
2022-11-04 09:00:00 UTC 256789 35 7 7
2022-11-04 10:00:00 UTC 256789 35 null 5
2022-11-04 11:00:00 UTC 256789 35 10 5
2022-11-04 12:00:00 UTC 256789 35 12 12

Solution

  • Consider below query.

    SELECT * EXCEPT(part),
           MAX(impressions) OVER w1 / COUNT(*) OVER W1 AS distributed_impressions 
      FROM (
        SELECT *, COUNT(*) OVER w0 - COUNTIF(impressions IS NULL) OVER w0 AS part
          FROM sample_table
        WINDOW w0 AS (PARTITION BY Account_id, media_id ORDER BY HOUR ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
      ) WINDOW w1 AS (PARTITION BY Account_id, media_id, part);
    

    Query results

    enter image description here

    How each island is identified

    enter image description here