Search code examples
sqlgoogle-bigquerydata-partitioning

BigQuery: iterating groups within a window of 28days before a start_date column using _TABLE_SUFFIX


I got a table like this:

group_id start_date end_date
19335 20220613 20220714
19527 20220620 20220719
19339 20220614 20220720
19436 20220616 20220715
20095 20220711 20220809

I am trying to retrieve data from another table that is partitioned, and data should be access with _TABLE_SUFFIX BETWEEN start_date AND end_date.

Each group_id contains different user_id within the period [start_date, end_date]. What I need is to retrieve data of users of a column/metric of the last 28D prior to the start_date of each group_id.

My idea is to:

  1. Retrieve distinct user_id per group_id within the period [start_date, end_date]
  2. Retrieve previous 28d metric data prior to the start date of each group_id

A snippet code on how to retrieve data from a single group_id is the following:

 WITH users_per_group AS (
  SELECT 
    users_metadata.user_id, 
    users_metadata.group_id, 
  FROM 
    `my_table_users_*` users_metadata
  WHERE 
    _TABLE_SUFFIX BETWEEN '20220314' --start_date
    AND '20220413' --end_date
    AND experiment_id = 16709 
  GROUP BY 
    1, 
    2
) 

SELECT 
  _TABLE_SUFFIX AS date, 
  user_id, 
  SUM(
    COALESCE(metric, 0)
  ) AS metric, 
FROM 
  users_per_group 
  JOIN `my_metric_table*` metric USING (user_id) 
WHERE 
  _TABLE_SUFFIX BETWEEN FORMAT_TIMESTAMP(
    '%Y%m%d', 
    TIMESTAMP_SUB(
      PARSE_TIMESTAMP('%Y%m%d', '20220314'), --start_date
      INTERVAL 28 DAY
    )
  ) -- 28 days before it starts
  AND FORMAT_TIMESTAMP(
    '%Y%m%d', 
    TIMESTAMP_SUB(
      PARSE_TIMESTAMP('%Y%m%d', '20220314'), --start_date
      INTERVAL 1 DAY
    )
  ) -- 1 day before it starts
GROUP BY 
  1, 
  2
ORDER BY 
  date ASC

Also, I want to avoid retrieving all data (considering all dates) from that metric, as the table is huge and it will take very long time to retrieve it.

Is there an easy way to retrieve the metric data of each user across groups and considering the previous 28 days to the start data of each group_id?


Solution

  • I can think of 2 approaches.

    1. Join all the tables and then perform your query.
    2. Create dynamic queries for each of your users.

    Both approaches will require search_from and search_to to be available beforehand i.e you need to calculate each user's search range before you do anything.
    EG:

    WITH users_per_group AS (
      SELECT 
        user_id, group_id
       ,DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY)search_from
       ,DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY)search_to
      FROM TableName
    ) 
    

    Once you have this kind of table then you can use any of the mentioned approaches.

    Since I don't have your data and don't know about your table names I am giving an example using a public dataset.
    Approach 1

    -- consider this your main table which contains user,grp,start_date,end_date
    with  maintable as (
      select 'India' visit_from,  '20161115' as start_date, '20161202' end_date
      union all select 'Sweden'  , '20161201', '20161202' 
    ),
    --then calculate  search from-to date for every user and group
    user_per_grp as( 
    select *, DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY)search_from  --change interval as per your need
    ,DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY)search_to
    from maintable 
    )
    
    select visit_from,_TABLE_SUFFIX date,count(visitId) total_visits  from
    user_per_grp ug 
    left join `bigquery-public-data.google_analytics_sample.ga_sessions_*` as pub on pub.geoNetwork.country = ug.visit_from
    where _TABLE_SUFFIX between format_date("%Y%m%d",ug.search_from) and format_date("%Y%m%d",ug.search_to)
    group by 1,2
    

    Approach 2

    declare queries array<string> default [];
    create temp table  maintable as (
      select 'India' visit_from,  '20161115' as start_date, '20161202' end_date
      union all select 'Sweden'  , '20161201', '20161202' 
    );
    create temp table user_per_grp as( 
    select *, DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY)search_from
    ,DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY)search_to
    from maintable 
    );
    
    -- for each user create a seperate query here 
    FOR record IN (SELECT * from user_per_grp)
    DO
      set queries =  queries || [format('select "%s" Visit_From,_TABLE_SUFFIX Date,count(visitId) total_visits  from `bigquery-public-data.google_analytics_sample.ga_sessions_*` where _TABLE_SUFFIX between format_date("%%Y%%m%%d","%t") and format_date("%%Y%%m%%d","%t") and geoNetwork.country="%s" group by 1,2',record.visit_from,record.search_from,record.search_to,record.visit_from)];
      --replace your query here. 
    END FOR;
    
    --aggregating all the queries and executing it
    execute immediate (select string_agg(query, ' union all ') from unnest(queries) query);
    

    Here the 2nd approach processed much less data(~750 KB) than the 1st approach(~17 MB). But that might not be the same for your dataset as the date range may overlap for 2 users and that will lead to reading the same table twice.