Search code examples
sqlgoogle-bigquerycumulative-sum

How to find the number of occurrences of sub-strings in another string in BigQuery with conditionals as a cumulative sum?


So my SQL/BiqQuery skills still need some refining and I have been stuck on this for some time now. I got a table ST that marks the planned mods to some applications with their starting date, these mods only apply to those with some specific specs:

Mod_ID App_ID Specs Start_Date
P1 App1 S1 2022-01-24
P1 App2 S2 2022-01-24
P2 App2 S3 S4 2022-03-15
P3 App3 S4 S5 2022-04-10

I got another table AQ that includes the production date for some devices, with their corresponding app and their specifications in the form of a string:

Device_ID App_ID Spec_string Prod_Date
ID0001 App1 S1 S4 S6 T0 U7 2022-02-03
ID0002 App2 S2 S5 U9 2022-02-05
ID0003 App2 S1 S2 S3 2022-03-12
ID0004 App3 S5 S6 T0 U7 2022-04-18

I want to count the occurrences of the different spec mods for a particuar device that apply to its App_ID and only for those that AQ.Prod_Date>=ST.Start_Date.

I first unnested table ST (ST_UNN) and ended with something like this:

Mod_ID App_ID Specs_ind Start_Date
P1 App1 S1 2022-01-24
P1 App2 S2 2022-01-24
P2 App2 S3 2022-03-15
P2 App2 S4 2022-03-15
P3 App3 S4 2022-04-10
P3 App3 S5 2022-04-10

and tried the following:

SELECT
  a.* EXCEPT(Spec_String),
  SUM(
    CASE
      WHEN s.App_ID = a.App_ID AND s.Start_Date <= a.Prod_Date THEN ARRAY_LENGTH(REGEXP_EXTRACT_ALL(Spec_String, Specs_ind))
    ELSE
    0
  END
    ) OVER(PARTITION BY s.Specs_ind ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS num_changes,
FROM
  AQ AS a
LEFT JOIN
  ST_UNN AS s
ON
  a.App_ID  = s.App_ID 

But it didn't work as intended. I expect something like this:

Device_ID App_ID Spec_string Prod_Date num_changes
ID0001 App1 S1 S4 S6 T0 U7 2022-02-03 1
ID0002 App2 S2 S5 U9 2022-02-05 1
ID0003 App2 S1 S3 U7 2022-03-12 0
ID0004 App3 S5 S6 T0 U7 2022-04-18 1

But I am rather lost on how to achieve this last part as a cumulative sum, so that a device will count all mod implementations that have happened before its production date and that apply based on its specifications.

Any feedback is greatly appreciated!


Solution

  • If I am understanding this correctly, you can simplify a little and just group by device_id, etc. instead of using a window function. I modified your input slightly just to show that this was working as intended

    WITH
      st AS (
      SELECT
        'P1' AS mod_id,
        'App1' AS app_id,
        'S1 S4' AS specs,
        '2022-01-24' AS start_date
      UNION ALL
      SELECT
        'P1',
        'App2',
        'S2',
        '2022-01-24'
      UNION ALL
      SELECT
        'P2',
        'App2',
        'S3 S4',
        '2022-03-15'
      UNION ALL
      SELECT
        'P3',
        'App3',
        'S4 S5',
        '2022-04-10'
      UNION ALL
      SELECT
        'P4',
        'App3',
        'S4 S5',
        '2022-04-11' ),
      st_unn AS (
      SELECT
        * EXCEPT(specs)
      FROM
        st,
        UNNEST(SPLIT(specs, ' ')) AS spec ),
      aq AS (
      SELECT
        -- should be 2
        'ID0001' AS device_id,
        'App1' AS app_id,
        'S1 S4 S6 T0 U7' AS spec_string,
        '2022-02-03' AS prod_date
      UNION ALL
      SELECT
        -- should be 1
        'ID0002',
        'App2',
        'S2 S5 U9',
        '2022-02-05'
      UNION ALL
      SELECT
      -- should be 0
        'ID0003',
        'App2',
        'S2 S5 U9',
        '2021-03-12'
      UNION ALL
      SELECT
        -- should be 3
        'ID0004',
        'App3',
        'S4 S5 S6 T0 U7',
        '2022-04-18' )
    SELECT
      device_id,
      aq.app_id,
      spec_string,
      prod_date,
      SUM(COALESCE(ARRAY_LENGTH(REGEXP_EXTRACT_ALL(spec_string, spec)), 0))
    FROM
      aq
    LEFT JOIN
      st_unn st
    ON
      aq.prod_date >= st.start_date
      AND aq.app_id = st.app_id
    GROUP BY
      device_id,
      aq.app_id,
      spec_string,
      prod_date