Search code examples
regexgoogle-bigqueryre2

BigQuery - Count non-nulls across columns where the column name matches regex patterns


Very new to BQ so please be patient.

I have a table with various columns (in the example they are col1 to col4 for brevity) and I have a regex that determines which column names are to be grouped together (example - ac_v\d+_final_p\w+). What I want to do is identify how many times non-null values appear within the grouping of columns for a specific row. From research, I was able to craft the attached query, however, this obviously only returns the count across the entire table and not the rows associated with the defined regex as desired.

Simplified Data Structure:

key col1 col2 col3 col4 lol1
2 0.0025 null null null null
3 0.0015 null 0.0005 null null
1 null null null 0.000 0.3

Desired Outcome: I want to count only those columns with the col name prefix.

key count_non_nulls count_nulls
1 1 3
2 1 3
3 2 2

Is there a way to achieve this in BQ Standard SQL?

Thank you in anticipation of your help.

BEGIN
#standardSQL
CREATE TEMP TABLE `mytable` AS (
  SELECT 1 AS key, null AS col1, null AS col2, null AS col3, 0.0001 AS col4 UNION ALL
  SELECT 2, 0.0025, null, null, null UNION ALL
  SELECT 3, 0.0015,  null, 0.0005, null
)
;

SELECT 
  COUNTIF(value not in  ('null', '')) AS count_non_nulls, 
  COUNTIF(value in  ('null', '')) AS count_nulls, 
  COUNT(value) count_all
FROM `mytable` t, 
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'":(.*?)(?:,"|})')) value
;

END

Solution

  • Consider below approach

    select key, 
      (
        select as struct 
          countif(column_value != 'null') as count_non_nulls,
          countif(column_value = 'null') as count_nulls
        from unnest(split(translate(to_json_string(t), '{}"', ''))) kv,
        unnest([struct(split(kv, ':')[offset(0)] as column_name, split(kv, ':')[offset(1)] as column_value)])
        where column_name != 'key'
        and starts_with(column_name, 'col')
      ).*
    from `project.dataset.table` t  
    

    if applied to sample data in your question - output is

    enter image description here

    Note: if you need to use whatever regex you have - you can use it instead of below line

    starts_with(column_name, 'col')