Search code examples
google-bigquerybigquery-udf

BigQuery: Count consecutive string matches between two fields


I have two tables:

  1. Master_Equipment_Index (alias mei) containing the columns serial_num & model_num
  2. Customer Equipment Index (alias cei) containing the columns account_num, serial_num, & model_num

Originally, guard rails were not implemented to require model attribute input in the mei data whenever new serial_num records were inserted. Whenever that serial_num is later associated with a customer account in the cei data, the model data carries over as null.

What I want to do is backfill the missing model attributes in the cei data from the mei data based on the strongest sequential character match from other similar serial_nums in the mei data.

To further clarify, I don't have access to mass update the mei or cei datasets. I can formalize change requests, but I need to build the function out to prove its worth. So this has to be done outside of any mass action query updates.

cei.account_num cei.serial_num cei.model mei.serial_num mei.model serial_num_str_match row_number
123123123 B4I4SXT1708 null B4I4SXT178A Model_Series1 8 1
123123123 B4I4SXT1708 null B4I4SXTAS34 Model_Series2 7 2

In the table example above row_number 1 has a higher consecutive string match count than row_number 2. I want to only return row_number 1 and populate cei.model with mei.model's value.

cei.account_num cei.serial_num cei.model mei.serial_num mei.model serial_num_str_match row_number
123123123 B4I4SXT1708 Model_Series1 B4I4SXT178A Model_Series1 8 1

To give an idea as to scale:
The mei data contains 1 million records and the cei data contains 50,000 records. I would have to take and perform this string match for every single cei.account_num, cei.serial_num where the cei.model data is null.

With mac addresses, the first 6 characters identify the vendor and I could look at things similarly in the sample SQL below to help reduce the volume of transactional 1:Many lookups taking place:

/* need to define function */
create temp function string_match_function(x any type, y any type) as (
  syntax to generate consecutive string count matches between x and y
);

select * from (
select
  c.account_num,
  c.serial_num,
  m.model,
  row_number() over(partition by c.account_num, c.serial_num order by serial_num_str_match desc) seq
from (
     select 
       c.account_num, 
       c.serial_num, 
       m.model,

       needed: string_match_function(c.serial_num, m.serial_num) as serial_num_str_match

     from (
          select * from cei where model is null
          ) c
     join (
          select * from mei where model is not null
          ) m on substr(c.serial_num,1,6) = substr(m.serial_num,1,6)
) as a
) as b
where seq = 1

I've looked at different options, some coming from https://hoffa.medium.com/new-in-bigquery-persistent-udfs-c9ea4100fd83, but I'm not finding what I need.

Any insight or direction would be greatly appreciated.


Solution

  • This UDF function counts the equal charachters in each string from the begin:

    CREATE TEMP FUNCTION string_match_function(x string, y string)
    RETURNS int64
    LANGUAGE js
    AS r"""
      var i=0;
      var max_len= Math.min(x.length,y.length);
        for(i=0;i<max_len;i++){
            if(x[i]!=y[i]) {return i;}
        }
      return i;
    """;
    
    select string_match_function("12a345","1234")
    

    gives 2, because both start with 12