In BigQuery, can I join two tables using contains/not-contains conditions with wildcards?


I am trying to join two tables where values from the first table need to be included in values in one field from the second table but excluded from values in a different field. All fields are non-array strings.

As an example, I have table1 that I want to join to table2 on the condition that table1.field1 matches table2.field1yes but does not match table2.field1no.

table1 has rows that should join with rows in table2 as follows:

  • table1.field1 = '123100'
    • table2.field1yes = '___100,___200'
    • table2.field1no = NULL
  • table1.field1 = '321300'
    • table2.field1yes = '%'
    • table2.field1no = '___100,___200'

Here _ is a single-character wildcard and % is a wildcard for zero or more characters.

Written in code, I've tried to do the following to join the tables in the example above:

WITH
  table1  AS (SELECT '123100' field1 UNION ALL
              SELECT '321300'),
  table2  AS (SELECT 'data1' data, 'r\'^([\\d]+100|[\\d]+200)$\'' field1yes, '' field1no UNION ALL
              SELECT 'data2',      'r\'^([0-9A-Za-z])$\'',                   'r\'^([\\d]+100|[\\d]+200)$\'')

SELECT
  *
FROM table1
JOIN table2
ON 0=0
  AND     REGEXP_CONTAINS(table1.field1, table2.field1yes)
  AND NOT REGEXP_CONTAINS(table1.field1, table2.field1no)

But it returns no results, even though the following returns the first row from table1, as expected:

SELECT
  *
FROM table1
WHERE 0=0
  AND     REGEXP_CONTAINS(table1.field1, r'^([\d]+100|[\d]+200)$')

and the following returns the second row from table1:

SELECT
  *
FROM table1
WHERE 0=0
  AND NOT REGEXP_CONTAINS(table1.field1, r'^([\d]+100|[\d]+200)$')


Is this the right approach? If so, what tweaks do I need to make to my query to return results? If not, is there a different solution to match these two tables?


Solution

  • Below is for BigQuery Standard SQL

    Is this the right approach?

    Yes, you can store your regular expressions in a column and then pass that column to the REGEXP functions.
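
    If the patterns start out in the LIKE-style notation from the question (_ and %), one way to get them into regexp form is a few nested REPLACE calls. This is only a sketch: the column name field1yes_regex is made up for illustration, and it assumes the stored patterns contain no regex metacharacters other than the two wildcards and the comma separator.

    ```sql
    #standardSQL
    -- Sketch: turn a comma-separated list of LIKE-style patterns into one anchored regexp.
    -- Assumption: patterns contain only literal characters, '_', '%' and ','.
    WITH table2 AS (
      SELECT 'data1' AS data, '___100,___200' AS field1yes UNION ALL
      SELECT 'data2', '%'
    )
    SELECT
      data,
      field1yes,
      CONCAT(
        '^(',
        REPLACE(                         -- ',' separates alternatives -> '|'
          REPLACE(                       -- '_' is one character       -> '.'
            REPLACE(field1yes, '%', '.*'),  -- '%' is zero or more     -> '.*'
            '_', '.'),
          ',', '|'),
        ')$'
      ) AS field1yes_regex               -- hypothetical output column
    FROM table2
    ```

    For the two sample rows this yields ^(...100|...200)$ and ^(.*)$. Note that . accepts any character, which is looser than the \d and [0-9A-Za-z] classes used in the question.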

    what tweaks do I need to make to my query to return results?

    I hope the query below makes clear what was "wrong" in your original query. I mostly mean the CTE for table2, specifically how field1yes and field1no are written.

    #standardSQL
    WITH table1  AS (
      SELECT '123100' field1 UNION ALL
      SELECT '321300'
    ), table2  AS (
      SELECT 'data1' data, r'^([\d]+100|[\d]+200)$' field1yes, '' field1no UNION ALL
      SELECT 'data2',      r'^([0-9A-Za-z])$',                   r'^([\d]+100|[\d]+200)$'
    )
    SELECT *
    FROM table1
    JOIN table2
    ON REGEXP_CONTAINS(table1.field1, table2.field1yes)
    OR NOT REGEXP_CONTAINS(table1.field1, table2.field1no)    
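
    To see the difference in isolation, the following sketch evaluates the pattern once the way it was written in the question (a regular string literal whose content includes the r and the quotes) and once as a raw string literal. It also checks the empty pattern used for field1no, which matches any input; that is likely relevant to the choice between AND and OR in the join condition. The column aliases are made up for illustration.

    ```sql
    #standardSQL
    SELECT
      -- The stored pattern text is literally r'^([\d]+100|[\d]+200)$' including
      -- the leading r and both quotes, so it cannot match '123100'.
      REGEXP_CONTAINS('123100', 'r\'^([\\d]+100|[\\d]+200)$\'') AS quoted_inside_string,  -- false
      -- A raw string literal stores just the regexp itself.
      REGEXP_CONTAINS('123100', r'^([\d]+100|[\d]+200)$')       AS raw_literal,           -- true
      -- An empty pattern matches everything, so NOT REGEXP_CONTAINS(x, '') is always false.
      REGEXP_CONTAINS('123100', '')                             AS empty_pattern          -- true
    ```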
    

    Note: I focused on fixing the main issue with your initial query, which was how the regular expressions were represented, and assumed that the rest (join logic, etc.) is up to you to tune.
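
    As one possible way to tune the join logic, here is a sketch that keeps the AND from the question. It rests on two assumptions that are not in the original data: "no exclusion" is stored as NULL instead of an empty string, and the % wildcard is represented as .* instead of ^([0-9A-Za-z])$, which only matches a single character.

    ```sql
    #standardSQL
    WITH table1 AS (
      SELECT '123100' AS field1 UNION ALL
      SELECT '321300'
    ), table2 AS (
      -- Assumption: NULL in field1no means "nothing is excluded".
      SELECT 'data1' AS data, r'^(\d+100|\d+200)$' AS field1yes, CAST(NULL AS STRING) AS field1no UNION ALL
      -- Assumption: '.*' stands in for the '%' wildcard.
      SELECT 'data2', r'.*', r'^(\d+100|\d+200)$'
    )
    SELECT *
    FROM table1
    JOIN table2
    ON REGEXP_CONTAINS(table1.field1, table2.field1yes)
    -- REGEXP_CONTAINS returns NULL for a NULL pattern; IFNULL turns that into
    -- FALSE so that NOT yields TRUE and the row is kept.
    AND NOT IFNULL(REGEXP_CONTAINS(table1.field1, table2.field1no), FALSE)
    ```

    With the sample rows, '123100' joins only to data1 and '321300' joins only to data2, which is the pairing described in the question.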