Optimise BigQuery query joining with regex

I have a performance issue in BigQuery while using LEFT JOIN. My goal is to switch words that are miss labeled.

e.g: someone wrote a label of bigaplle. But its actually should be big apple.

I join all the possible keys of bigaplle - big apple options and then try to switch the "bad" values.

My problem is that the query is really slow, it takes hours and fails which is not regular in BigQuery.

My guess is that BigQuery does not know how to optimize

REGEXP_CONTAINS(cat.name, r"(?i:\b" || shortned_cat_map.shortned || r"\b)")

Because if I use:

cat.name = shortned_cat_map.shortned

It is quite fast but does not solve cases like red bigapple

I tried to use LIKE instead of REGEX and it's not better.

The labels column is an array of STRUCT as: [{'name': my_label}, ...]

The src table (after UNNEST) contains 223M rows. The shortned_cat_map table contains 600K rows

    WITH src AS (SELECT * FROM `my_table`  ),

    cat_src AS (SELECT
        DISTINCT(cat.name),
    FROM  
        src, UNNEST(labels) cat),


   shortned_cat AS (SELECT
            name,
            REPLACE(name, ' ', '') shortned
        FROM  
            cat_src
        GROUP BY
            1, 2
    ),

    shortned_cat_map AS (SELECT
        shortned_cat.name,
        shortned_cat.shortned
    FROM
        cat_src
    JOIN
        shortned_cat
    ON
        LOWER(cat_src.name) = LOWER(shortned_cat.shortned)
    GROUP BY
        1, 2)
        
    SELECT
        id,
        ARRAY_AGG(STRUCT(CASE WHEN shortned_cat_map.shortned IS NOT NULL THEN REGEXP_REPLACE(cat.name,  r"(?i:\b" || shortned_cat_map.shortned || r"\b)", shortned_cat_map.name) ELSE cat.name END AS name)) AS categories
    FROM 
        src, UNNEST(labels) cat
    LEFT JOIN
        shortned_cat_map
    ON
        REGEXP_CONTAINS(cat.name, r"(?i:\b" || shortned_cat_map.shortned || r"\b)") 
    GROUP BY
       id

BigQuery step the got stuck:

Solution

I found the answer. For Anyone looking for JOIN using REGEX. It seems BQ looks for another parameter to hook on otherwise he is "lost" and has to scan all the rows from the JOIN table.

My trick was to add the id column aggregate until the JOIN where its UNNEST.

I don't need for the JOIN logic I need it for BQ to be able to JOIN fast

 WITH src AS (SELECT * FROM `my_table`  ),

    cat_src AS (SELECT
        cat.name,
        ARRAY_AGG(DISTINCT id) AS ids
    FROM  
        src, UNNEST(labels) cat
    GROUP BY 
      1),


   shortned_cat AS (SELECT
            name,
            REPLACE(name, ' ', '') shortned,
            ids,
        FROM  
            cat_src
        GROUP BY
            1, 2
    ),

    shortned_cat_map AS (SELECT
        shortned_cat.name,
        shortned_cat.shortned,
        shortned_cat.ids
    FROM
        cat_src
    JOIN
        shortned_cat
    ON
        LOWER(cat_src.name) = LOWER(shortned_cat.shortned)
    GROUP BY
        1, 2),
        
    shortned_cat_map_ids AS (SELECT
        shortned,
        name,
        id
    FROM
        shortned_cat_map, UNNEST(ids) id)


    SELECT
        id,
        ARRAY_AGG(STRUCT(CASE WHEN shortned_cat_map_ids.shortned IS NOT NULL THEN REGEXP_REPLACE(cat.name,  r"(?i:\b" || shortned_cat_map_ids.shortned || r"\b)", shortned_cat_map_ids.name) ELSE cat.name END AS name)) AS categories
    FROM 
        src, UNNEST(labels) cat
    LEFT JOIN
        shortned_cat_map_ids
    ON
        REGEXP_CONTAINS(cat.name, r"(?i:\b" || shortned_cat_map_ids.shortned || r"\b)") 
        AND shortned_cat_map_ids.id = src.id
    GROUP BY
       id