Search code examples
sqlgoogle-bigqueryquery-optimization

Optimise BigQuery query joining with regex


I have a performance issue in BigQuery while using LEFT JOIN. My goal is to switch words that are miss labeled.

e.g: someone wrote a label of bigaplle. But its actually should be big apple.

I join all the possible keys of bigaplle - big apple options and then try to switch the "bad" values.

My problem is that the query is really slow, it takes hours and fails which is not regular in BigQuery.

My guess is that BigQuery does not know how to optimize

REGEXP_CONTAINS(cat.name, r"(?i:\b" || shortned_cat_map.shortned || r"\b)") 

Because if I use:

cat.name = shortned_cat_map.shortned

It is quite fast but does not solve cases like red bigapple

I tried to use LIKE instead of REGEX and it's not better.

The labels column is an array of STRUCT as: [{'name': my_label}, ...]

The src table (after UNNEST) contains 223M rows. The shortned_cat_map table contains 600K rows

    WITH src AS (SELECT * FROM `my_table`  ),

    cat_src AS (SELECT
        DISTINCT(cat.name),
    FROM  
        src, UNNEST(labels) cat),


   shortned_cat AS (SELECT
            name,
            REPLACE(name, ' ', '') shortned
        FROM  
            cat_src
        GROUP BY
            1, 2
    ),

    shortned_cat_map AS (SELECT
        shortned_cat.name,
        shortned_cat.shortned
    FROM
        cat_src
    JOIN
        shortned_cat
    ON
        LOWER(cat_src.name) = LOWER(shortned_cat.shortned)
    GROUP BY
        1, 2)
        
    SELECT
        id,
        ARRAY_AGG(STRUCT(CASE WHEN shortned_cat_map.shortned IS NOT NULL THEN REGEXP_REPLACE(cat.name,  r"(?i:\b" || shortned_cat_map.shortned || r"\b)", shortned_cat_map.name) ELSE cat.name END AS name)) AS categories
    FROM 
        src, UNNEST(labels) cat
    LEFT JOIN
        shortned_cat_map
    ON
        REGEXP_CONTAINS(cat.name, r"(?i:\b" || shortned_cat_map.shortned || r"\b)") 
    GROUP BY
       id

BigQuery step the got stuck:

enter image description here


Solution

  • I found the answer. For Anyone looking for JOIN using REGEX. It seems BQ looks for another parameter to hook on otherwise he is "lost" and has to scan all the rows from the JOIN table.

    My trick was to add the id column aggregate until the JOIN where its UNNEST.

    I don't need for the JOIN logic I need it for BQ to be able to JOIN fast

     WITH src AS (SELECT * FROM `my_table`  ),
    
        cat_src AS (SELECT
            cat.name,
            ARRAY_AGG(DISTINCT id) AS ids
        FROM  
            src, UNNEST(labels) cat
        GROUP BY 
          1),
    
    
       shortned_cat AS (SELECT
                name,
                REPLACE(name, ' ', '') shortned,
                ids,
            FROM  
                cat_src
            GROUP BY
                1, 2
        ),
    
        shortned_cat_map AS (SELECT
            shortned_cat.name,
            shortned_cat.shortned,
            shortned_cat.ids
        FROM
            cat_src
        JOIN
            shortned_cat
        ON
            LOWER(cat_src.name) = LOWER(shortned_cat.shortned)
        GROUP BY
            1, 2),
            
        shortned_cat_map_ids AS (SELECT
            shortned,
            name,
            id
        FROM
            shortned_cat_map, UNNEST(ids) id)
    
    
        SELECT
            id,
            ARRAY_AGG(STRUCT(CASE WHEN shortned_cat_map_ids.shortned IS NOT NULL THEN REGEXP_REPLACE(cat.name,  r"(?i:\b" || shortned_cat_map_ids.shortned || r"\b)", shortned_cat_map_ids.name) ELSE cat.name END AS name)) AS categories
        FROM 
            src, UNNEST(labels) cat
        LEFT JOIN
            shortned_cat_map_ids
        ON
            REGEXP_CONTAINS(cat.name, r"(?i:\b" || shortned_cat_map_ids.shortned || r"\b)") 
            AND shortned_cat_map_ids.id = src.id
        GROUP BY
           id