I have a performance issue in BigQuery while using LEFT JOIN. My goal is to switch words that are miss labeled.
e.g: someone wrote a label of bigaplle
. But its actually should be big apple
.
I join all the possible keys of bigaplle - big apple
options and then try to switch the "bad" values.
My problem is that the query is really slow, it takes hours and fails which is not regular in BigQuery.
My guess is that BigQuery does not know how to optimize
REGEXP_CONTAINS(cat.name, r"(?i:\b" || shortned_cat_map.shortned || r"\b)")
Because if I use:
cat.name = shortned_cat_map.shortned
It is quite fast but does not solve cases like red bigapple
I tried to use LIKE
instead of REGEX
and it's not better.
The labels column is an array of STRUCT
as: [{'name': my_label}, ...]
The src
table (after UNNEST
) contains 223M rows.
The shortned_cat_map
table contains 600K rows
WITH src AS (SELECT * FROM `my_table` ),
cat_src AS (SELECT
DISTINCT(cat.name),
FROM
src, UNNEST(labels) cat),
shortned_cat AS (SELECT
name,
REPLACE(name, ' ', '') shortned
FROM
cat_src
GROUP BY
1, 2
),
shortned_cat_map AS (SELECT
shortned_cat.name,
shortned_cat.shortned
FROM
cat_src
JOIN
shortned_cat
ON
LOWER(cat_src.name) = LOWER(shortned_cat.shortned)
GROUP BY
1, 2)
SELECT
id,
ARRAY_AGG(STRUCT(CASE WHEN shortned_cat_map.shortned IS NOT NULL THEN REGEXP_REPLACE(cat.name, r"(?i:\b" || shortned_cat_map.shortned || r"\b)", shortned_cat_map.name) ELSE cat.name END AS name)) AS categories
FROM
src, UNNEST(labels) cat
LEFT JOIN
shortned_cat_map
ON
REGEXP_CONTAINS(cat.name, r"(?i:\b" || shortned_cat_map.shortned || r"\b)")
GROUP BY
id
BigQuery step the got stuck:
I found the answer. For Anyone looking for JOIN using REGEX. It seems BQ looks for another parameter to hook on otherwise he is "lost" and has to scan all the rows from the JOIN table.
My trick was to add the id
column aggregate until the JOIN where its UNNEST.
I don't need for the JOIN logic I need it for BQ to be able to JOIN fast
WITH src AS (SELECT * FROM `my_table` ),
cat_src AS (SELECT
cat.name,
ARRAY_AGG(DISTINCT id) AS ids
FROM
src, UNNEST(labels) cat
GROUP BY
1),
shortned_cat AS (SELECT
name,
REPLACE(name, ' ', '') shortned,
ids,
FROM
cat_src
GROUP BY
1, 2
),
shortned_cat_map AS (SELECT
shortned_cat.name,
shortned_cat.shortned,
shortned_cat.ids
FROM
cat_src
JOIN
shortned_cat
ON
LOWER(cat_src.name) = LOWER(shortned_cat.shortned)
GROUP BY
1, 2),
shortned_cat_map_ids AS (SELECT
shortned,
name,
id
FROM
shortned_cat_map, UNNEST(ids) id)
SELECT
id,
ARRAY_AGG(STRUCT(CASE WHEN shortned_cat_map_ids.shortned IS NOT NULL THEN REGEXP_REPLACE(cat.name, r"(?i:\b" || shortned_cat_map_ids.shortned || r"\b)", shortned_cat_map_ids.name) ELSE cat.name END AS name)) AS categories
FROM
src, UNNEST(labels) cat
LEFT JOIN
shortned_cat_map_ids
ON
REGEXP_CONTAINS(cat.name, r"(?i:\b" || shortned_cat_map_ids.shortned || r"\b)")
AND shortned_cat_map_ids.id = src.id
GROUP BY
id