I am working with uri in bigquery, however, some of our uri have utm parameters in chinese and I am unable to decode them. I tried user defined function in Bigquery but it didn't work out. Here is the sample udf I tried
DECLARE uri STRING;
SET uri = 'https://www.random.cn/services/modifcation?utm_medium=cpc&utm_source=baidu&utm_term=%2525E8%2525AE%2525BA%2525E6%252596%252587%2525E6%25259C%25259F%2525E5%252588%25258A%2525E6%25258A%252595%2525E7%2525A8%2525BF';
CREATE TEMP FUNCTION DecodeKeywords(encodedKeyword STRING) RETURNS STRING LANGUAGE js AS R"""
try {
return decodeURIComponent(encodedKeyword);
} catch (error) {
// Handle the error gracefully
return encodedKeyword ;
}
""";
select uri, DecodeKeywords(uri) as decoded_uri
I also, tried the same with urllib.parse.unquote in python ( solution generated by chatGPT)
import urllib.parse
# Example URL with encoded characters
url = "https://www.random.cn/services/modifcation?utm_medium=cpc&utm_source=baidu&utm_term=%2525E8%2525AE%2525BA%2525E6%252596%252587%2525E6%25259C%25259F%2525E5%252588%25258A%2525E6%25258A%252595%2525E7%2525A8%2525BF"
try:
# Decode the URL with utf-8 encoding (common for web)
decoded_url = urllib.parse.unquote(url, encoding='utf-8')
except UnicodeDecodeError:
# If decoding with utf-8 fails, try using latin-1 encoding (fallback)
decoded_url = urllib.parse.unquote(url, encoding='latin-1')
# Print the decoded URL
print(decoded_url)
But I am unable to translate it into proper keyword, however if I paste the utm_term(%2525E8%2525AE%2525BA%2525E6%252596%252587%2525E6%25259C%25259F%2525E5%252588%25258A%2525E6%25258A%252595%2525E7%2525A8%2525BF
) into chatGPT it shows me the output("设施周期开始"
)
I tried both User Defined function in BigQuery as well as urllib.parse.unquote in python.
Found the answer, the term is decoded multiple times, hence has to do 3 layer of decoding to get the Chinese term.
DECLARE uri STRING; SET uri = 'https://www.random.cn/services/modifcation?utm_medium=cpc&utm_source=baidu&utm_term=%2525E8%2525AE%2525BA%2525E6%252596%252587%2525E6%25259C%25259F%2525E5%252588%25258A%2525E6%25258A%252595%2525E7%2525A8%2525BF';
CREATE TEMP FUNCTION DecodeKeywords(encodedKeyword STRING) RETURNS STRING LANGUAGE js AS R""" try {
return decodeURIComponent(encodedKeyword); } catch (error) {
return encodedKeyword ; } """;
select uri, DecodeKeywords(DecodeKeywords(DecodeKeywords(uri))) as decoded_uri