Search code examples
python-3.xurlgoogle-bigquerybigquery-udf

Decoding url (Unicode to UTF-8) in Bigquery /Python


I am working with uri in bigquery, however, some of our uri have utm parameters in chinese and I am unable to decode them. I tried user defined function in Bigquery but it didn't work out. Here is the sample udf I tried

DECLARE uri STRING;
SET uri = 'https://www.random.cn/services/modifcation?utm_medium=cpc&utm_source=baidu&utm_term=%2525E8%2525AE%2525BA%2525E6%252596%252587%2525E6%25259C%25259F%2525E5%252588%25258A%2525E6%25258A%252595%2525E7%2525A8%2525BF';


CREATE TEMP FUNCTION DecodeKeywords(encodedKeyword STRING) RETURNS STRING LANGUAGE js AS R"""
try {
    return decodeURIComponent(encodedKeyword);
  } catch (error) {
    // Handle the error gracefully
    return encodedKeyword ;
  }
""";

select uri, DecodeKeywords(uri) as decoded_uri  

I also, tried the same with urllib.parse.unquote in python ( solution generated by chatGPT)

import urllib.parse

# Example URL with encoded characters
url = "https://www.random.cn/services/modifcation?utm_medium=cpc&utm_source=baidu&utm_term=%2525E8%2525AE%2525BA%2525E6%252596%252587%2525E6%25259C%25259F%2525E5%252588%25258A%2525E6%25258A%252595%2525E7%2525A8%2525BF"

try:
  # Decode the URL with utf-8 encoding (common for web)
  decoded_url = urllib.parse.unquote(url, encoding='utf-8')
except UnicodeDecodeError:
  # If decoding with utf-8 fails, try using latin-1 encoding (fallback)
  decoded_url = urllib.parse.unquote(url, encoding='latin-1')

# Print the decoded URL
print(decoded_url)

But I am unable to translate it into proper keyword, however if I paste the utm_term(%2525E8%2525AE%2525BA%2525E6%252596%252587%2525E6%25259C%25259F%2525E5%252588%25258A%2525E6%25258A%252595%2525E7%2525A8%2525BF) into chatGPT it shows me the output("设施周期开始")

I tried both User Defined function in BigQuery as well as urllib.parse.unquote in python.


Solution

  • Found the answer, the term is decoded multiple times, hence has to do 3 layer of decoding to get the Chinese term.

    DECLARE uri STRING; SET uri = 'https://www.random.cn/services/modifcation?utm_medium=cpc&utm_source=baidu&utm_term=%2525E8%2525AE%2525BA%2525E6%252596%252587%2525E6%25259C%25259F%2525E5%252588%25258A%2525E6%25258A%252595%2525E7%2525A8%2525BF';
    
    
    CREATE TEMP FUNCTION DecodeKeywords(encodedKeyword STRING) RETURNS STRING LANGUAGE js AS R""" try {
        return decodeURIComponent(encodedKeyword);   } catch (error) {
    
        return encodedKeyword ;   } """;
    
    select uri, DecodeKeywords(DecodeKeywords(DecodeKeywords(uri))) as decoded_uri