Search code examples
google-bigqueryunicode-normalization

function to convert unicode in bigquery


I trying out the NORMALIZE function with NFKC in bigquery from the documentation, I see that I can convert a string to a readable format. For example

WITH EquivalentNames AS (
  SELECT name
  FROM UNNEST([
      'Jane\u2004Doe',
      '\u0026 Hello'
      ]) AS name
)
SELECT
  NORMALIZE(name, NFKC) AS normalized_str
FROM EquivalentNames
GROUP BY 1;

The ampersand character shows up correctly, but I have a table, with a column of STRING with unicode character in its values, but I'm not able to use NORMALIZE to convert it to a readable format.

I've also tried some of the other solutions presented Decode Unicode's to Local language in Bigquery but nothing is working yet.

Attached is an example of the data: enter image description here


Solution

  • You posted a question about NORMALIZE, but didn't make your goals clear.

    Here I'll answer the question about NORMALIZE - to point out that it probably doesn't do what you are expecting it to do. But at least it's acting as expected.

    There are many ways to encode the same string with Unicode. Normalize chooses one, while preserving the string.

    See this query:

    SELECT *, a=b ab, a=c ac, a=d ad, b=c bc, b=d bd, c=d cd
    FROM (
      SELECT NORMALIZE('hello ñá 😞', NFC) a
       , NORMALIZE('hello ñá 😞', NFKC) b
       , NORMALIZE('hello ñá 😞', NFD) c
       , NORMALIZE('hello ñá 😞', NFKD) d
    )
    

    enter image description here

    As you see - every time you get the same string, they just have different non-visible representations.