google-bigquery html-entities unicode-string

Convert HTML characters to unicode in BigQuery

In BigQuery, is there a way to replace HTML entity characters with their unicode equivalent?

For example, I have the following rows in a table:

id | text
1  | Hello World &#128540;
2  | Yes &#x1F61C; It works great &#128540;

And I would like to have:

id | text
1  | Hello World 😜
2  | Yes 😜 It works great 😜

Solution

The following general technique works:

Split the text on each character where an HTML entity character like 😜 is considered a single character
Keep track of character position with OFFSET
Rejoin all characters, but use some BigQuery STRING function magic to replace HTML entities with their unicode character.

SELECT
  id,
  ANY_VALUE(text) AS original,
  STRING_AGG(
    COALESCE(
      -- Support hex codepoints
      CODE_POINTS_TO_STRING(
        [CAST(CONCAT('0x', REGEXP_EXTRACT(char, r'(?:&#x)(\w+)(?:;)')) AS INT64)]
      ),
      -- Support decimal codepoints
      CODE_POINTS_TO_STRING(
        [CAST(CONCAT('0x', FORMAT('%x', CAST(REGEXP_EXTRACT(char, r'(?:&#)(\d+)(?:;)') AS INT64))) AS INT64)]
      ),
      -- Fall back to the character itself
      char
    ),
  '' ORDER BY char_position) AS text
FROM UNNEST([
  STRUCT(1 AS id, 'Hello World &#128540;' AS text),
  STRUCT(2 AS id, 'Yes &#x1F61C; It works great &#128540;'),
  STRUCT(3 AS id, '&#8212;' AS text),
  STRUCT(4 AS id, '&#x2014;' AS text)
])
CROSS JOIN
  -- Extract all characters individually except for HTML entity characters
  UNNEST(REGEXP_EXTRACT_ALL(text, r'(&#\w+;|.)')) char WITH OFFSET AS char_position
GROUP BY id