Search code examples
javascriptsqlregexsnowflake-cloud-data-platformemoji

How do I detect an emoji in a Snowflake VARCHAR?


I have a 1B row table of chat messages with a max-sized VARCHAR column for the raw message text. How do I -

  1. Select just the rows that contain 1 or more emoji characters
  2. Efficiently filter out rows without emoji (if needed to improve performance)

Solution

  • Combining the knowledge shared by Lukasz and Greg, plus discovering an additional hidden character, I can present this solution:

    • 1 JS UDF that detects the presence of emojis.
    • 1 JS UDF that detects if the string is only emojis.
    create or replace function has_emoji(X text)
    returns boolean
    language javascript
    as $$
    return /\p{Extended_Pictographic}/u.test(X);
    $$;
    
    create or replace function only_emoji(X text)
    returns boolean
    language javascript
    as $$
    return /^[\p{Extended_Pictographic}\p{Emoji_Component}]+$/u.test(X);
    $$;
    

    Sample use:

    with data as (
      select $1 t
      from values('❄️'),('❄️ is the data ☁️'),('no emoji')
    )
    
    select *, has_emoji(t), only_emoji(t)
    from data
    ;
    

    enter image description here

    --

    Some research:

    • '❄️'.replace(/\p{Extended_Pictographic}/ug, '') returns ''
    • but '❄️'.replace(/\p{Extended_Pictographic}/ug, '') is not equal to ''
    • Turns out there are hidden characters that still need to be replaced, and matched by Emoji_Component

    enter image description here