ruby, regex, unicode, emoji

How do I remove emoji from a string?


I need to remove emoji from a string using a regex, but keep CJK (Chinese, Japanese, Korean) characters. I tried this regex:

REGEX = /[^\u1F600-\u1F6FF\s]/i

This regex works fine, except that it also matches Chinese, Japanese, and Korean characters, which I need to keep. Any idea how to solve this issue?


Solution

  • Karol S already provided a solution, but the reason might not be clear:

    "\u1F600" is actually "\u1F60" followed by "0":

    "\u1F60"    # => "ὠ"
    "\u1F600"   # => "ὠ0"
    

    You have to use curly braces for code points above FFFF:

    "\u{1F600}" #=> "😀"
    

    Therefore the character class [\u1F600-\u1F6FF] is interpreted as [\u1F60 0-\u1F6F F], i.e. it matches "\u1F60", the range "0".."\u1F6F" and "F".
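
    The difference between the two escape forms can be checked by inspecting code points. This is only a minimal sketch; the variable names are illustrative and not part of the original answer:

```ruby
# "\uXXXX" consumes exactly four hex digits, so the trailing "0"
# stays in the string as a separate literal character.
short_form = "\u1F600"
# "\u{...}" accepts longer hex values and yields a single code point.
brace_form = "\u{1F600}"

short_codepoints = short_form.codepoints  # U+1F60 followed by "0" (U+0030)
brace_codepoints = brace_form.codepoints  # the single code point U+1F600

puts short_codepoints.map { |cp| format('U+%04X', cp) }.join(' ')
puts brace_codepoints.map { |cp| format('U+%04X', cp) }.join(' ')
```

    The first string holds two code points, the second only one, which is why the four-digit form cannot express the intended range.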

    Using curly braces solves the issue:

    /[\u{1F600}-\u{1F6FF}]/
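
    Applied to the original problem, the corrected class can be passed to gsub. A minimal sketch, assuming emoji outside U+1F600..U+1F6FF do not need to be handled; emoji_range and strip_emoji are hypothetical names, and the sample string is written with escapes so the snippet stays ASCII-only:

```ruby
# Hypothetical names; the range is the corrected one from the answer.
emoji_range = /[\u{1F600}-\u{1F6FF}]/

# Returns a copy of the text with every character in emoji_range removed.
strip_emoji = ->(text) { text.gsub(emoji_range, '') }

# U+4F60 U+597D (Chinese), U+1F600 (emoji), U+3053 U+3093 (Japanese),
# U+1F680 (emoji), U+C548 U+B155 (Korean)
sample  = "\u4F60\u597D\u{1F600} \u3053\u3093\u{1F680} \uC548\uB155"
cleaned = strip_emoji.call(sample)

puts cleaned
```

    The CJK characters are left untouched because their code points lie outside the emoji range.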
    

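    The class [\u{1F600}-\u{1F6FF}] matches (emoji) characters in these Unicode blocks:

      • Emoticons (U+1F600..U+1F64F)
      • Ornamental Dingbats (U+1F650..U+1F67F)
      • Transport and Map Symbols (U+1F680..U+1F6FF)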


    You can also use unpack, pack, and between? to achieve a similar result. This also works in Ruby 1.8.7, which doesn't support Unicode in regular expressions.

    s = 'Hi!😀'
    #=> "Hi!\360\237\230\200"
    
    s.unpack('U*').reject{ |e| e.between?(0x1F600, 0x1F6FF) }.pack('U*')
    #=> "Hi!"
    

    Regarding your Rubular example: emoji are single characters:

    "😀".length  #=> 1
    "😀".chars   #=> ["😀"]
    

    Whereas kaomoji are a combination of multiple characters:

    "^_^".length #=> 3
    "^_^".chars  #=> ["^", "_", "^"]
    

    Matching these is a very different task (and you should ask that in a separate question).