php unicode preg-replace preg-match

preg_replace unicode characters


I have several strings which contain Unicode escape sequences. I've been tasked with stripping out everything from these strings except the escapes, so for example, the string below

\ud83d\ude82 + \u2600\ufe0f = \ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29

would become

\ud83d\ude82 \u2600\ufe0f \ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29

I then need to look for repeating codes and separate them, so that:

 \ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29

becomes:

\ud83d\ude29 \ud83d\ude29 \ud83d\ude29 \ud83d\ude29 \ud83d\ude29

I've tried several preg_match solutions for the first part, but they either don't remove any characters from the string or remove everything. Below is the latest attempt:

/(^\\\u[0-9a-f]{4})+/

Not being too familiar with Regex, I'm starting to scratch my head in confusion as I'm not really sure what else to try.

The eventual goal is to insert each Unicode sequence into a database as its own record.


Solution

  • It could be done in two steps:

    $str = '\ud83d\ude82 + \u2600\ufe0f = \ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29';
    // remove every run of non-backslash characters that follows a \uXXXX escape
    $str = preg_replace('/(?<=\\\\u[a-f0-9]{4})[^\\\\]+/', '', $str);
    // insert a space between consecutive repeats of the same pair of escapes
    $str = preg_replace('/((?:\\\\u[a-f0-9]{4}){2})(?=\1)/', '$1 ', $str);
    echo $str,"\n";
    

    Output:

    \ud83d\ude82\u2600\ufe0f\ud83d\ude29 \ud83d\ude29 \ud83d\ude29 \ud83d\ude29 \ud83d\ude29
    

    Regex #1:

    /                       : regex delimiter
      (?<=                  : lookbehind
        \\\\u[a-f0-9]{4}    : a \uXXXX escape (backslash, "u", 4 hex digits)
      )                     : end lookbehind
      [^\\\\]+              : 1 or more characters that are NOT a backslash
    /                       : regex delimiter
    

    Regex #2:

    /                       : regex delimiter
      (                     : start group 1
        (?:                 : non capture group
          \\\\u[a-f0-9]{4}  : a \uXXXX escape
        ){2}                : appears twice (2 consecutive escapes, e.g. a surrogate pair)
      )                     : end group 1
      (?=\1)                : lookahead, group 1 is immediately repeated
    /                       : regex delimiter
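
Note that the output above has no spaces between distinct groups, whereas the question asks for them. Text before the first escape is also left in place, because the lookbehind needs a preceding escape. Below is a minimal sketch of one way to get one array entry per record. It assumes the strings contain literal lowercase \uXXXX escapes, as in the question. The function name splitEscapes and the extra leading-text step are illustrative additions, not part of the answer above.

```php
<?php
// Hypothetical helper (illustrative name): turn a string of literal \uXXXX
// escapes mixed with other text into a list of escape groups.
function splitEscapes(string $str): array
{
    // Extra step (assumption): drop any text before the first backslash,
    // which the lookbehind-based pattern below cannot reach.
    $str = preg_replace('/^[^\\\\]+/', '', $str);
    // Same pattern as regex #1, but replace with a single space instead of
    // an empty string so that distinct groups stay separated.
    $str = preg_replace('/(?<=\\\\u[a-f0-9]{4})[^\\\\]+/', ' ', $str);
    // Same as regex #2: put a space between immediately repeated pairs.
    $str = preg_replace('/((?:\\\\u[a-f0-9]{4}){2})(?=\1)/', '$1 ', $str);
    // Split on whitespace and discard empty pieces (e.g. a trailing space).
    return preg_split('/\s+/', $str, -1, PREG_SPLIT_NO_EMPTY);
}

$records = splitEscapes('\ud83d\ude82 + \u2600\ufe0f = \ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29');
// One escape group per line: the train, the sun, then five separate faces.
echo implode("\n", $records), "\n";
```

Each element of $records can then be inserted as its own row. Like regex #2, this only splits repeats of two-escape units, so a run of repeated single escapes (for example \u2600\u2600) would stay together.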