php regex encoding sublimetext3

Remove all unexpected characters from Google Translate output


I'm using Google Translate to translate some text.

Sometimes Google Translate adds non-printable characters to the translated text.

For example, go to this page: https://www.google.com/search?client=ubuntu&channel=fs&q=traduttore&ie=utf-8&oe=utf-8

Choose Italian to English and translate "leone marino".

The result will be:

sea ​​lion
   ^ two extra non-printable characters sit here, immediately before the "l"

You can verify this by pasting the text anywhere it can be edited (a text editor, a text field on a web page, or even the browser's address bar) and moving through it with the arrow keys: the cursor stops two extra times next to the space.

Leaving aside the reason why these characters are inserted, how can I remove all of them using a regex in PHP and/or in Sublime Text?

Also, how can I see the Unicode code points of these characters?


Solution

  • To remove all Unicode format characters (general category Cf, "Other, format"), you can use

    $s = preg_replace('~\p{Cf}+~u', '', $s);
    

    If you only need to remove the zero-width space (U+200B), a plain string replacement is enough (the "\u{200B}" escape requires PHP 7 or later):

    $s = str_replace("\u{200B}", "", $s);
    

    I use https://r12a.github.io/app-conversion/ (no affiliation) to check for hidden characters in strings.

    The following PHP code converts every character outside printable ASCII to its \uXXXX representation, so you can quickly see the code points of hidden characters:

    $input = "sea ​​lion";
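    echo preg_replace_callback('#[^ -~]#u', function ($m) {
        // [^ -~] matches any character outside printable ASCII;
        // json_encode() escapes it as "\uXXXX", and substr() strips the surrounding quotes.
        return substr(json_encode($m[0]), 1, -1);
    }, $input);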
    // => sea \u200b\u200blion
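
The removal and inspection steps can also be combined, so that you see exactly which characters were stripped. The following is only a sketch under a few assumptions: the helper name `strip_format_chars` is hypothetical (it is not part of PHP or of the answer above), and it relies on `mb_ord()`, which needs PHP 7.2+ with the mbstring extension enabled.

```php
<?php
// Hypothetical helper (illustrative only): removes every Unicode "Cf"
// (Other, format) character from $s and records the code point of each
// character it removed.
function strip_format_chars(string $s): array
{
    $removed = [];
    $clean = preg_replace_callback('~\p{Cf}~u', function (array $m) use (&$removed): string {
        // mb_ord() returns the code point of the matched character.
        $removed[] = sprintf('U+%04X', mb_ord($m[0], 'UTF-8'));
        return ''; // drop the character from the output
    }, $s);

    return [$clean, $removed];
}

// The zero-width spaces are written as escapes so they are visible in the source.
[$clean, $removed] = strip_format_chars("sea \u{200B}\u{200B}lion");
echo $clean, "\n";                 // sea lion
echo implode(' ', $removed), "\n"; // U+200B U+200B
```

Keep in mind that category Cf also covers characters such as U+200D ZERO WIDTH JOINER, which some scripts and emoji sequences depend on, so stripping the whole category may be too aggressive for arbitrary text. Logging the removed code points, as this sketch does, makes it easier to notice when that happens.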