I'm using Google Translate to translate some text.
Sometimes Google Translate adds non-printable characters to the translated text.
For example, go to this page: https://www.google.com/search?client=ubuntu&channel=fs&q=traduttore&ie=utf-8&oe=utf-8
Choose Italian to English and translate leone marino.
The result will be:
sea lion
^ there are two more non-printable characters here, right before the "l"
You can test it by pasting the text anywhere it can be edited (for example in a text editor, in a text field on any web page, or even in the browser's URL bar). If you move through it with the keyboard arrows, you will notice that the cursor stops two extra times next to the space character.
Leaving aside the reason why these characters are inserted, how can I remove all these non-printable characters using a regex in PHP and/or in Sublime Text?
Also, how can I see the Unicode code points of these characters?
To remove all Unicode format characters (general category Cf, "Other, Format"), you may use
$s = preg_replace('~\p{Cf}+~u', '', $s);
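A minimal sketch to check the effect of that pattern. The sample string is an assumption built with explicit escapes so the hidden characters are visible in the source, and it needs PHP 7.0+ for the \u{...} syntax:

```php
<?php
// Assumed sample input: two zero-width spaces (U+200B) right before the "l".
$s = "sea \u{200B}\u{200B}lion";

// strlen counts bytes; each U+200B takes 3 bytes in UTF-8, so 4 + 6 + 4 = 14.
$bytesBefore = strlen($s);

// Remove every run of format (Cf) characters; the u flag enables UTF-8 mode.
$clean = preg_replace('~\p{Cf}+~u', '', $s);

// Only the 8 visible ASCII characters remain.
$bytesAfter = strlen($clean);

var_dump($bytesBefore, $bytesAfter, $clean === "sea lion");
```

Note that Cf also covers characters such as the soft hyphen (U+00AD), the zero-width joiner and non-joiner (U+200D, U+200C) and the byte order mark (U+FEFF). Joiners carry meaning in some scripts and in emoji sequences, so removing the whole category may be too broad for multilingual text.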
Since you only need to remove the zero-width space (U+200B), you may just use the following (the \u{...} escape syntax requires PHP 7.0 or later)
$s = str_replace("\u{200B}", "", $s);
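If a handful of specific invisible characters has to go, str_replace also accepts an array of search strings. The sketch below is only an illustration: the list in $invisible is a hypothetical selection, not something taken from the question, and should be adapted to the data at hand.

```php
<?php
// Hypothetical list of invisible characters to strip; adjust as needed.
$invisible = [
    "\u{200B}", // ZERO WIDTH SPACE
    "\u{2060}", // WORD JOINER
    "\u{FEFF}", // ZERO WIDTH NO-BREAK SPACE (byte order mark)
];

// Assumed sample input: two zero-width spaces right before the "l".
$s = "sea \u{200B}\u{200B}lion";

// Each entry in $invisible is replaced with the empty string.
$clean = str_replace($invisible, '', $s);

var_dump($clean === "sea lion");
```

str_replace works on bytes, which is safe here because a complete UTF-8 sequence cannot match in the middle of another character.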
I use https://r12a.github.io/app-conversion/ (no affiliation) to check for hidden characters in strings.
Possible PHP code to convert a string to \uXXXX representation to quickly see the Unicode code points for non-ASCII chars:
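$input = "sea \u{200B}\u{200B}lion"; // the two zero-width spaces, written as escapes so they are visible
echo preg_replace_callback('#[^ -~]#u', function ($m) {
    // json_encode escapes non-ASCII characters as \uXXXX; substr drops the surrounding quotes
    return substr(json_encode($m[0]), 1, -1);
}, $input);
// => sea \u200b\u200blion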
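As a variation, the hidden characters can be listed on their own in the usual U+XXXX notation. This is only a sketch: listHiddenCodePoints is a hypothetical helper name, and it assumes PHP 7.2+ with the mbstring extension for mb_ord.

```php
<?php
// Hypothetical helper: return the code point of every character outside
// the printable ASCII range (space to ~), formatted as U+XXXX.
function listHiddenCodePoints(string $s): array
{
    // The u flag makes the pattern match whole UTF-8 characters, not bytes.
    preg_match_all('#[^ -~]#u', $s, $matches);

    $out = [];
    foreach ($matches[0] as $ch) {
        // mb_ord returns the code point as an integer (assumes mbstring).
        $out[] = sprintf('U+%04X', mb_ord($ch, 'UTF-8'));
    }
    return $out;
}

// Assumed sample input: two zero-width spaces right before the "l".
$found = listHiddenCodePoints("sea \u{200B}\u{200B}lion");
print_r($found);
```

For the sample string this yields two entries, both U+200B, which can then be looked up in the Unicode character database.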