python text utf-8 nlp

Detect / replace utf characters


I want to detect and/or replace unusual, non-emoji Unicode characters that break my tokenization pipeline, such as \uf0fc, which renders as a cup or glass.

That character is not covered by the emojis package, which I tried to use for filtering.

Is there a class that describes all such characters? Is there a way I can reliably detect them?


Solution

  • This is a character from a Private Use Area. It happens to look like a tankard in your font, but the Unicode standard doesn't mandate a specific look or meaning for these; it has whatever meaning you assign to it. The idea is that you agree upon a meaning with whoever you're communicating with - privately, meaning without getting the Unicode Consortium involved.

    You can use the standard unicodedata module to check whether a character's general category is Co (private use), or just hardcode the ranges: U+E000 to U+F8FF, U+F0000 to U+FFFFD, and U+100000 to U+10FFFD.
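
A minimal sketch of both approaches, in case it helps. The helper names (is_private_use, find_private_use, replace_private_use, PUA_RE) are made up for illustration and are not part of any library; only unicodedata.category and the re module are standard. The hardcoded ranges are the three Private Use Areas defined by Unicode.

```python
import re
import unicodedata


def is_private_use(ch: str) -> bool:
    """Return True if a single character has general category Co (private use)."""
    return unicodedata.category(ch) == "Co"


def find_private_use(text: str) -> list[tuple[int, str]]:
    """List (index, code point) pairs for every private-use character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_private_use(ch)
    ]


def replace_private_use(text: str, replacement: str = "") -> str:
    """Replace every private-use character with `replacement` (default: drop it)."""
    return "".join(replacement if is_private_use(ch) else ch for ch in text)


# Alternative: hardcode the three Private Use Area ranges in a regex.
# BMP PUA, Supplementary PUA-A (plane 15), Supplementary PUA-B (plane 16).
PUA_RE = re.compile(
    "[\ue000-\uf8ff\U000f0000-\U000ffffd\U00100000-\U0010fffd]"
)

sample = "ok \uf0fc done"
print(find_private_use(sample))           # [(3, 'U+F0FC')]
print(replace_private_use(sample, "?"))   # ok ? done
print(PUA_RE.sub("", sample))             # ok  done
```

Whether to drop these characters or swap in a placeholder depends on the pipeline. A placeholder keeps character offsets stable, which can matter if the tokenizer output is later aligned back to the original text.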