I want to detect and/or replace weird non-emoji Unicode characters that break my tokenization pipeline, like \uf0fc, which renders like a cup/glass.
That character is not contained in the emojis package, which I tried for filtering.
Is there a class that describes all such characters? Is there a way I can reliably detect them?
This is a character from a Private Use Area. It happens to look like a tankard in your font, but the Unicode standard doesn't mandate a specific look or meaning for these; it has whatever meaning you assign to it. The idea is that you agree upon a meaning with whoever you're communicating with - privately, meaning without getting the Unicode Consortium involved.
You can use the standard unicodedata module to check whether a character is from the Co category, or just hardcode the ranges, as described here.
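A minimal sketch of both approaches, assuming you want to drop or replace the offending characters before tokenizing. The function names is_private_use and replace_private_use and the PRIVATE_USE_RE pattern are made up for illustration; the hardcoded ranges are the three Private Use blocks in Unicode (U+E000 to U+F8FF, U+F0000 to U+FFFFD, U+100000 to U+10FFFD).

```python
import re
import unicodedata


def is_private_use(ch: str) -> bool:
    """Return True if a single character has general category Co (private use)."""
    return unicodedata.category(ch) == "Co"


def replace_private_use(text: str, replacement: str = "") -> str:
    """Replace every private-use character in text with `replacement`."""
    return "".join(replacement if is_private_use(ch) else ch for ch in text)


# Alternative: hardcode the three Private Use ranges in a regex.
PRIVATE_USE_RE = re.compile(
    "[\ue000-\uf8ff\U000f0000-\U000ffffd\U00100000-\U0010fffd]"
)

sample = "done \uf0fc ok"

# Detect: list the code points of private-use characters in the sample.
print([hex(ord(ch)) for ch in sample if is_private_use(ch)])  # ['0xf0fc']

# Replace: both approaches give the same result on this sample.
print(replace_private_use(sample, "?"))
print(PRIVATE_USE_RE.sub("?", sample))
```

The category check stays correct without maintaining ranges by hand, while the regex tends to be faster on long strings. Real emoji such as U+1F37A are in category So, so neither approach touches them.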