unicode text-parsing codepoint

Differentiate between symbol, number and letter codepoints in Unicode?


Unicode has a huge number of codepoints. How can I check whether a codepoint is a symbol (like "!" or "☭"), a number (like "4" or "৯"), a letter (like "a" or "え"), or a control character (usually not displayed directly)?

Is there any logic connecting a character's position to the kind of character it is (as opposed to just which alphabet it belongs to)? If not, are there any existing resources that classify which ranges are what?


Solution

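  • That is done through the General Category property of each codepoint. It is the third field of every record in the canonical UnicodeData.txt dataset, and any serious Unicode library should give you a way to query it. The value is a two-letter code whose first letter gives the major class: L for letters, M for marks, N for numbers, P for punctuation, S for symbols, Z for separators, and C for control characters and other non-graphic codepoints.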
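
As one illustration of querying this property, here is a minimal sketch using Python's standard `unicodedata` module, whose `category()` function returns the two-letter General Category code. The `MAJOR_CLASSES` table and the `classify` helper are hypothetical names introduced for this example; they are not part of any library.

```python
import unicodedata

# Hypothetical lookup table: first letter of the General Category
# value mapped to a human-readable major class.
MAJOR_CLASSES = {
    "L": "letter",
    "M": "mark",
    "N": "number",
    "P": "punctuation",
    "S": "symbol",
    "Z": "separator",
    "C": "other",  # control, format, surrogate, private use, unassigned
}


def classify(ch: str) -> str:
    """Return the major class of a single character (hypothetical helper)."""
    category = unicodedata.category(ch)  # e.g. "Ll", "Nd", "So", "Cc"
    return MAJOR_CLASSES[category[0]]


# Print the codepoint, its two-letter category and its major class.
for ch in "!☭4৯aえ\n":
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {classify(ch)}")
```

Note that under this property "!" is reported as punctuation (Po) rather than as a symbol, while "☭" is a symbol (So), so code that wants to treat both as "symbols" would probably need to accept the P and S classes together. The results also depend on the version of the Unicode database bundled with the Python interpreter in use.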