xml unicode utf-8 sax saxparser

Is 2C10 a valid UTF-8 character?


I'm running some XML through a SAX parser and have noticed that the parser fails on certain characters in the data content. The XML is supposed to be UTF-8 encoded, and the SAX parser is set to process that encoding.

After narrowing down the problematic strings and looking at the XML file in a hex editor, I can see, for example, that 2C10 causes a problem. If I change it to C2A2 (an example character given on Wikipedia), the SAX parser works. So is 2C10 not a valid UTF-8 character?


Solution

  • U+2C10 is GLAGOLITIC CAPITAL LETTER NASHI. Here are its properties:

    U+2C10 ‹Ⱀ› \N{GLAGOLITIC CAPITAL LETTER NASHI}
    \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
    All Any Alnum Alpha Alphabetic Assigned InGlagolitic Glagolitic
       Is_Glagolitic Cased Cased_Letter LC Changes_When_Casefolded
       CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased
       CWL Changes_When_NFKC_Casefolded CWKCF Lu L Glag Gr_Base
       Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS
       Letter L_ Uppercase_Letter Print Upper Uppercase Word
       XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha
       X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word
    Age=4.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L
       Block=Glagolitic Canonical_Combining_Class=0
       Canonical_Combining_Class=Not_Reordered CCC=NR
       Canonical_Combining_Class=NR General_Category=Cased_Letter
       Decomposition_Type=None DT=None East_Asian_Width=Neutral
       GC=LC General_Category=L General_Category=Letter
       General_Category=L_ General_Category=LC GC=L
       General_Category=Lu General_Category=Uppercase_Letter GC=Lu
       Script=Glagolitic Grapheme_Cluster_Break=Other GCB=XX
       Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
       Hangul_Syllable_Type=Not_Applicable HST=NA
       Joining_Group=No_Joining_Group JG=NoJoiningGroup
       Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL
       Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
       Numeric_Value=NaN NV=NaN Present_In=4.1 IN=4.1
       Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2
       IN=5.2 Present_In=6.0 IN=6.0 Script=Glag SC=Glag
       Sentence_Break=UP Sentence_Break=Upper SB=UP
       Word_Break=ALetter WB=LE Word_Break=LE _X_Begin
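
One point that may be worth checking: the code point U+2C10 is not the same thing as the two bytes 2C 10 seen in a hex editor. The sketch below, in Python purely for illustration, compares the UTF-8 encoding of U+2C10 with what the byte pair 2C 10 decodes to, and tests each decoded character against the Char production of XML 1.0. The helper `is_xml10_char` is a hypothetical name introduced here, and the idea that the parser rejects the file because of a control character is an assumption, since the question does not show the actual parser error.

```python
import unicodedata

# The code point U+2C10 and its UTF-8 byte sequence.
glagolitic = "\u2c10"
utf8_bytes = glagolitic.encode("utf-8")
name = unicodedata.name(glagolitic)

# The raw byte pair 2C 10 as it appears in the hex editor.
# Both bytes are below 0x80, so each one decodes to its own character.
raw_pair = bytes.fromhex("2C10")
decoded_pair = raw_pair.decode("utf-8")

# The byte pair C2 A2 from the question, for comparison.
cent = bytes.fromhex("C2A2").decode("utf-8")


def is_xml10_char(ch: str) -> bool:
    """Hypothetical helper: test ch against the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


print(utf8_bytes.hex().upper(), name)
print([f"U+{ord(c):04X}" for c in decoded_pair])
print([is_xml10_char(c) for c in decoded_pair])
print(f"U+{ord(cent):04X}", is_xml10_char(cent))
```

Under these assumptions, U+2C10 encodes to the three bytes E2 B0 90, while the bytes 2C 10 decode to a comma (U+002C) followed by the control character U+0010. That sequence is valid UTF-8, but U+0010 falls outside the characters XML 1.0 allows, which would be consistent with a parser accepting C2 A2 (U+00A2) and rejecting 2C 10.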