I'm running some XML through a SAX parser and have noticed that the parser does not function correctly when certain characters appear in the data content. The XML is supposed to be UTF-8 encoded, and the SAX parser is set to process that encoding.
After narrowing down the problematic strings and looking at the XML file in a hex editor, I can see, for example, that 2C10 causes a problem. If I change it to C2A2 (an example character given on Wikipedia), the SAX parser works. So is 2C10 not a valid UTF-8 character?
U+2C10 is GLAGOLITIC CAPITAL LETTER NASHI. Here are its properties:
U+2C10 ‹Ⱀ› \N{GLAGOLITIC CAPITAL LETTER NASHI}
\w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
All Any Alnum Alpha Alphabetic Assigned InGlagolitic Glagolitic
Is_Glagolitic Cased Cased_Letter LC Changes_When_Casefolded
CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased
CWL Changes_When_NFKC_Casefolded CWKCF Lu L Glag Gr_Base
Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS
Letter L_ Uppercase_Letter Print Upper Uppercase Word
XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha
X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word
Age=4.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L
Block=Glagolitic Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR
Canonical_Combining_Class=NR General_Category=Cased_Letter
Decomposition_Type=None DT=None East_Asian_Width=Neutral
GC=LC General_Category=L General_Category=Letter
General_Category=L_ General_Category=LC GC=L
General_Category=Lu General_Category=Uppercase_Letter GC=Lu
Script=Glagolitic Grapheme_Cluster_Break=Other GCB=XX
Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
Hangul_Syllable_Type=Not_Applicable HST=NA
Joining_Group=No_Joining_Group JG=NoJoiningGroup
Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL
Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=4.1 IN=4.1
Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2
IN=5.2 Present_In=6.0 IN=6.0 Script=Glag SC=Glag
Sentence_Break=UP Sentence_Break=Upper SB=UP
Word_Break=ALetter WB=LE Word_Break=LE _X_Begin
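
One thing worth checking: the code point U+2C10 and the byte pair 2C 10 in a hex editor are not the same thing. The sketch below assumes that "2C10" in the question refers to two raw bytes in the file. It decodes those bytes as UTF-8, shows how U+2C10 itself would be encoded, and tests each code point against the Char production of XML 1.0. The helper names (`describe_bytes`, `utf8_hex`, `allowed_in_xml10`) are made up for illustration and are not part of any parser API.

```python
import unicodedata


def describe_bytes(hex_string):
    """Decode raw bytes (given as hex) as UTF-8 and list the resulting code points."""
    text = bytes.fromhex(hex_string).decode("utf-8")
    return ["U+%04X" % ord(ch) for ch in text]


def utf8_hex(code_point):
    """Return the UTF-8 encoding of a single code point as upper-case hex."""
    return chr(code_point).encode("utf-8").hex().upper()


def allowed_in_xml10(code_point):
    """Check a code point against the Char production of XML 1.0."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


# The two bytes seen in the hex editor: a comma followed by a control character.
print(describe_bytes("2C10"))        # ['U+002C', 'U+0010']

# The Wikipedia example: one two-byte sequence for U+00A2 (cent sign).
print(describe_bytes("C2A2"))        # ['U+00A2']

# The character U+2C10 itself takes three bytes in UTF-8.
print(utf8_hex(0x2C10))              # E2B090
print(unicodedata.name(chr(0x2C10))) # GLAGOLITIC CAPITAL LETTER NASHI

# Only U+0010 falls outside what XML 1.0 allows in a document.
print([allowed_in_xml10(cp) for cp in (0x2C, 0x10, 0xA2, 0x2C10)])
```

If that assumption holds, the file would be valid UTF-8 but would contain the control character U+0010, which XML 1.0 does not permit, and that could be why the parser rejects it. A file that really contained U+2C10 would show the bytes E2 B0 90 in the hex editor.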