We gather lots of strings and send them to our clients in XML fragments. These strings could contain literally any character. We've been seeing an error caused by trying to serialize XElement instances that contain "bad" characters. Here's an example:
var message = new XElement("song");
char c = (char)0x1a; //sub
var someData = string.Format("some{0}stuff", c);
var attr = new XAttribute("someAttr", someData);
message.Add(attr);
string msgStr = message.ToString(SaveOptions.DisableFormatting); //exception here
The code above generates an exception at the indicated line. Here's the stacktrace:
'SUB', hexadecimal value 0x1A, is an invalid character.
System.ArgumentException
System.ArgumentException: '', hexadecimal value 0x1A, is an invalid character.
   at System.Xml.XmlEncodedRawTextWriter.InvalidXmlChar(Int32 ch, Char* pDst, Boolean entitize)
   at System.Xml.XmlEncodedRawTextWriter.WriteAttributeTextBlock(Char* pSrc, Char* pSrcEnd)
   at System.Xml.XmlEncodedRawTextWriter.WriteString(String text)
   at System.Xml.XmlWellFormedWriter.WriteString(String text)
   at System.Xml.XmlWriter.WriteAttributeString(String prefix, String localName, String ns, String value)
   at System.Xml.Linq.ElementWriter.WriteStartElement(XElement e)
   at System.Xml.Linq.ElementWriter.WriteElement(XElement e)
   at System.Xml.Linq.XElement.WriteTo(XmlWriter writer)
   at System.Xml.Linq.XNode.GetXmlString(SaveOptions o)
My suspicion is that this is not the correct behaviour and the bad char should be escaped into the XML. Whether this is desirable or not is a question I will answer later.
So here's the question:
Is there some way of treating strings so that this error does not occur, or should I simply strip all characters below 0x20 and cross my fingers?
This is what I am using in my code:
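// Requires: using System; using System.Text.RegularExpressions;

// Matches every control character in 0x00-0x1F; the verbatim string lets the regex engine read the \x escapes.
static readonly Lazy<Regex> ControlChars = new Lazy<Regex>(() => new Regex(@"[\x00-\x1f]", RegexOptions.Compiled));

// Default replacer: keeps tab, LF and CR (legal in XML 1.0) and rewrites the rest as a hex marker.
private static string FixData_Replace(Match match)
{
    if (match.Value.Equals("\t") || match.Value.Equals("\n") || match.Value.Equals("\r"))
        return match.Value;
    // "&#x" rather than "&#" because the digits are hexadecimal, e.g. 0x1A => "&#x001A;".
    return "&#x" + ((int)match.Value[0]).ToString("X4") + ";";
}

// Replaces control characters in data.ToString(), using the supplied replacer or the default one.
public static string Fix(object data, MatchEvaluator replacer = null)
{
    if (data == null) return null;
    return ControlChars.Value.Replace(data.ToString(), replacer ?? new MatchEvaluator(FixData_Replace));
}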
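All characters below 0x20 (except \r, \n and \t) are replaced by a hexadecimal marker in character-reference style: 0x1f => "&#x001F;". Be aware that XAttribute and XElement escape the ampersand when serializing, so the output contains "&amp;#x001F;" and a parser hands back the literal text "&#x001F;" rather than the character 0x1f. If the receiving side needs the original character, it has to decode the marker itself. Just use new XAttribute("attribute", Fix(yourString)).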
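A quick way to see what actually ends up in the serialized XML is to round-trip a value through XElement. The sketch below is only an illustration: the class name RoundTripDemo and the sample marker "&#x001A;" are assumptions made for the example, not part of the code above.

```csharp
using System;
using System.Xml.Linq;

public static class RoundTripDemo
{
    // Serializes a value as an attribute, then reads it back through the parser.
    public static string RoundTrip(string value, out string xml)
    {
        var element = new XElement("song", new XAttribute("someAttr", value));
        xml = element.ToString(SaveOptions.DisableFormatting);
        return XElement.Parse(xml).Attribute("someAttr").Value;
    }

    public static void Main()
    {
        string xml;
        string readBack = RoundTrip("some&#x001A;stuff", out xml);
        Console.WriteLine(xml);      // the '&' of the marker is written as &amp;
        Console.WriteLine(readBack); // the marker text itself, not the character 0x1A
    }
}
```

Serialization no longer throws because the string holds only ordinary characters, but the reader gets the marker text back unchanged, so both ends need to agree on the convention.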
It works for XElement content and should also work for XAttribute values.
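If dropping the offending characters is acceptable, the "strip and cross my fingers" option from the question can be made less of a gamble by asking the framework which characters are legal, instead of hard-coding the range below 0x20. This is a minimal sketch, assuming .NET 4.0 or later (where XmlConvert.IsXmlChar is available). The names XmlText and StripInvalidXmlChars are made up for the example.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlText
{
    // Returns the input with every character that XML 1.0 does not allow removed.
    public static string StripInvalidXmlChars(string text)
    {
        if (text == null) return null;
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
            else if (char.IsHighSurrogate(c) && i + 1 < text.Length
                     && char.IsLowSurrogate(text[i + 1]))
            {
                // Well-formed surrogate pair (a character above U+FFFF): keep both halves.
                sb.Append(c).Append(text[++i]);
            }
            // Everything else (control characters, lone surrogates, U+FFFE, U+FFFF) is dropped.
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(StripInvalidXmlChars("some\u001Astuff")); // prints "somestuff"
    }
}
```

Unlike the regex over 0x00-0x1F, this also catches U+FFFE, U+FFFF and unpaired surrogates, which can likewise make the writer throw. The trade-off is that the removed characters are lost rather than marked.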