
Best way to encode text data for XML


I was looking for a generic method in .Net to encode a string for use in an Xml element or attribute, and was surprised when I didn't immediately find one. So, before I go too much further, could I just be missing the built-in function?

Assuming for a moment that it really doesn't exist, I'm putting together my own generic EncodeForXml(string data) method, and I'm thinking about the best way to do this.

The data I'm using that prompted this whole thing could contain bad characters like &, <, ", etc. It could also contain, on occasion, the properly escaped entities &amp;, &lt;, and &quot;, which means just using a CDATA section may not be the best idea. That seems kind of clunky anyway; I'd much rather end up with a nice string value that can be used directly in the XML.

I've used a regular expression in the past to just catch bad ampersands, and I'm thinking of using it to catch them in this case as well as the first step, and then doing a simple replace for other characters.

So, could this be optimized further without making it too complex, and is there anything I'm missing?

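Imports System.Text.RegularExpressions

Function EncodeForXml(ByVal data As String) As String
    ' Matches an ampersand that does not already begin an entity such as &amp; or &#38;
    Static badAmpersand As New Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)")

    data = badAmpersand.Replace(data, "&amp;")

    ' Escape the remaining markup characters
    Return data.Replace("<", "&lt;").Replace("""", "&quot;").Replace(">", "&gt;")
End Function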

Sorry to all you C#-only folks: I don't really care which language I use, but I wanted to make the Regex static, and you can't do that in C# without declaring it outside the method, so this will be VB.NET.

Finally, we're still on .Net 2.0 where I work, but if someone could take the final product and turn it into an extension method for the string class, that'd be pretty cool too.

Update: The first few responses indicate that .NET does indeed have built-in ways of doing this. But now that I've started, I kind of want to finish my EncodeForXml() method just for the fun of it, so I'm still looking for ideas for improvement. Notably: a more complete list of characters that should be encoded as entities (perhaps stored in a list/map), and something that performs better than chaining .Replace() calls on immutable strings.
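
For the performance part, one option is to walk the string once and append to a StringBuilder instead of chaining .Replace() calls. The following is only a sketch under some assumptions: the module name and EncodeForXmlSinglePass are made up for illustration, it reuses the bad-ampersand regex from the method above as a first step, and it adds the apostrophe (&apos;) to the list of encoded characters, which the original method does not handle.

```vbnet
Imports System.Text
Imports System.Text.RegularExpressions

Module XmlEncodingSketch
    ' Same pattern as in the question: an ampersand that does not start an entity.
    Private ReadOnly badAmpersand As New Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)")

    ' Hypothetical helper: fixes bare ampersands with the regex, then escapes
    ' the remaining markup characters in a single pass over the string.
    Function EncodeForXmlSinglePass(ByVal data As String) As String
        If String.IsNullOrEmpty(data) Then Return data

        data = badAmpersand.Replace(data, "&amp;")

        Dim sb As New StringBuilder(data.Length + 16)
        For Each c As Char In data
            Select Case c
                Case "<"c
                    sb.Append("&lt;")
                Case ">"c
                    sb.Append("&gt;")
                Case """"c
                    sb.Append("&quot;")
                Case "'"c
                    sb.Append("&apos;")
                Case Else
                    sb.Append(c)
            End Select
        Next
        Return sb.ToString()
    End Function
End Module
```

Whether this is actually faster than the chained version for typical input sizes would need to be measured; the main gain is that only one new string is built after the regex step.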


Solution

  • The classes in System.Xml handle the encoding for you when you write element or attribute values, so you don't need a method like this.
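
As a rough illustration of that point, here is a minimal sketch using XmlWriter, which is available in .NET 2.0. The module name, BuildItemXml, and the "item" and "title" names are placeholders invented for this example, not anything from the question.

```vbnet
Imports System.IO
Imports System.Xml

Module XmlWriterSketch
    ' Hypothetical helper: writes the same text as an attribute value and as
    ' element content, and lets XmlWriter do the escaping.
    Function BuildItemXml(ByVal text As String) As String
        Dim settings As New XmlWriterSettings()
        settings.OmitXmlDeclaration = True   ' keep the output to the element itself

        Dim sw As New StringWriter()
        Using writer As XmlWriter = XmlWriter.Create(sw, settings)
            writer.WriteStartElement("item")
            writer.WriteAttributeString("title", text)   ' escaped for attribute context
            writer.WriteString(text)                     ' escaped for element content
            writer.WriteEndElement()
        End Using
        Return sw.ToString()
    End Function
End Module
```

Calling BuildItemXml("a < b & c") should produce an item element in which < and & appear as &lt; and &amp; in both places. Note that the writer treats its input as literal text, so a string that already contains &amp; is escaped again rather than passed through, which differs from what the regex in the question tries to do.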