c# xml-deserialization string-parsing

C# - Deserializing XML when the whitespace between tags is escaped


I am posting some XML to an API Gateway method in AWS, which has an integration to SNS. An SQS queue is subscribed to the topic, and I have a C# process which polls the queue intermittently and needs to deserialize the XML.

The trouble is that the whitespace between the XML tags gets escaped somewhere along the line, so tabs become \t and new lines become \r\n. These escape sequences end up as literal characters inside the string.

Example XML which is posted to API Gateway:

<?xml version="1.0" encoding="utf-8"?>
<ProfileInformation>
    <Username>bgs264</Username>
</ProfileInformation>

String which is read off the SQS queue:

<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<ProfileInformation>\n\t<Username>bgs264</Username>\n</ProfileInformation>

Note that the quotes around the attribute values in the declaration end up as \" and the whitespace that was posted ends up as \t, \r\n, etc.

This is not a case of the debugger displaying a tab as \t. The string really does contain a backslash followed by the letter t.
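To make the distinction concrete, here is a minimal sketch. The class and constant names are hypothetical and only serve as illustration. A regular C# literal turns \t into one tab character, while a verbatim literal keeps the backslash and the t as two separate characters, which is the situation described above.

```csharp
public static class EscapeDemo
{
    // A real tab: the compiler turns \t into the single character U+0009.
    public const string RealTab = "a\tb";

    // What the queue message contains: a backslash followed by the letter t.
    // The verbatim literal (@"...") switches off escape processing.
    public const string LiteralBackslashT = @"a\tb";

    // True when the text contains a literal backslash character.
    public static bool HasLiteralBackslash(string text)
    {
        return text.IndexOf('\\') >= 0;
    }
}
```

EscapeDemo.RealTab is three characters long and contains no backslash. EscapeDemo.LiteralBackslashT is four characters long and does contain one.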

So when I try to deserialize, using

using (var reader = new StringReader(message))
{
    var myObj = serializer.Deserialize(reader) as ProfileInformation;
}

I get:

InvalidOperationException: There is an error in XML document (1, 15).

It refers to the first \ character in the declaration, as in version=\"1.0\"
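As a quick check of that reading, the sketch below computes the 1-based column of the first backslash in a line. The helper name is hypothetical. The (1, 15) in the exception is taken to mean line 1, position 15.

```csharp
public static class ErrorPosition
{
    // Returns the 1-based column of the first backslash, or 0 if there is none.
    public static int FirstBackslashColumn(string line)
    {
        return line.IndexOf('\\') + 1;
    }
}
```

Called with the escaped declaration from the queue message, this returns 15, because `<?xml version=` occupies the first 14 columns.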

My immediate idea was to simply string.Replace \t with an empty string, and so on, but that's unacceptable because a user's username could legitimately be bgs\t264, and the replace would corrupt it. In that case I presume I would get bgs\\t264 in the message, so a replace would leave me, erroneously, with bgs\264.
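The failure can be reproduced in a few lines. This sketch assumes, as presumed above, that a literal backslash in the username arrives escaped as \\. The class and method names are hypothetical.

```csharp
public static class NaiveReplace
{
    // The blind approach: delete every backslash-t pair, wherever it occurs.
    public static string StripTabs(string escaped)
    {
        return escaped.Replace(@"\t", "");
    }
}
```

Given <Username>bgs\\t264</Username>, StripTabs matches the second backslash together with the t and returns <Username>bgs\264</Username>, so the username is silently changed.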

So I need to fix these \n\t characters where they occur between XML tags.

For what it's worth, I also have a lambda written in Go which has no problem with this and simply deserializes the exact same string straight into XML. So it must be possible.

My initial thoughts:

  • Can I somehow decode the string before passing it for deserialization? I tried this with HttpUtility.HtmlDecode, but I don't think it's actually HTML that I'm trying to decode!
  • Is there a different XML library I can use that would work?

Solution

  • I would guess, and some googling seems to support the theory, that the message you're seeing has been converted to JSON and that the escape sequences are a consequence of that.

    The ideal approach would be to investigate and prevent this from happening. I don't know enough about SNS to advise, and you indicate this is a non-starter, so the simplest approach would be to reverse this process once you receive the message.

    You can use a JSON library like Json.NET to do this:

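    // Wrap the escaped text in quotes so that it forms a valid JSON string literal.
    var jsonString = string.Format("\"{0}\"", message);
    
    // Json.NET resolves the escapes (\" \n \t \\) back to the real characters.
    var xmlString = JsonConvert.DeserializeObject<string>(jsonString);
    
    using (var reader = new StringReader(xmlString))
    {
       var profileInformation = (ProfileInformation) serializer.Deserialize(reader);
    }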
    }
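
    For reference, here is a self-contained sketch of the same idea. It swaps in System.Text.Json for Json.NET purely to avoid an external package. The ProfileInformation shape and the names EscapedXmlReader, Unescape, Parse and MessageFromEnvelope are assumptions made for illustration. MessageFromEnvelope additionally assumes that the SQS body is the full SNS notification JSON with a "Message" field, which is what SNS delivers when raw message delivery is not enabled. If that holds, parsing the envelope unescapes the payload without any quote wrapping.

```csharp
using System.IO;
using System.Text.Json;
using System.Xml.Serialization;

// Assumed shape of the payload, matching the example XML in the question.
public class ProfileInformation
{
    public string Username { get; set; }
}

public static class EscapedXmlReader
{
    // Treats the escaped text as the body of a JSON string literal.
    // Wrapping it in quotes yields a valid JSON string token, and the parser
    // resolves \" \n \t \\ and \uXXXX in a single left-to-right pass.
    public static string Unescape(string escaped)
    {
        return JsonSerializer.Deserialize<string>("\"" + escaped + "\"");
    }

    // Assumption: body is the SNS notification envelope, a JSON object
    // whose "Message" property carries the original payload.
    public static string MessageFromEnvelope(string body)
    {
        using (var doc = JsonDocument.Parse(body))
        {
            return doc.RootElement.GetProperty("Message").GetString();
        }
    }

    // Unescapes the text, then hands the resulting XML to XmlSerializer.
    public static ProfileInformation Parse(string escaped)
    {
        var serializer = new XmlSerializer(typeof(ProfileInformation));
        using (var reader = new StringReader(Unescape(escaped)))
        {
            return (ProfileInformation)serializer.Deserialize(reader);
        }
    }
}
```

    Because the JSON parser consumes escapes in order, an escaped backslash followed by t (\\t) comes out as a backslash and a t rather than being mistaken for a tab, which is the case a plain string.Replace gets wrong.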