
How to get the unescaped length of XElement inner text?


I am trying to parse the following Java resource file, which is XML. I am parsing it with C# and XDocument, so this is not a Java question.

<?xml version="1.0" encoding="utf-8"?>
  <resources>
    <string name="problem">&#160;test&#160;</string>
    <string name="no_problem"> test </string>
  </resources>

The problem is that the XDocument.Load(string path) method loads this as an XDocument whose two XElements look identical.

I load the file.

string filePath = @"c:\res.xml"; // whatever
var xDocument = XDocument.Load(filePath);

The problem shows up when I walk the XDocument object.

foreach (var node in xDocument.Root.Nodes())
{
    if (node.NodeType == XmlNodeType.Element)
    {
        var xElement = node as XElement;
        if (xElement != null) // just to be sure
        {
            var elementText = xElement.Value;
            Console.WriteLine("Text = '{0}', Length = {1}", 
                elementText, elementText.Length);
        }
    }
}

This produces the following two lines:

"Text = ' test ', Length = 6" 
"Text = ' test ', Length = 6"

I want to get the following two lines instead:

"Text = '&#160;test&#160;', Length = 16"
"Text = ' test ', Length = 6"

The document encoding is UTF-8, in case that is relevant.


Solution

  • string filePath = @"c:\res.xml"; // whatever
    var xDocument = XDocument.Load(filePath);
    // ElementAt requires "using System.Linq;".
    string one = xDocument.Root.Elements().ElementAt(0).Value; // "\u00A0test\u00A0" (non-breaking spaces)
    string two = xDocument.Root.Elements().ElementAt(1).Value; // " test " (ordinary spaces)
    Console.WriteLine(one == two); // False
    Console.WriteLine("{0} {1}", (int)one[0], (int)two[0]); // 160 32
    

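    You have two different strings. The &#160; is still there, but as the decoded Unicode character U+00A0 (non-breaking space) rather than as a character reference: the XML parser resolves character references while loading, so XElement.Value returns the parsed text and the escaped form from the file is not preserved. One possible way to get it back is to manually replace the non-breaking space with "&#160;":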

    String result = one.Replace(((char) 160).ToString(), "&#160;");
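
If more characters than the non-breaking space need the same treatment, the replacement above can be generalized. The following is only a sketch: XmlTextEscaper and EscapeNonAscii are hypothetical names, not part of LINQ to XML, and the choice to re-escape every character above U+007F as a decimal reference is an assumption. It cannot tell whether the file originally contained the reference or the literal character, and it does not re-escape &amp;, &lt; or &gt;.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of System.Xml.Linq.
static class XmlTextEscaper
{
    // Rewrites every non-ASCII character as a decimal character
    // reference such as &#160;. ASCII characters are copied unchanged.
    public static string EscapeNonAscii(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] <= 127)
            {
                sb.Append(text[i]);
                continue;
            }

            // ConvertToUtf32 combines a surrogate pair into one code point.
            // It throws on a lone surrogate, which well-formed XML text
            // should not contain.
            int codePoint = char.ConvertToUtf32(text, i);
            if (char.IsSurrogatePair(text, i))
            {
                i++; // skip the low surrogate that was just consumed
            }

            sb.Append("&#").Append(codePoint).Append(';');
        }
        return sb.ToString();
    }
}
```

With this helper, XmlTextEscaper.EscapeNonAscii(one) gives "&#160;test&#160;" for the first element, whose Length is 16, while the second element's value is returned unchanged with Length 6.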