c# xml xmldocument

How to extract a specific value from an XML string?


I would like to extract the first two sentences under the <P> tag.

For example (input string):

<P align=justify><STRONG>Pricings<BR></STRONG>It was another active week for names leaving the database. The week's prints consisted of two ILS, and sever ITS.</P>

Required output string:

It was another active week for names leaving the database. The week's prints consisted of two ILS, and sever ITS. 

Currently, my function below is throwing the following error:

System.Xml.XmlException: 'justify' is an unexpected token. The expected token is '"' or ''

price = bottom.Substring(bottom.IndexOf("Pricings"), 8);

XmlDocument doc = new XmlDocument();
doc.LoadXml(bottom);


XmlNodeList pList = doc.SelectNodes("/P[@align='justify']/strong");

foreach (XmlNode pValue in pList)
{
    string innerText = pValue.ChildNodes[0].InnerText;
    innerText = result;
}

I am a little unclear on how to go about resolving this issue. Thank you for any further help.


Solution

  • This is not an XML string, but an HTML one.

    HTML is often not well-formed XML, and yours is not: the attribute value in align=justify is unquoted (which is what triggers the exception) and <BR> is never closed. For that reason you generally can't use XML parsers to parse HTML.

    Instead you can use the HTML Agility Pack (the recommended way), or parse the text with regular expressions (generally not recommended, but sometimes workable).

    Here is sample code showing how to get your data using the HTML Agility Pack:

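    var s = "<P align=justify><STRONG>Pricings<BR></STRONG>It was another active week for names leaving the database. The week's prints consisted of two ILS, and sever ITS.</P>";

    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(s);

    // Initialize so the variable is definitely assigned even when nothing matches.
    string result = null;
    // Select the <p> element that is a direct child of the document root.
    var p = doc.DocumentNode.SelectSingleNode("p");
    // Expect exactly two children: the <strong> element and the text node after it.
    if (p != null && p.ChildNodes.Count == 2)
        result = p.ChildNodes[1].InnerText;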
    

    Note: the HTML Agility Pack is also available as a NuGet package (HtmlAgilityPack) in Visual Studio.
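
For completeness, the regular-expression route mentioned above could look roughly like the sketch below. It is only an illustration, not a general HTML parser: it assumes the wanted text always sits between a closing </STRONG> and the closing </P>, as in the sample input. The class name PricingExtractor and the method name ExtractAfterStrong are hypothetical names made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PricingExtractor
{
    // Hypothetical helper: returns the text between </STRONG> and </P>,
    // or null when the input does not have that shape.
    public static string ExtractAfterStrong(string html)
    {
        var match = Regex.Match(
            html,
            @"</strong>(.*?)</p>",
            // IgnoreCase: the sample uses upper-case tags.
            // Singleline: lets '.' match newlines inside the paragraph.
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return match.Success ? match.Groups[1].Value.Trim() : null;
    }
}
```

Calling PricingExtractor.ExtractAfterStrong(s) with the input string from the question should return the two sentences without the "Pricings" heading. If the markup varies at all (nested tags, attributes on the closing elements, several paragraphs), the HTML Agility Pack approach is the safer choice.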