asp.net linq xml-parsing extension-methods xelement

How to get the inner text from an xml document using XDocument and extension methods


I am trying to get the "NAME" and "EMAIL" text from the following HTML file:

<!DOCTYPE html>

<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta charset="utf-8" />
    <title></title>
</head>
<body>
    <ol>
        <li>
            <font class="normal">
                <b>NAME</b> <a href="/member/mail_compose.aspx?id=name"><img src="/images/mailbox.gif" border="0" alt="Send Mail" /></a> <a href="/photos/member_viewphoto.aspx?id=name"><img src="/images/icons/member_photos.gif" border="0" alt="View  Photos" /></a> <br />
                ADDRESS<br />
                PHONE<br />
                <a href="mailto:[email protected]" class="redlink">EMAIL</a><br />
                <br />
            </font>
        </li>
</body>
</html>

Here is the code that I am using:

// Load the xml document
XDocument xDoc = XDocument.Load(@"..\..\Directory.html");

// Parse document
var names = xDoc.Root.DescendantsAndSelf()
        .Where(x => x.Name.LocalName == "ol").DescendantsAndSelf()
        .Where(x => x.Name.LocalName == "li").DescendantsAndSelf()
        .Select(x => new
                        {
                            name = x.Elements().Where(y => y.Name.LocalName == "b").Select(y => y.Value),
                            email = x.DescendantsAndSelf().Where(y => y.Name.LocalName == "a" && x.FirstAttribute.Name == "href" && x.Attribute("href").Value.Contains("mailto")).Select(y => y.Value ?? "No Email")
                        }
        );

// Print text to console
for (int i = 0; i < names.Count(); i++)
{
    Console.WriteLine("{0}: {1}", names.ElementAt(i).name, names.ElementAt(i).email);
}

Somehow, the above code is printing this:

System.Linq.Enumerable+WhereSelectEnumerableIterator`2[System.Xml.Linq.XElement,System.String]: System.Linq.Enumerable+WhereSelectEnumerableIterator`2[System.Xml.Linq.XElement,System.String]

Could someone please tell me why this is happening? Also, if there is a better way of doing this, suggestions would be very welcome.


Solution

  • To answer your first question (which is probably more important to you than the code that gets this working for the sample HTML): the name and email fields are each built with .Select, which returns a sequence (IEnumerable<string>) rather than a single string. That is why the loop over names prints the type name of the sequence instead of its contents. If a collection is actually what you want, then use SelectMany instead of Select when you create your anonymous object.

    Without a schema, I can't suggest a better way to traverse the XML before the ".Select".

    Another issue is in the email filter: it tests x (the outer element) where it should test y (the a element), and for the href attribute you should compare FirstAttribute.Name.LocalName instead of just FirstAttribute.Name:

    var names = xDoc.Root.DescendantsAndSelf()
                    .Where(x => x.Name.LocalName == "ol").DescendantsAndSelf()
                    .Where(x => x.Name.LocalName == "li").DescendantsAndSelf()
                    .Where(x => x.Name.LocalName == "font")
                    .Select(x => new
                    {
                        name = x.Descendants().Where(y => y.Name.LocalName == "b").Select(y => y.Value).Single(),
                        email = x.Descendants().Where(y => y.Name.LocalName == "a" && y.FirstAttribute.Name.LocalName == "href" && y.Attribute("href").Value.Contains("mailto")).Select(y => y.Value).Single()
                    });
    

    Some notes:

    y.Value ?? "No Email"
    

    needs to be redone because y.Value will never be null: an element with no text has an empty string as its Value.
    Also, your HTML is missing the closing </ol> tag :)
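
Since Single() throws when nothing matches and y.Value is never null, one possible way to get a "No Email" fallback is to take FirstOrDefault() on the filtered sequence and apply ?? to that result instead. The sketch below is only an illustration under assumptions: the inline document and the someone@example.com address are hypothetical stand-ins for Directory.html, the second li (with no mailto link) is made up to exercise the fallback, and using XNamespace instead of the LocalName comparisons is an alternative way of matching the namespaced elements, not something the original code does.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical inline sample standing in for Directory.html (well-formed, with </ol>).
var xDoc = XDocument.Parse(@"
<html xmlns=""http://www.w3.org/1999/xhtml"">
  <body>
    <ol>
      <li><font class=""normal""><b>NAME</b><br />
        <a href=""mailto:someone@example.com"" class=""redlink"">EMAIL</a></font></li>
      <li><font class=""normal""><b>NO-MAIL NAME</b><br /></font></li>
    </ol>
  </body>
</html>");

// Elements inherit the default XHTML namespace, so qualify names with it.
XNamespace ns = "http://www.w3.org/1999/xhtml";

// Passing the sequence itself prints its type name, not the text it would yield.
Console.WriteLine(xDoc.Descendants(ns + "b").Select(b => b.Value));

var entries = xDoc.Descendants(ns + "font")
    .Select(font => new
    {
        // FirstOrDefault() yields one string (or null), so ?? can supply a fallback.
        Name = font.Elements(ns + "b").Select(b => b.Value).FirstOrDefault() ?? "No Name",
        Email = font.Elements(ns + "a")
            // Casting a missing attribute to string gives null rather than throwing.
            .Where(a => ((string)a.Attribute("href") ?? "").StartsWith("mailto:", StringComparison.Ordinal))
            .Select(a => a.Value)
            .FirstOrDefault() ?? "No Email"
    })
    .ToList();

foreach (var entry in entries)
{
    Console.WriteLine("{0}: {1}", entry.Name, entry.Email);
}
// The loop prints:
// NAME: EMAIL
// NO-MAIL NAME: No Email
```

If every entry is guaranteed to have exactly one name and one mailto link, Single() as in the answer's query is the stricter choice, because it fails loudly on unexpected input.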