Tags: c#, html, xpath, html-parsing, html-agility-pack

Get the text between HTML tags with XPath and Html Agility Pack


So far I am trying to retrieve the text between HTML tags on a certain website.

Say, for instance, I need to extract the text between these span tags. How would I go about that? I am receiving an error stating "Object reference not set to an instance of an object".

Here is the HTML. There is more HTML before this portion; I don't know if that makes a difference.

<div class="thumbnail-details">
<ul>
    <li> … </li>
    <li class="product-title">
        <span class="thumbnail-details-grey">The Blaster Portable Wireless Speaker in Black</span>
    </li>
    <li> … </li>
</ul>
</div>

So far, my C# code is:

    HtmlWeb hw = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument htmlDoc = hw.Load(@"http://www.karmaloop.com/Browse.htm#Pgroup=1");
    if (htmlDoc.DocumentNode != null)
    {
        foreach (HtmlNode text in htmlDoc.DocumentNode.SelectNodes("//span[@class='thumbnail-details-grey']/text()"))
        {
            Console.WriteLine(text.InnerText);
        }
    }

Can I get some help here? I want to extract "The Blaster Portable Wireless Speaker in Black".


Solution

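  • Your code works just fine, but you have to load the right page. The page you are loading uses an AJAX request to fetch the results you see in your browser, so the HTML that hw.Load receives does not contain the spans you are looking for. SelectNodes returns null when nothing matches, and the foreach over that null is what throws the exception.

    So instead of the URL you are currently using, you have to use: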

    HtmlDocument htmlDoc = hw.Load(@"http://www.karmaloop.com/Browse?Pgroup=1&ajax=true&version=2");
    

    Then your code works. I'm still looking for the place where this request gets put together.

    But the query looks rather easy to guess. For example, the page http://www.karmaloop.com/Browse.htm#Pdept=11&PageSize=30&Pgroup=1 requests the URL http://www.karmaloop.com/Browse?Pdept=11&PageSize=30&Pgroup=1&ajax=true&version=2. So all you have to do is take the part of your URL after the # and build the new URL from it.
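
The URL rewriting described above could be sketched roughly as follows. This is only a guess based on the two URL pairs in this answer: the class and method names (KarmaloopUrl, BuildAjaxUrl) are hypothetical, and the rules it applies (drop ".htm", move the part after # into the query string, append ajax=true and version=2) are inferred from those examples and may not hold for other pages on the site.

```csharp
using System;

static class KarmaloopUrl
{
    // Hypothetical helper (not part of Html Agility Pack): turns the URL shown
    // in the browser into the URL the page appears to request in the background.
    public static string BuildAjaxUrl(string browserUrl)
    {
        int hash = browserUrl.IndexOf('#');
        if (hash < 0 || hash == browserUrl.Length - 1)
            throw new ArgumentException("URL has no '#query' part.", nameof(browserUrl));

        // Everything after '#' becomes the query string of the new URL.
        string query = browserUrl.Substring(hash + 1);

        // "http://www.karmaloop.com/Browse.htm" -> "http://www.karmaloop.com/Browse"
        string page = browserUrl.Substring(0, hash);
        const string suffix = ".htm";
        if (page.EndsWith(suffix, StringComparison.OrdinalIgnoreCase))
            page = page.Substring(0, page.Length - suffix.Length);

        // Assumption: these two trailing parameters are copied from the
        // example URLs in the answer; their meaning is not documented here.
        return page + "?" + query + "&ajax=true&version=2";
    }
}

static class Program
{
    static void Main()
    {
        Console.WriteLine(KarmaloopUrl.BuildAjaxUrl(
            "http://www.karmaloop.com/Browse.htm#Pdept=11&PageSize=30&Pgroup=1"));
    }
}
```

The returned string can then be passed to hw.Load in place of the original URL.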