c# html asp.net .net csquery

CSQuery select elements


I have an HTML file like the following:

<h3>
    <div id='type'>
        Type 1
    </div>

    <div id='price'>
        127.76;
    </div>
</h3>

 <h3>
    <div id='type'>
        Type 2
    </div>

    <div id='price'>
        127.76;
    </div>
</h3>

Now I want to use CsQuery to extract those types and prices into a List. Here is the code I'm working on:

var doc = CQ.Create(htmlfile);

var types= (from listR in doc["<h3>"] //get the h3 tag
    select new TypeTest
    {
        Typename =  listR.GetAttribute("#type"),
        Price = listR.GetAttribute("#price")
    }
    ).ToList();
return types;

However, I can't get the details I want, as I'm not sure what selector to pass to doc[] to select the h3 elements. The HTML file cannot be modified.


Solution

  • The HTML you are parsing is invalid because it reuses IDs: id='type' and id='price' each appear twice, whereas an id should be unique within a document. Since you cannot change the file, take the following steps:

    1. Load the DOM.
    2. Select the type divs and the price divs as two separate collections.
    3. Use the Zip function to join them back together and project each pair into your TypeTest object.

    Below is a working example:

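    // 1. Load the DOM (html holds the markup as a string)
    var doc = CQ.Create(html);
    
    // 2. Select the type and price divs as two parallel collections
    var typeDivs = doc["h3 > div#type"];
    var priceDivs = doc["h3 > div#price"];
    
    // 3. Pair the divs by position and project each pair into a TypeTest
    var types = typeDivs.Zip(priceDivs, (typeDiv, priceDiv) =>
          new TypeTest { Typename = typeDiv.InnerText.Trim(),
          Price = priceDiv.InnerText.Trim() })
         .ToList();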
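
Step 3 depends on Zip pairing elements strictly by position, so it may help to see that behaviour in isolation. The sketch below uses plain strings instead of CsQuery elements. The TypeTest class is a hypothetical stand-in (its definition is not shown in the question), and ZipDemo.Pair is an illustrative helper, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the asker's TypeTest class; the real definition is not shown.
public class TypeTest
{
    public string Typename { get; set; } = "";
    public string Price { get; set; } = "";
}

public static class ZipDemo
{
    // Pairs the i-th name with the i-th price.
    // Enumerable.Zip stops as soon as either sequence runs out.
    public static List<TypeTest> Pair(IEnumerable<string> names, IEnumerable<string> prices)
    {
        return names
            .Zip(prices, (name, price) => new TypeTest
            {
                Typename = name.Trim(),
                Price = price.Trim()
            })
            .ToList();
    }

    public static void Main()
    {
        // Sample values mimicking the whitespace-padded text of the divs in the question.
        var names = new[] { "\n        Type 1\n    ", "\n        Type 2\n    " };
        var prices = new[] { "\n        127.76;\n    ", "\n        127.76;\n    " };

        foreach (var t in Pair(names, prices))
        {
            Console.WriteLine($"{t.Typename}: {t.Price}");
        }
    }
}
```

Because Zip silently drops unmatched trailing items, an h3 that is missing its price div would shift every later pairing, so this approach assumes each h3 contains exactly one type div and one price div. Note also that Trim() only removes whitespace: the sample prices keep their trailing semicolon ("127.76;") unless you strip it as well, for example with TrimEnd(';').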