Tags: python, css, scrapy

Scrapy CSS selector to select one of multiple similar tag structures


I would like to know how to select data from one specific tag structure among several similar ones. For example, consider the structure below. I would like to select data only from the first "vcol", so that I can read ITEM1, ITEM2, ITEM3, and ITEM4.

<div class="main">

    <div class="vcol">
        <div class="cls">
            <ul class="ul_style">
                <li class="li_style"> <h3 class="item"> ITEM1 </h3></li>
                <li class="li_style"> <h3 class="item"> ITEM2 </h3></li>
                <li class="li_style"> <h3 class="item"> ITEM3 </h3></li>
                <li class="li_style"> <h3 class="item"> ITEM4 </h3></li>
            </ul>
        </div>
    </div>

    <div class="vcol">
        <div class="cls">
            <ul class="ul_style">
                <li class="li_style"> <h3 class="item"> ITEM5 </h3></li>
                <li class="li_style"> <h3 class="item"> ITEM6 </h3></li>
                <li class="li_style"> <h3 class="item"> ITEM7 </h3></li>
                <li class="li_style"> <h3 class="item"> ITEM8 </h3></li>
            </ul>
        </div>
    </div>

    <div class="vcol">
        <div class="cls">
            <ul class="ul_style">
                <li class="li_style"> <h3 class="item"> ITEM9 </h3></li>
                <li class="li_style"> <h3 class="item"> ITEM10 </h3></li>
                <li class="li_style"> <h3 class="item"> ITEM11 </h3></li>
                <li class="li_style"> <h3 class="item"> ITEM12 </h3></li>
            </ul>
        </div>
    </div>
    
</div>

If I write Scrapy code as below, I get all of ITEM1 through ITEM12.

for item in response.css('.vcol .cls .ul_style li'):
    item.css('h3::text').extract_first()

Any suggestions on how to get only ITEM1 through ITEM4?

I tried looping through the different classes, but I always get all of the items.


Solution

  • As mentioned in my comments, you can use the :nth-child() CSS selector to match only the first <div class="vcol">.

    Also, I don't think it's worth doing two CSS queries, since each <li> contains exactly one ITEM*. You could use .vcol:nth-child(1) h3 or .vcol:nth-child(1) .item to select the HTML tags you want to extract.