Tags: python, scrapy, python-requests, scrapy-shell

Scrapy extracting <li> with span inside


I'm trying to extract the text from this HTML structure:

<div class="col-6 col-lg-3">
    <span class="font-weight-bold">List of Birds</span>
        <ul class="bird-forms">
            <li>Crow <span class="color">Black</span></li>
            <li>Peacock <span class="color">Multicolored</span></li>
            <li>Dove <span class="color">Multicolored</span></li>
            <li>Sparrow <span class="color">Brown</span></li>
            <li>Goose <span class="color">Multicolored</span></li>
            <li>Ostrich <span class="color">Multicolored</span></li>
        </ul>
</div>

In the Scrapy shell I'm running: response.css('ul.bird-forms li ::text').extract()

I want the result to look like this:

['Crow Black', 
 'Peacock Multicolored',
 'Dove Multicolored', 
 'Sparrow Brown', 
 'Goose Multicolored',
 'Ostrich Multicolored']

Instead of this:

['Crow',
 'Black', 
 'Peacock',
 'Multicolored', 
 'Dove', 
 'Multicolored', 
 'Sparrow', 
 'Brown',
 'Goose', 
 'Multicolored',
 'Ostrich', 
 'Multicolored']

Solution

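  • Use the XPath string() function on each <li>. The ::text selector returns every text node separately, which is why the <span> text ends up as its own list item. string(.) concatenates the text of the element and all its descendants into a single string: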

    birds = []
    for li in response.xpath('//ul[@class="bird-forms"]/li'):
        bird = li.xpath('string(.)').get()
        birds.append(bird)