css, scrapy, css-selectors

Select a group of elements and text using css selectors


I have an HTML page like this:

<div>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
</div>

I need to select a group like this:

<a href='link'>
<u class>name</u>
</a>
text
<br>

I need to select three values from each group: the link, the name, and the text. Is there any way to select a group like this and extract these values from each group in Scrapy, using CSS selectors, XPath, or anything else?


Solution

  • Scrapy lets you yield several values scraped from a page together as an item: a Python object (or a plain dict) that holds key-value pairs.

    You can extract each value individually and then yield them together as key-value pairs.

    • To extract the value of an attribute of an element, use the ::attr(name) pseudo-element, for example a::attr(href).
    • To extract the text of an element, use the ::text pseudo-element. It selects the element's direct text nodes, not its inner HTML.

    For example, you can define the parse method of your spider like this:

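    def parse(self, response):
        # Each call returns one list covering every group in the container
        for_link = response.css('.row.no-gutters div:nth-child(3) div:nth-child(8) a::attr(href)').getall()

        for_name = response.css('.row.no-gutters div:nth-child(3) div:nth-child(8) a u::text').getall()

        # Direct text nodes of the container; this list can also include
        # whitespace-only strings from the line breaks between tags
        for_text = response.css('.row.no-gutters div:nth-child(3) div:nth-child(8)::text').getall()

        # Yield all values together as one item
        yield {"link": for_link, "name": for_name, "text": for_text}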
    
    

    To use an Item class instead of a plain dict, open the items.py file and declare one field per key that parse yields.

    # Define the models for your scraped items here
    import scrapy

    # Replace <yourspider> with the name used in your project
    class <yourspider>Item(scrapy.Item):

        # href of the <a> element
        link = scrapy.Field()

        # Text of the <u> element
        name = scrapy.Field()

        # Text node that follows the <a> element
        text = scrapy.Field()
    
    

    For more details, read this tutorial.