Tags: python, web-scraping, scrapy

Scraping text across sections using Scrapy


I am currently using Scrapy to scrape a website. The website has a number of sublinks, which I am able to follow. Each sublink has three things I need: title, description, and content. I am able to get the title and description, but the content is split across several sections, and the number of sections differs per sublink, as in this example: [image]

I tried using a loop to go through each section and store its text, but the yield gives me the title, the description, and the content from the last section only.

Below is the code:

def parse_instructions(self, response):
    title = response.xpath('//*[@id="d-article"]/div[1]/div[1]/h1/text()').get()
    description = response.xpath('//*[@id="ency_summary"]/p/text()').getall()
    joined_description = ' '.join(description)
    sections = response.css('section div.section:not([class*=" "])')

    for section in sections:
        section_text = ' '.join(section.css('p::text').getall())
        section_text = ' '.join('a::text').getall()
        section_text = ' '.join('ul::text').getall()

    yield {
        "title": title,
        "description": joined_description,
        "section_text": section_text,
    }

Solution

  • It's because your selectors do not match the markup on the page. Edit them as below:

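    def parse_instructions(self, response):
        # Title: text of the first h1 on the page.
        title = response.css("h1::text").get()
        # Description: first matching text node of a <p> inside #ency_summary.
        description = response.css("#ency_summary p::text").get()
        # Content: text nodes of all elements nested under any div
        # whose id contains "section-".
        sections = response.xpath("//div[contains(@id,'section-')]//*/text()").getall()
        # Concatenate the text of all sections into one string.
        section_text = ''.join(sections)
    
        yield {
            "title": title,
            "description": description,
            "section_text": section_text,
        }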
    

    Note: Avoid overly complicated selectors, as they are more prone to break.
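
If the sections need to stay separate rather than being merged into one string, the loop from the question can be kept, as long as each section's text is appended to a list instead of being reassigned on every pass. Reassigning inside the loop is consistent with only the last section showing up in the output. The following is a minimal sketch, not a tested solution for the actual site: it assumes each section is a div whose id contains "section-", as in the XPath above, and the helper `clean_fragments` and the `"sections"` key are hypothetical names introduced for illustration.

```python
def clean_fragments(fragments):
    """Strip each text fragment, drop the empty ones, and join with spaces."""
    stripped = (part.strip() for part in fragments)
    return " ".join(part for part in stripped if part)


def parse_instructions(self, response):
    # Meant to be used as a callback method on a scrapy.Spider subclass.
    title = response.css("h1::text").get()
    description = clean_fragments(response.css("#ency_summary p::text").getall())

    section_texts = []
    # Assumption: every content section is a div whose id contains "section-".
    for section in response.xpath("//div[contains(@id, 'section-')]"):
        # The leading dot keeps the XPath relative to the current section;
        # .//text() also picks up text nested in p, a, ul/li, and so on.
        fragments = section.xpath(".//text()").getall()
        # Append rather than reassign, so earlier sections are not lost.
        section_texts.append(clean_fragments(fragments))

    # Yield once, after the loop, with one list entry per section.
    yield {
        "title": title,
        "description": description,
        "sections": section_texts,
    }
```

Joining the list afterwards, for example with `"\n\n".join(section_texts)`, gives a single string again while keeping a visible break between sections.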