Tags: python, web-scraping, scrapy

How do I scrape multiple pages with different alphabetical labels using Scrapy?


I am working on a web-scraping project using Scrapy and I am running into an issue when iterating through multiple pages grouped under alphabetical labels.

My goal is to scrape medicine data from https://www.1mg.com/drugs-all-medicines by iterating through the alphabet labels and then looping through the pages within each label. Here's what I have so far:

import scrapy

class MedSpider(scrapy.Spider):
    name = "medspider"
    allowed_domains = ["www.1mg.com"]
    start_urls = ["https://www.1mg.com/drugs-all-medicines"]
    current_alphabet = 'a'  # Initial alphabet label
    current_page = 2  # Initial page number

    def parse(self, response):
        meds = response.css('div.style__flex-1___A_qoj')

        for med in meds:
            yield {
                'name': med.css('div div::text').get(),
                'price': med.css('div:has(> span)::text').getall()[-1],
                'strip content': med.css('div::text').getall()[-4],
                'manufacturer': med.css('div::text').getall()[-3],
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None and self.current_page <= 1076:
            url_page = 'https://www.1mg.com/drugs-all-medicines?page=' + str(self.current_page)
            self.current_page += 1  # Increment the page number
            yield response.follow(url_page, callback=self.parse)
        else:
            if self.current_alphabet < 'z':
                self.current_alphabet = chr(ord(self.current_alphabet) + 1)  # Increment the alphabet label
                self.current_page = 2  # Reset the page number
                url_label = 'https://www.1mg.com/drugs-all-medicines?label=' + self.current_alphabet
                yield response.follow(url_label, callback=self.parse)

When I run the code, the first label and all of its pages are captured, but when the spider moves on to the second label 'b', it scrapes only the first page of 'b' and then stops with this message:

DEBUG: Filtered duplicate request: <GET https://www.1mg.com/drugs-all-medicines?page=2> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)

The problem comes from the differing URL patterns:

For label 'a':

first page - https://www.1mg.com/drugs-all-medicines

second page - https://www.1mg.com/drugs-all-medicines?page=2

third page - https://www.1mg.com/drugs-all-medicines?page=3, and so on

whereas for label 'b' onwards:

first page - https://www.1mg.com/drugs-all-medicines?label=b

second page - https://www.1mg.com/drugs-all-medicines?page=2&label=b

third page - https://www.1mg.com/drugs-all-medicines?page=3&label=b, and so on
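To make the two patterns concrete, here is a small helper that reproduces the URLs above (a sketch of my understanding of the site; build_page_url is just a name for illustration):

from urllib.parse import urlencode

BASE = "https://www.1mg.com/drugs-all-medicines"

def build_page_url(label: str, page: int) -> str:
    # Label 'a' is the default listing, so it never carries a ?label=
    # parameter; page 1 of any label carries no ?page= parameter.
    params = {}
    if page > 1:
        params["page"] = page
    if label != "a":
        params["label"] = label
    return BASE + ("?" + urlencode(params) if params else "")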

What changes should I make in my code?


Solution

  • What worked for me was passing the label and page number as callback keyword arguments (cb_kwargs) on each request, then building the next page's URL at the end of each parse call from those values: increment the page number and keep the label. When I finally stopped the spider it had already produced 42K unique items.

    For example:

    import scrapy
    import string


    class MedspiderSpider(scrapy.Spider):
        name = "medspider"
        allowed_domains = ["www.1mg.com"]
        base = "https://www.1mg.com/drugs-all-medicines?label="
        custom_settings = {
            # Override Scrapy's default user agent with a browser one.
            "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
        }

        def start_requests(self):
            # One entry request per alphabet label, carrying the label and
            # page number along as callback kwargs.
            for char in string.ascii_lowercase:
                yield scrapy.Request(self.base + char, cb_kwargs={"page": 1, "label": char})

        def parse(self, response, page, label):
            meds = response.css('div.style__flex-1___A_qoj')
            for med in meds:
                yield {
                    'name': med.css('div div::text').get(),
                    'price': med.css('div:has(> span)::text').getall()[-1],
                    'strip content': med.css('div::text').getall()[-4],
                    'manufacturer': med.css('div::text').getall()[-3],
                }
            # Build the next page's URL from the callback kwargs rather than
            # spider-level state, and stop once a page comes back empty.
            if meds:
                next_page = self.base + label + f"&page={page + 1}"
                yield scrapy.Request(next_page, cb_kwargs={"page": page + 1, "label": label})
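
    Passing page and label through cb_kwargs keeps each request self-describing, so there is no shared spider state to fall out of sync: the original version built page URLs without the label parameter, so the request for ?page=2 under label 'b' collided with the one already issued under label 'a' and was filtered as a duplicate. To run the spider standalone and export the items (standard Scrapy CLI; the output filename is arbitrary, and -O needs Scrapy 2.1+):

        scrapy runspider medspider.py -O medicines.csv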