I am working on a web scraping project using Scrapy and I am stuck on iterating through multiple pages that are grouped by alphabet labels.
My goal is to scrape medicine data from https://www.1mg.com/drugs-all-medicines by iterating through the alphabet labels and then looping through the pages within each label. Here's what I have so far:
import scrapy

class MedSpider(scrapy.Spider):
    name = "medspider"
    allowed_domains = ["www.1mg.com"]
    start_urls = ["https://www.1mg.com/drugs-all-medicines"]
    current_alphabet = 'a'  # initial alphabet label
    current_page = 2        # initial page number

    def parse(self, response):
        meds = response.css('div.style__flex-1___A_qoj')
        for med in meds:
            yield {
                'name': med.css('div div::text').get(),
                'price': med.css('div:has(> span)::text').getall()[-1],
                'strip content': med.css('div::text').getall()[-4],
                'manufacturer': med.css('div::text').getall()[-3],
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None and self.current_page <= 1076:
            url_page = 'https://www.1mg.com/drugs-all-medicines?page=' + str(self.current_page)
            self.current_page += 1  # increment the page number
            yield response.follow(url_page, callback=self.parse)
        else:
            if self.current_alphabet < 'z':
                self.current_alphabet = chr(ord(self.current_alphabet) + 1)  # move to the next alphabet label
                self.current_page = 2  # reset the page number
                url_label = 'https://www.1mg.com/drugs-all-medicines?label=' + self.current_alphabet
                yield response.follow(url_label, callback=self.parse)
Now when I run the code, the first label and all its pages are captured, but when the spider moves on to the second label 'b' it only scrapes the first page of 'b' and then stops the iteration with this message:
DEBUG: Filtered duplicate request: <GET https://www.1mg.com/drugs-all-medicines?page=2> - no more duplicate
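From what I understand, Scrapy's built-in duplicate filter drops any request whose URL was already seen earlier in the crawl, and my code builds the same ?page=2 URL for label 'b' that it already visited under label 'a'. I know a single request can opt out of the filter, for example:

    yield response.follow(url_page, callback=self.parse, dont_filter=True)  # skip the dupefilter for this one request

but that would just re-scrape the label 'a' pages instead of reaching the label 'b' pages.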
The problem comes from the differing URL patterns:
URL for label 'a': first page - https://www.1mg.com/drugs-all-medicines
second page - https://www.1mg.com/drugs-all-medicines?page=2
third page - https://www.1mg.com/drugs-all-medicines?page=3 and so on
whereas for label 'b' and onwards: first page - https://www.1mg.com/drugs-all-medicines?label=b
second page - https://www.1mg.com/drugs-all-medicines?page=2&label=b
third page - https://www.1mg.com/drugs-all-medicines?page=3&label=b and so on
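If I understand the pattern right, a correct URL for any (label, page) pair would have to look something like this (build_url is just a sketch to illustrate the pattern, not code I'm running):

    def build_url(label, page):
        base = 'https://www.1mg.com/drugs-all-medicines'
        # label 'a' URLs omit the label parameter entirely
        if label == 'a':
            return base if page == 1 else f'{base}?page={page}'
        # other labels: the first page carries only the label, later pages carry both
        if page == 1:
            return f'{base}?label={label}'
        return f'{base}?page={page}&label={label}'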
What changes should I make in my code?
What worked for me was passing the label and page number as callback arguments (cb_kwargs) on each request, then building the next URL at the end of each parse call from those values: increment the page number and carry the label over. When I finally stopped the spider it had already produced 42K unique items.
For example:
import scrapy
import string

class MedspiderSpider(scrapy.Spider):
    base = "https://www.1mg.com/drugs-all-medicines?label="
    name = "medspider"
    allowed_domains = ["www.1mg.com"]
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
    }

    def start_requests(self):
        # seed one request per alphabet label, carrying the page and label with each request
        for char in string.ascii_lowercase:
            yield scrapy.Request(self.base + char, cb_kwargs={"page": 1, "label": char})

    def parse(self, response, page, label):
        meds = response.css('div.style__flex-1___A_qoj')
        for med in meds:
            yield {
                'name': med.css('div div::text').get(),
                'price': med.css('div:has(> span)::text').getall()[-1],
                'strip content': med.css('div::text').getall()[-4],
                'manufacturer': med.css('div::text').getall()[-3],
            }
        # only request the next page if this one actually had results; since every
        # generated URL includes the label, the dupefilter never kicks in
        if meds:
            next_page = self.base + label + f"&page={page + 1}"
            yield scrapy.Request(next_page, cb_kwargs={"page": page + 1, "label": label})
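The nice part of this approach is that cb_kwargs keeps the pagination state on each request instead of in shared spider attributes, so concurrent requests for different labels can't overwrite each other's page counters, and every URL is unique so the dupefilter stays out of the way. Assuming the spider lives inside a Scrapy project, you can run it and export the items with:

    scrapy crawl medspider -o meds.json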