Tags: python, html, web-scraping, python-requests, scrapy

Efficiently scrape website by going through multiple different pages/categories


I am having difficulties advancing my current scraping project/idea. I am attempting to web scrape all products on an online shop by category. The website link is: https://eshop.nomin.mn/.

Currently, with the aid of the great developers on this forum, I have been able to scrape the Food/Grocery category successfully using the online shop's data API (my code is provided at the bottom of my post). While I could duplicate this success for other categories by changing the data API URL, I believe that would be very inefficient.

Ideally I want to scrape all categories of the website using one spider rather than making a spider for each category. I do not know how to go about doing this, as in my previous projects the website's main page had all the products listed, whereas this one does not. Furthermore, adding multiple data API URLs does not seem to be working for me. Each category has a different URL and a different data API, for example:

  1. Electric products (https://eshop.nomin.mn/6011.html)
  2. Food products (https://eshop.nomin.mn/n-foods.html)
  3. Building material (https://eshop.nomin.mn/n-building-materials-tools.html)
  4. Automobile products and parts (https://eshop.nomin.mn/n-autoparts-tools.html)
  5. etc

The image below shows how you can browse the website and the categories (translated to English).

[image: the website's category navigation, translated to English]

Ideally my scraped end product would be one long table. I have included Original Price and Listed Price as separate columns because some categories, such as electric products, have two price elements in the HTML, as shown below.

<div class="item-specialPricetag-1JM">
<span class="item-oldPrice-1sY">
<span>1</span>
<span>,</span>
<span>899</span>
<span>,</span>
<span>990</span>
<span>₮</span>
</span>
</div>

<div class="item-webSpecial-Z6W">
<span>1</span>
<span>,</span>
<span>599</span>
<span>,</span>
<span>990</span>
<span>₮</span>
</div>
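
Since each price is split across multiple <span> tags, joining the pieces into one number would look roughly like the sketch below (it only uses the class names from the snippets above; in practice I am pulling prices from the data API rather than the HTML):

from scrapy import Selector

price_html = """
<div class="item-specialPricetag-1JM">
  <span class="item-oldPrice-1sY">
    <span>1</span><span>,</span><span>899</span><span>,</span><span>990</span><span>₮</span>
  </span>
</div>
<div class="item-webSpecial-Z6W">
  <span>1</span><span>,</span><span>599</span><span>,</span><span>990</span><span>₮</span>
</div>
"""

def join_price(parts):
    # join the digit spans and strip the thousands separators and the ₮ sign
    return int("".join(parts).replace(",", "").replace("₮", ""))

sel = Selector(text=price_html)
original_price = join_price(sel.css(".item-oldPrice-1sY span::text").getall())
listed_price = join_price(sel.css(".item-webSpecial-Z6W span::text").getall())
print(original_price, listed_price)  # 1899990 1599990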


Below is my current working code, which successfully scrapes the food product category and retrieves the name, description, and price for 3000+ products. Also, since I will be scraping multiple pages/categories, I think having a rotating/randomly generated header/user-agent would be smart. What would be the best way to integrate this idea?

import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess
from datetime import datetime

BASE_URL = "https://eshop.nomin.mn/graphql?query=query+category($pageSize:Int!$currentPage:Int!$filters:ProductAttributeFilterInput!$sort:ProductAttributeSortInput){products(pageSize:$pageSize+currentPage:$currentPage+filter:$filters+sort:$sort){items{id+name+sku+brand+salable_qty+brand_name+c21_available+c21_business_type+c21_reference+c21_street+c21_area+c21_bed_room+mp_daily_deal{created_at+date_from+date_to+deal_id+deal_price+remaining_time+deal_qty+discount_label+is_featured+product_id+product_name+product_sku+sale_qty+status+store_ids+updated_at+__typename}new_to_date+short_description{html+__typename}productAttributes{name+value+__typename}price{regularPrice{amount{currency+value+__typename}__typename}__typename}special_price+special_to_date+thumbnail{file_small+url+__typename}url_key+url_suffix+mp_label_data{enabled+name+priority+label_template+label_image+to_date+__typename}...on+ConfigurableProduct{variants{product{sku+special_price+price{regularPrice{amount{currency+value+__typename}__typename}__typename}__typename}__typename}__typename}__typename}page_info{total_pages+__typename}total_count+__typename}}&operationName=category&variables="


dt_today = datetime.now().strftime('%Y%m%d')
filename = dt_today + ' Nomin CPI Foods Data'

class NominCPIFoodsSpider(scrapy.Spider):
    name = 'nomin_cpi_foods'
    allowed_domains = ['eshop.nomin.mn']  # allowed_domains expects bare domains, not URLs
    custom_settings = {
        "FEEDS": {
            f'{filename}.csv': {
                'format': 'csv',
                'overwrite': True}}
    }

    # function used for start url
    def start_requests(self):
        for i in range(50):
            url = BASE_URL + '{"currentPage":' + str(i) + ',"id":24175,"filters":{"category_id":{"in":"24175"}},"pageSize":50,"sort":{"position":"DESC"}}'
            yield Request(url, self.parse)

    # function to parse
    def parse(self, response, **kwargs):
        data = response.json()
        print(data.keys())
        for item in data['data']["products"]["items"]:
            yield {
                "name": item["name"],
                "price": item["price"]["regularPrice"]["amount"]["value"],
                "description": item["short_description"]["html"]
            }

        # handles pagination
        next_url = response.css("nav.custom-pagination > a.next::attr(href)").get()
        if next_url:
            yield scrapy.Request(next_url, self.parse)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(NominCPIFoodsSpider)
    process.start()

Sorry for the long post; any and all help is greatly appreciated. Thank you very much. ^^


Solution

  • What you can do is go to the website and visit each of the categories, grab the API URL for that category, look to see how many pages of information that specific category has, and then extract the category ID out of the URL and create a dictionary in your code that keeps the category IDs as keys and the page counts as values.

    Then in your start_requests method, instead of only substituting the current page with a variable, you can do the same for the category. Then you can pretty much leave the rest unchanged.

    One thing that is unnecessary is to continue to parse the actual web pages themselves. All of the information you need is available from the API, so yielding requests for the HTML pages isn't really doing you any good.

    Here is an example using a handful of the categories available on the site.

    import scrapy
    from scrapy.crawler import CrawlerProcess
    from scrapy import Request
    from datetime import datetime
    
    categories = {
        "19653": 4,
        "24175": 67,
        "21297": 48,
        "19518": 16,
        "19487": 40,
        "26011": 46,
        "19767": 3,
        "19469": 5,
        "19451": 4
    }
    
    dt_today = datetime.now().strftime('%Y%m%d')
    filename = dt_today + ' Nomin'
    
    class Nomin(scrapy.Spider):
        name = 'nomin'
        custom_settings = {
            "FEEDS": {
                f'{filename}.csv': {
                    'format': 'csv',
                    'overwrite': True}}
        }
    
        def start_requests(self):
            for cat, pages in categories.items():
                for i in range(1, pages + 1):  # +1 so the last page is included
                    url = f'https://eshop.nomin.mn/graphql?query=query+category%28%24pageSize%3AInt%21%24currentPage%3AInt%21%24filters%3AProductAttributeFilterInput%21%24sort%3AProductAttributeSortInput%29%7Bproducts%28pageSize%3A%24pageSize+currentPage%3A%24currentPage+filter%3A%24filters+sort%3A%24sort%29%7Bitems%7Bid+name+sku+brand+salable_qty+brand_name+c21_available+c21_business_type+c21_reference+c21_street+c21_area+c21_bed_room+mp_daily_deal%7Bcreated_at+date_from+date_to+deal_id+deal_price+remaining_time+deal_qty+discount_label+is_featured+product_id+product_name+product_sku+sale_qty+status+store_ids+updated_at+__typename%7Dnew_to_date+short_description%7Bhtml+__typename%7DproductAttributes%7Bname+value+__typename%7Dprice%7BregularPrice%7Bamount%7Bcurrency+value+__typename%7D__typename%7D__typename%7Dspecial_price+special_to_date+thumbnail%7Bfile_small+url+__typename%7Durl_key+url_suffix+mp_label_data%7Benabled+name+priority+label_template+label_image+to_date+__typename%7D...on+ConfigurableProduct%7Bvariants%7Bproduct%7Bsku+special_price+price%7BregularPrice%7Bamount%7Bcurrency+value+__typename%7D__typename%7D__typename%7D__typename%7D__typename%7D__typename%7D__typename%7Dpage_info%7Btotal_pages+__typename%7Dtotal_count+__typename%7D%7D&operationName=category&variables=%7B%22currentPage%22%3A{i}%2C%22id%22%3A{cat}%2C%22filters%22%3A%7B%22category_id%22%3A%7B%22in%22%3A%22{cat}%22%7D%7D%2C%22pageSize%22%3A50%2C%22sort%22%3A%7B%22news_from_date%22%3A%22ASC%22%7D%7D'
                    yield Request(url, self.parse)
    
        def parse(self, response, **kwargs):
            data = response.json()
            if data and data['data'] and data['data']['products'] and data['data']['products']['items']:
                for item in data['data']["products"]["items"]:
                    yield {
                        "name": item["name"],
                        "price": item["price"]["regularPrice"]["amount"]["value"],
                        "description": item["short_description"]["html"]
                    }
    
    if __name__ == "__main__":
        process = CrawlerProcess()
        process.crawl(Nomin)
        process.start()
    

    P.S. The values I have for the number of pages might not be accurate. I just used what was visible at the bottom of the first page of each category, so some of the categories might have more pages.
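
    If you would rather not hard-code the page counts at all, the GraphQL query already asks for page_info{total_pages}, so another option is to request only page 1 of each category in start_requests and schedule the remaining pages from the response. A rough sketch of the two methods inside the same spider, where build_url stands for a hypothetical helper that fills the category ID and page number into the long API URL:

    def start_requests(self):
        # only the first page of every category is requested up front
        for cat_id in categories:
            yield Request(self.build_url(cat_id, 1), callback=self.parse,
                          cb_kwargs={"cat_id": cat_id, "page": 1})

    def parse(self, response, cat_id=None, page=1):
        data = response.json()
        products = data["data"]["products"]
        for item in products["items"]:
            yield {
                "category": cat_id,
                "name": item["name"],
                "price": item["price"]["regularPrice"]["amount"]["value"],
                "description": item["short_description"]["html"],
            }
        # on page 1, schedule the rest using the page count the API itself reports
        if page == 1:
            total_pages = products["page_info"]["total_pages"]
            for next_page in range(2, total_pages + 1):
                yield Request(self.build_url(cat_id, next_page), callback=self.parse,
                              cb_kwargs={"cat_id": cat_id, "page": next_page})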


    Edit:

    To send the category name with each request, you simply need to store it in the dictionary along with the ID and number of pages, and then pass it in the cb_kwargs parameter of each request yielded from start_requests.

    for example:

    categories = {
        "19653": {
            "pages": 4, 
            "name": "Food"
         },
         "33456": {
             "pages": 12,
             "name": "Outdoor"
         }
    }
    
    # This is fake information I made up for the example
    

    and then in your start_requests method:

    def start_requests(self):
        for cat, val in categories.items():
            for page in range(1, val["pages"] + 1):
                url = .....
                yield scrapy.Request(
                    url, 
                    callback=self.parse, 
                    cb_kwargs={"category": val["name"]}
                 )
    

    Then in your parse method:

        def parse(self, response, category=None):
            data = response.json()
            if data and data['data'] and data['data']['products'] and data['data']['products']['items']:
                for item in data['data']["products"]["items"]:
                    yield {
                        "category": category,
                        "name": item["name"],
                        "price": item["price"]["regularPrice"]["amount"]["value"],
                        "special_price": item["special_price"],
                        "description": item["short_description"]["html"]
                    }
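
    As for the rotating user-agent idea from the question, one common approach in Scrapy is a small downloader middleware that picks a random User-Agent for each outgoing request. A minimal sketch, where the module path and the user-agent strings are placeholders you would fill in yourself:

    import random

    # enable it in settings.py or in the spider's custom_settings
    # DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RandomUserAgentMiddleware": 400}

    USER_AGENTS = [
        # placeholder strings - replace with the real browser user agents you want to rotate
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
        "Mozilla/5.0 (X11; Linux x86_64) ...",
    ]

    class RandomUserAgentMiddleware:
        def process_request(self, request, spider):
            # overwrite the User-Agent header before the request is downloaded
            request.headers["User-Agent"] = random.choice(USER_AGENTS)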