Tags: python, web-scraping, scrapy

Scrapy not giving output but works in shell


I'm new to web scraping and I've been trying to scrape this website. The problem I'm encountering is that my selectors work when I run them in the Scrapy shell, but the spider scrapes nothing when I run the full Scrapy project.

Here is my spider:

import scrapy


class MedspiderSpider(scrapy.Spider):
    name = "medspider"
    allowed_domains = ["www.1mg.com"]
    start_urls = ["https://www.1mg.com/drugs-all-medicines"]

    def parse(self, response):
        meds = response.css('div.style__flex-1___A_qoj')
        
        for med in meds:
            yield {
                'name': med.css('div div::text').get(),
                'price': med.css('div:has(> span)::text').getall()[-1],
                'strip content': med.css('div::text').getall()[-4],
                'manufacturer': med.css('div::text').getall()[-3],
            }
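Note that parse is a generator: when response.css(...) matches nothing, the loop body never runs and zero items are yielded, which matches the "scraped 0 items" line in the crawl log below. A minimal stdlib illustration of that behaviour (plain lists stand in for Scrapy's SelectorList):

```python
def parse_like(selection):
    # mimics parse(): one dict yielded per matched element
    for med in selection:
        yield {"name": med}

# a non-matching CSS selector behaves like an empty list: nothing is yielded
print(len(list(parse_like([]))))  # 0 items, as in the crawl log

# a matching selector yields items, as seen in the shell session
print(list(parse_like(["Augmentin 625 Duo Tablet"])))
```

So the crawl itself succeeds (both requests return 200); it is the selector that finds no elements in the response the spider receives.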

Running this code with scrapy crawl medspider gives the following output:

2023-05-28 11:44:22 [scrapy.utils.log] INFO: Scrapy 2.9.0 started (bot: tata1mg)
2023-05-28 11:44:22 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.1, Twisted 22.10.0, Python 3.11.0 (main, Oct 24 2022, 18:26:48) [MSC v.1933 64 bit (AMD64)], pyOpenSSL 23.1.1 (OpenSSL 3.1.0 14 Mar 2023), cryptography 40.0.2, Platform Windows-10-10.0.22624-SP0
2023-05-28 11:44:22 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tata1mg',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'tata1mg.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['tata1mg.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2023-05-28 11:44:22 [asyncio] DEBUG: Using selector: SelectSelector
2023-05-28 11:44:22 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-05-28 11:44:22 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2023-05-28 11:44:22 [scrapy.extensions.telnet] INFO: Telnet Password: 74ffa37a1d23bc25
2023-05-28 11:44:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2023-05-28 11:44:22 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-05-28 11:44:22 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-05-28 11:44:22 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-05-28 11:44:22 [scrapy.core.engine] INFO: Spider opened
2023-05-28 11:44:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-05-28 11:44:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-05-28 11:44:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.1mg.com/robots.txt> (referer: None)
2023-05-28 11:44:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.1mg.com/drugs-all-medicines> (referer: None)
2023-05-28 11:44:23 [scrapy.core.engine] INFO: Closing spider (finished)
2023-05-28 11:44:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 896,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 48250,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 0.670526,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 5, 28, 6, 14, 23, 763883),
 'httpcompression/response_bytes': 264903,
 'httpcompression/response_count': 2,
 'log_count/DEBUG': 5,
 'log_count/INFO': 10,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2023, 5, 28, 6, 14, 23, 93357)}
2023-05-28 11:44:23 [scrapy.core.engine] INFO: Spider closed (finished)

But when I run the same selectors in the Scrapy shell, they return the expected values:

In [21]: med.css('div div div ::text').get()
Out[21]: 'Augmentin 625 Duo Tablet'
In [25]: med.css('div::text').getall()[-4]
Out[25]: 'strip of 10 tablets'
In [26]: med.css('div::text').getall()[-3]
Out[26]: 'Glaxo SmithKline Pharmaceuticals Ltd'

Solution

  • Either in your settings.py file, or via your spider's custom_settings attribute, assign a browser-like USER_AGENT value. The server most likely returns different markup to Scrapy's default user-agent string, so your selectors match nothing during the crawl.

    For example:

    settings.py

    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
    

    or in your spider module:

    ...
    
    class MedspiderSpider(scrapy.Spider):
        name = "medspider"
        allowed_domains = ["www.1mg.com"]
        start_urls = ["https://www.1mg.com/drugs-all-medicines"]
        custom_settings = {
            "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
        }
    
    ...
    

    And then try running scrapy crawl ... again
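As a side note, custom_settings is merged over the project's settings.py at a higher priority, so the spider-level value wins when both define USER_AGENT. A simplified stdlib sketch of that precedence (not Scrapy's actual Settings machinery):

```python
# project-level settings, as loaded from settings.py
project_settings = {
    "BOT_NAME": "tata1mg",
    "USER_AGENT": "Scrapy/2.9 (+https://scrapy.org)",  # Scrapy's default-style UA
}

# spider-level custom_settings, as defined on the spider class
custom_settings = {
    "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
}

# on key collisions the later dict wins, mirroring Scrapy's priority order
effective = {**project_settings, **custom_settings}
print(effective["USER_AGENT"])  # the browser-like spider-level value
```

The same override can also be passed on the command line with the -s flag if you only want to test it for one run.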