How do I indicate the path to my proxy list when using get_project_settings() in Scrapy?


I am trying to run my spider from my script. It runs fine from command prompt and it runs fine from the script if I don't use my proxies (except I get 403's because I'm not using proxies).

I have tried several different file paths, but none of them worked.

In settings.py I simply use

ROTATING_PROXY_LIST_PATH = 'proxylist'

This is my scrapy.cfg. I tried changing 'scraper' to 'scraper.scraper' for the heck of it, but that didn't work either.

[settings]
default = scraper.settings

[deploy]
#url = http://localhost:6800/
project = scraper

This is my project structure

  • rascraper
    • scraper
      • spiders
        • __init__.py
        • Spider.py
      • __init__.py
      • items.py
      • middlewares.py
      • pipelines.py
      • settings.py
      • scraper
      • scrapy.cfg
      • proxylist

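I don't think including the spider is relevant, but this is how I call it (in the same file):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == '__main__':
    # Load the project settings, then run the 'Acts' spider by name.
    process = CrawlerProcess(get_project_settings())
    process.crawl('Acts', artist="eddiem")
    process.start()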

Why does Scrapy not find my proxy file when the settings are loaded via get_project_settings()?


Solution

  • Your scrapy.cfg needs to be moved to its parent directory.

    According to the Scrapy docs:

    Though it can be modified, all Scrapy projects have the same file structure by default, similar to this:

    scrapy.cfg
    myproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider1.py
            spider2.py
            ...
    

    The directory where the scrapy.cfg file resides is known as the project root directory. That file contains the name of the python module that defines the project settings. Here is an example:

    [settings]
    default = myproject.settings
    

    This means the scrapy.cfg file should sit one directory above the project package, that is, above the directory that contains settings.py.
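
To make the layout rule concrete, here is a minimal sketch, using only the standard library, of how a project root can be located and how the settings module can be read from scrapy.cfg. The helper names find_project_root and settings_module are hypothetical and are not part of Scrapy's API; they only illustrate the idea of walking upward from a starting directory until a scrapy.cfg is found.

```python
import configparser
from pathlib import Path
from typing import Optional


def find_project_root(start: Path) -> Optional[Path]:
    """Walk upward from `start`; return the first directory holding scrapy.cfg."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / "scrapy.cfg").is_file():
            return candidate
    return None  # no scrapy.cfg at or above `start`


def settings_module(project_root: Path) -> str:
    """Read the [settings] default entry, e.g. 'scraper.settings'."""
    parser = configparser.ConfigParser()
    parser.read(project_root / "scrapy.cfg")
    return parser.get("settings", "default")
```

With scrapy.cfg moved up one level, calling find_project_root(Path.cwd()) from anywhere inside the project should return the directory that holds scrapy.cfg, and the dotted name returned by settings_module is importable from that directory. Note also that a relative value such as 'proxylist' is normally resolved against the current working directory, so building an absolute path from the project root (for example, str(root / "scraper" / "proxylist"), assuming the file stays next to settings.py) may be a safer choice when the script is started from different directories.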