When I want to run a Scrapy spider, I can do it by calling any of:

- `scrapy.cmdline.execute(['scrapy', 'crawl', 'myspider'])`
- `os.system('scrapy crawl myspider')`
- `subprocess.run(['scrapy', 'crawl', 'myspider'])`

My question is: why would I prefer `scrapy.cmdline.execute` over `subprocess.run` or `os.system`?
I haven't found a word about this function in the Scrapy docs, nor does it have a docstring, but I see that it's actively used in some tutorials and code examples.
Both `os.system` and `subprocess.run` run the command in a subprocess, whereas with `scrapy.cmdline.execute` you are calling the Scrapy entry-point function directly, so all of the code executes in the same process as the script that called it.
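For example, a minimal sketch of the in-process call. Note that, in current Scrapy versions, `execute()` finishes by calling `sys.exit()`, so nothing after it runs unless you catch `SystemExit` (worth verifying against your installed version):

```python
import scrapy.cmdline

# Runs the crawl inside *this* Python process.
# NOTE: execute() ends with sys.exit() in current Scrapy versions,
# so the code below is unreachable unless SystemExit is caught.
scrapy.cmdline.execute(['scrapy', 'crawl', 'myspider'])

print("this line is never reached")
```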
Python officially recommends the `subprocess` module over calls to `os.system` as a general rule (see the documentation for `os.system` for more information), and the `subprocess` API is easier to use and offers more control, so the `os.system` option shouldn't really be considered.
For the other two, while I am sure there are reasons to choose one over the other, I wouldn't recommend using either method. Scrapy offers tools that help with executing spiders from scripts, such as `CrawlerProcess` and `CrawlerRunner`, that should make it unnecessary to access the CLI from a subprocess or to call the CLI entry-point function directly from your script (although I am sure there are plenty of exceptions to this).
Instead, I recommend using the CLI tool as a CLI tool, and using `CrawlerProcess` or similar when you need to control Scrapy from Python code.
See Running Scrapy from a script to learn more about running Scrapy from Python code.
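A minimal sketch based on that documentation page (assuming it runs from inside a Scrapy project, so the spider can be looked up by name):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# get_project_settings() picks up settings.py from the enclosing project.
process = CrawlerProcess(get_project_settings())
process.crawl('myspider')  # spider name as registered in the project
process.start()            # blocks until the crawl is finished
```

`CrawlerRunner` is the alternative when you need to integrate with an already-running Twisted reactor rather than letting Scrapy manage one for you.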