I'm creating my first Scrapy project with Splash, working with the test data from http://quotes.toscrape.com/js/
I want to store the quotes of each page as a separate file on disk (in the code below I first try to store the entire page). The code below worked when I was not using SplashRequest, but with the new code nothing is stored on disk when I 'Run and debug' it in Visual Studio Code.
Also, self.log does not write to my Visual Studio Code Terminal window. I'm new to Splash, so I'm sure I'm missing something, but what?
Already checked here and here.
import scrapy
from scrapy_splash import SplashRequest


class QuoteItem(scrapy.Item):
    author = scrapy.Field()
    quote = scrapy.Field()


class MySpider(scrapy.Spider):
    name = "jsscraper"
    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.html')

    def parse(self, response):
        for q in response.css("div.quote"):
            quote = QuoteItem()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            yield quote

        #cycle through all available pages
        for a in response.css('ul.pager a'):
            yield SplashRequest(url=a,callback=self.parse,endpoint='render.html',args={ 'wait': 0.5 })

        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
UPDATE 1
How I debug it:
Output tab is empty
Terminal tab contains:
PS C:\scrapy\tutorial> cd 'c:\scrapy\tutorial'; & 'C:\Users\Mark\AppData\Local\Programs\Python\Python38-32\python.exe' 'c:\Users\Mark\.vscode\extensions\ms-python.python-2020.9.114305\pythonFiles\lib\python\debugpy\launcher' '58582' '--' 'c:\scrapy\tutorial\spiders\quotes_spider_js.py'
PS C:\scrapy\tutorial>
Also, nothing is logged in my Docker container instance, which I thought was required for Splash to work in the first place.
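A possible reading of the terminal line above, offered as an assumption: the debugger runs the spider file as a plain Python script, and that file only defines classes, so nothing starts a crawl. Below is a minimal sketch of a runner function that starts the spider through Scrapy's CrawlerProcess. The function name run_spider is hypothetical, and the sketch assumes it is called from inside the Scrapy project directory so that get_project_settings() can find the project settings.

```python
def run_spider(spider_name="jsscraper"):
    """Start a crawl for the named spider and block until it finishes.

    run_spider is a hypothetical helper; the Scrapy imports are kept inside
    the function so that merely loading this file does not require Scrapy.
    """
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Load settings.py of the surrounding Scrapy project (assumed to exist).
    process = CrawlerProcess(get_project_settings())
    # Look the spider up by its `name` attribute, as `scrapy crawl` does.
    process.crawl(spider_name)
    # Runs the Twisted reactor; returns once the crawl has completed.
    process.start()
```

Calling run_spider() from a small script in the project root and pointing 'Run and debug' at that script should behave like running scrapy crawl jsscraper, under the assumptions above.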
UPDATE 2
I ran scrapy crawl jsscraper and a file 'quotes-js.html' is stored on disk. However, it contains the page source HTML without any JavaScript executed. I want to execute the JS code on 'http://quotes.toscrape.com/js/' and store only the quote content. How can I do so?
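One thing worth checking, stated as a guess because the project's settings.py is not shown: if the scrapy-splash middlewares are not enabled, a SplashRequest is downloaded directly rather than through Splash, which would match both the unrendered HTML and the silent Docker container. A minimal settings sketch follows, using the setting names from the scrapy-splash README. The SPLASH_URL value is an assumption (Splash published on localhost port 8050) and may differ on your machine.

```python
# settings.py (fragment): route requests through a running Splash instance.

# Assumed address of the Splash container; adjust host/port to your setup.
SPLASH_URL = "http://localhost:8050"

# Downloader middlewares that rewrite SplashRequest objects into Splash API calls.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Spider middleware that de-duplicates repeated Splash arguments.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Splash-aware duplicate filter and HTTP cache storage.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

With these in place, requests should show up in the Splash container log when the spider runs.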
WRITING OUTPUT TO A JSON FILE:
Here is a working version of your code. I hope this is what you are trying to achieve.
import json

import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "jsscraper"
    # One start URL per page, pages 1 to 10
    start_urls = ["http://quotes.toscrape.com/js/page/"+str(i+1) for i in range(10)]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='render.html',
                args={'wait': 0.5}
            )

    def parse(self, response):
        quotes = {"quotes": []}
        for q in response.css("div.quote"):
            quote = dict()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            quotes["quotes"].append(quote)

        # Everything after "page/" in the URL is used as the page number
        page = response.url[response.url.index("page/")+5:]
        print("page=", page)
        filename = 'quotes-%s.json' % page
        with open(filename, 'w') as outfile:
            outfile.write(json.dumps(quotes, indent=4, separators=(',', ":")))
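The page number in parse is taken by slicing the URL after "page/", which works for the start URLs listed but would raise ValueError on a URL without that segment and would keep a trailing slash in the filename. As a hedged alternative, here is a small sketch of a more tolerant extraction. The helper name page_from_url and its default of "1" are assumptions made for illustration, not part of the spider above.

```python
from urllib.parse import urlparse


def page_from_url(url, default="1"):
    """Return the path segment that follows 'page', or `default` if absent."""
    # Split the URL path into non-empty segments, e.g. ['js', 'page', '3'].
    segments = [s for s in urlparse(url).path.split("/") if s]
    if "page" in segments:
        idx = segments.index("page")
        if idx + 1 < len(segments):
            return segments[idx + 1]
    return default


print(page_from_url("http://quotes.toscrape.com/js/page/3/"))  # 3
```

Inside parse, the slice could then be replaced by page = page_from_url(response.url).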
UPDATE: The spider above has been updated to scrape all pages and save the results in separate JSON files, one per page, for pages 1 to 10.
This writes the list of quotes from each page to its own JSON file, as follows:
{
    "quotes":[
        {
            "author":"Albert Einstein",
            "quote":"\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"
        },
        {
            "author":"J.K. Rowling",
            "quote":"\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"
        },
        {
            "author":"Albert Einstein",
            "quote":"\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d"
        },
        {
            "author":"Jane Austen",
            "quote":"\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
        },
        {
            "author":"Marilyn Monroe",
            "quote":"\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d"
        },
        {
            "author":"Albert Einstein",
            "quote":"\u201cTry not to become a man of success. Rather become a man of value.\u201d"
        },
        {
            "author":"Andr\u00e9 Gide",
            "quote":"\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d"
        },
        {
            "author":"Thomas A. Edison",
            "quote":"\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d"
        },
        {
            "author":"Eleanor Roosevelt",
            "quote":"\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d"
        },
        {
            "author":"Steve Martin",
            "quote":"\u201cA day without sunshine is like, you know, night.\u201d"
        }
    ]
}
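The \u201c and \u201d sequences in the output are the curly quotation marks from the page, escaped because json.dumps defaults to ASCII-only output. If readable characters are preferred in the files, one option is to pass ensure_ascii=False and write the file as UTF-8. The sketch below shows this with a hypothetical helper named dump_quotes; the UTF-8 encoding is an assumed choice.

```python
import json


def dump_quotes(quotes, filename):
    """Write `quotes` as indented JSON, keeping non-ASCII characters as-is."""
    # An explicit encoding avoids depending on the platform default (e.g. cp1252).
    with open(filename, "w", encoding="utf-8") as outfile:
        json.dump(quotes, outfile, indent=4, ensure_ascii=False)
```

In the spider, the with block in parse could call dump_quotes(quotes, filename) instead. Any JSON parser reads both forms back to the same strings, so this only affects how the files look in an editor.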