I don't know how to frame this question exactly. I am beginner at web scraping and I am trying to crawl a website using Python Scrapy. The website is dynamic and uses javascript and can't retrieve any data using the basic level xpath and CSS selectors.
I am trying to mimic the API request through my spider by requesting the url which has the data in json object. That request url is throwing a HTTP status code is not handled or not allowed error. I think I am calling the wrong URL. 9/10 times this method of calling the json object url directly has worked for me. What can I do different? the url has parameters and form data items in the headers section and the url doesn't even look like a valid website url it starts with https://ih3kc909gb-dsn.algolia.net/1/indexes.... I know this is a long question but I could really use some help with how to get a response for this?
You should use start_requests()
method instead of start_urls
property. You can read more about it from here . Now, all you need to do is to make a POST request.
Code
import scrapy
class carswitch(scrapy.Spider):
name = 'car'
headers = {
"Connection": "keep-alive",
"Pragma": "no-cache",
"Cache-Control": "no-cache",
"sec-ch-ua": "\" Not;A Brand\";v=\"99\", \"Google Chrome\";v=\"91\", \"Chromium\";v=\"91\"",
"accept": "application/json",
"sec-ch-ua-mobile": "?0",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
"content-type": "application/x-www-form-urlencoded",
"Origin": "https://carswitch.com",
"Sec-Fetch-Site": "cross-site",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Dest": "empty",
"Referer": "https://carswitch.com/",
"Accept-Language": "en-US,en;q=0.9"
}
body = '{"params":"query=&hitsPerPage=24&page=0&numericFilters=%5B%22country_id%3D1%22%2C%22used_car%20%3D%201%22%5D&facetFilters=&typoTolerance=&tagFilters=%5B%5D&attributesToHighlight=%5B%5D&attributesToRetrieve=%5B%22make%22%2C%22make_ar%22%2C%22model%22%2C%22model_ar%22%2C%22year%22%2C%22trim%22%2C%22displayTrim%22%2C%22colorPaint%22%2C%22bodyType%22%2C%22salePrice%22%2C%22transmissionType%22%2C%22GPS%22%2C%22carID%22%2C%22inspectionID%22%2C%22inspectionStatus%22%2C%22rate%22%2C%22certified_dealer_id%22%2C%22dealer_category%22%2C%22used_car%22%2C%22new%22%2C%22top_condition%22%2C%22featured%22%2C%22photo%22%2C%22modifiedPlace%22%2C%22city%22%2C%22mileage%22%2C%22urgent_sales%22%2C%22price_dropped%22%2C%22urgent_sales_days%22%2C%22urgent_sales_end_date%22%2C%22date%22%2C%22negotiable%22%2C%22oldPrice%22%2C%22zero_downpayment%22%2C%22cashOnly%22%2C%22hasPriceGuidance%22%2C%22dealerOffer%22%2C%22maxPrice%22%2C%22fairPrice%22%2C%22pricey_deal%22%2C%22fair_deal%22%2C%22good_deal%22%2C%22great_deal%22%2C%22dealership_info%22%2C%22logo_small%22%2C%22GCCspecs%22%2C%22country%22%2C%22export%22%2C%22monthly_price%22%5D"}'
def start_requests(self):
url = 'https://ih3kc909gb-dsn.algolia.net/1/indexes/All_Carswitch_Cars/query?x-algolia-agent=Algolia%20for%20JavaScript%20(3.33.0)%3B%20Browser&x-algolia-application-id=IH3KC909GB&x-algolia-api-key=493a9bbc57331df3b278fa39c1dd8f2d'
yield Request(url=url, method='POST', headers=self.headers, body=self.body, callback=self.parse)
def parse(self,response):
print(response.body)