Tags: python, web-scraping, scrapy

Scraping language change


I'm following a course that is a bit outdated, and some things have changed on the website https://www.centris.ca/ since it was made.

It's a real estate website, and you need to call two endpoints to get a result containing properties.

Call 1: https://www.centris.ca/property/UpdateQuery

Call 2: https://www.centris.ca/Property/GetInscriptions

Unfortunately, I cannot figure out how to stop the response from coming back in French. Things I tried:

Headers: 'Accept-Language': 'en-US,en;q=0.6'

Cookies: Nothing that indicates language

Spider:

import scrapy
from scrapy.selector import Selector
import json


class ListingsSpider(scrapy.Spider):
    name = "listings"
    allowed_domains = ["www.centris.ca"]

    position = {
        "startPosition": 0
    }

    def start_requests(self):
        query = {
            "query":{
                "UseGeographyShapes":0,
                "Filters":[
                    
                ],
                "FieldsValues":[
                    {
                        "fieldId":"Category",
                        "value":"Commercial",
                        "fieldConditionId":"",
                        "valueConditionId":""
                    },
                    {
                        "fieldId":"SellingType",
                        "value":"Rent",
                        "fieldConditionId":"",
                        "valueConditionId":""
                    },
                    {
                        "fieldId":"RentPrice",
                        "value":0,
                        "fieldConditionId":"ForRent",
                        "valueConditionId":""
                    },
                    {
                        "fieldId":"RentPrice",
                        "value":999999999999,
                        "fieldConditionId":"ForRent",
                        "valueConditionId":""
                    }
                ]
            },
            "isHomePage": True
        }
        yield scrapy.Request(
            url="https://www.centris.ca/property/UpdateQuery",
            method="POST",
            body=json.dumps(query),
            headers={
                'Content-Type': 'application/json',
                'Content-Language': 'en'
            },
            callback=self.update_query
        )


    def update_query(self, response):
        yield scrapy.Request(
            url="https://www.centris.ca/Property/GetInscriptions",
            method="POST",
            body=json.dumps(self.position),
            headers={
                'Content-Type': 'application/json',
                'accept-language': 'en-US,en;q=0.6',
                'referer': 'https://www.centris.ca/en/properties~for-rent?view=Thumbnail',
                'cache-control': 'no-cache'
            },
            cookies={'currency': 'USD', 'country': 'UY'},
            callback=self.parse
        )

    def parse(self, response):
        resp_dict = json.loads(response.body)
        html = resp_dict.get('d').get('Result').get('html')
        sel = Selector(text=html)
        listings = sel.xpath("//div[@class='property-thumbnail-item thumbnailItem col-12 col-sm-6 col-md-4 col-lg-3']")
        for listing in listings:
            print("not yet implemented")

Edit: I'm scraping the API, not the website itself.


Solution

  • The selected language is stored in the session on the server, so you must send the current session cookie to get the data in the selected language.

    You have to refactor the flow:

    First, call centris.ca/en?uc=0 to set the current locale to EN, and save the cookie from the response locally.

    Then send the saved cookie with both calls: https://www.centris.ca/property/UpdateQuery and https://www.centris.ca/Property/GetInscriptions
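
The flow described above can be sketched with the standard library alone, which makes the "save the cookie, then send it back" step explicit. This is only a sketch under assumptions: the helper names (build_session, json_post, build_flow, fetch_listings_html) are hypothetical, the full locale URL https://www.centris.ca/en?uc=0 is assumed from the host used in the question, and the response shape is taken from the question's parse() method. The site may also require extra headers (for example a browser-like User-Agent), which is not covered here.

```python
import json
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, Request, build_opener

# Assumption: same scheme and host as the endpoints in the question.
BASE = "https://www.centris.ca"


def build_session():
    """Return an opener whose cookie jar persists across requests."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


def json_post(path, payload):
    """Build (but do not send) a JSON POST request for the given path."""
    return Request(
        url=BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_flow(query, start_position=0):
    """Return the three requests in the order the answer describes."""
    return [
        # 1. Sets the session locale to English; the server replies with cookies.
        Request(BASE + "/en?uc=0"),
        # 2. Stores the search query in the same server-side session.
        json_post("/property/UpdateQuery", query),
        # 3. Fetches the listings for that session.
        json_post("/Property/GetInscriptions", {"startPosition": start_position}),
    ]


def fetch_listings_html(query):
    """Send the flow through one cookie-aware opener (performs network I/O)."""
    opener, _jar = build_session()
    body = b""
    for request in build_flow(query):
        # The jar captures cookies from each response and re-sends them next time.
        with opener.open(request, timeout=30) as response:
            body = response.read()
    # Only the last response is JSON; shape assumed from the question's parse().
    return json.loads(body)["d"]["Result"]["html"]
```

In the Scrapy spider from the question, the same effect should be achievable by yielding a request to the /en?uc=0 URL first and issuing the UpdateQuery request from its callback: Scrapy's cookies middleware is enabled by default and re-sends cookies received earlier in the crawl, so the jar does not have to be managed by hand.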