Tags: python, debugging, web-scraping, python-requests, web-inspector

Web Scraping: Identifying, executing, and troubleshooting a request


I am having some trouble scraping data from the following website:

https://www.loft.com.br/apartamentos/sao-paulo-sp?q=pin

When the page loads, it shows the first ~30 real estate listings in the city of São Paulo. Scrolling down loads more listings.

Usually I would use Selenium to get around this, but I want to learn how to do it properly, which I imagine means working with the underlying requests directly.

Using Chrome's Inspect tool and watching what happens when I scroll down, I can see a request being made, which I presume is what retrieves the new posts.


If I copy its content as curl, I get the following command:

curl "https://landscape-api.loft.com.br/listing/search?city=S^%^C3^%^A3o^%^20Paulo^&facetFilters^\[^\]=address.city^%^3AS^%^C3^%^A3o^%^20Paulo^&limit=18^&limitedColumns=true^&loftUserId=417b37df-19ab-4014-a800-688c5acc039d^&offset=28^&orderBy^\[^\]=rankB^&orderByStatus=^%^27FOR_SALE^%^27^%^2C^%^20^%^27JUST_LISTED^%^27^%^2C^%^20^%^27DEMOLITION^%^27^%^2C^%^20^%^27COMING_SOON^%^27^%^20^%^2C^%^20^%^27SOLD^%^27^&originType=LISTINGS_LOAD_MORE^&q=pin^&status^\[^\]=FOR_SALE^&status^\[^\]=JUST_LISTED^&status^\[^\]=DEMOLITION^&status^\[^\]=COMING_SOON^&status^\[^\]=SOLD" ^
  -X "OPTIONS" ^
  -H "Connection: keep-alive" ^
  -H "Accept: */*" ^
  -H "Access-Control-Request-Method: GET" ^
  -H "Access-Control-Request-Headers: loft_user_id,loftuserid,utm_campaign,utm_content,utm_created_at,utm_id,utm_medium,utm_source,utm_term,utm_user_agent,x-user-agent,x-utm-source,x-utm-user-id" ^
  -H "Origin: https://www.loft.com.br" ^
  -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36" ^
  -H "Sec-Fetch-Mode: cors" ^
  -H "Sec-Fetch-Site: same-site" ^
  -H "Sec-Fetch-Dest: empty" ^
  -H "Referer: https://www.loft.com.br/" ^
  -H "Accept-Language: en-US,en;q=0.9" ^
  --compressed

I am unsure of the proper way to convert this into a call to Python's requests module, so I used this website to do it: https://curl.trillworks.com/

The result is:

import requests

headers = {
    'Connection': 'keep-alive',
    'Accept': '*/*',
    'Access-Control-Request-Method': 'GET',
    'Access-Control-Request-Headers': 'loft_user_id,loftuserid,utm_campaign,utm_content,utm_created_at,utm_id,utm_medium,utm_source,utm_term,utm_user_agent,x-user-agent,x-utm-source,x-utm-user-id',
    'Origin': 'https://www.loft.com.br',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-site',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://www.loft.com.br/',
    'Accept-Language': 'en-US,en;q=0.9',
}

params = (
    ('city', 'S\xE3o Paulo'),
    ('facetFilters/[/]', 'address.city:S\xE3o Paulo'),
    ('limit', '18'),
    ('limitedColumns', 'true'),
    ('loftUserId', '417b37df-19ab-4014-a800-688c5acc039d'),
    ('offset', '28'),
    ('orderBy/[/]', 'rankB'),
    ('orderByStatus', '\'FOR_SALE\', \'JUST_LISTED\', \'DEMOLITION\', \'COMING_SOON\' , \'SOLD\''),
    ('originType', 'LISTINGS_LOAD_MORE'),
    ('q', 'pin'),
    ('status/[/]', ['FOR_SALE', 'JUST_LISTED', 'DEMOLITION', 'COMING_SOON', 'SOLD']),
)

response = requests.options('https://landscape-api.loft.com.br/listing/search', headers=headers, params=params)

However, when I try to run it, I get a 204.

So my questions are:

  1. What is the proper/best way to identify requests from this website? Are there any better alternatives to what I did?
  2. Once identified, is copy as curl the best way to replicate the command?
  3. How to best replicate the command in Python?
  4. Why am I getting a 204?

Solution

  • Your approach to finding requests is correct, but you need to pick out and analyze the right one.
    You get a 204 response code with no results because you sent an OPTIONS request instead of a GET. In Chrome DevTools you can see two similar requests: one is OPTIONS (the CORS preflight, which carries no data) and the second is a GET of type xhr.
    For this website you need the second one, but your code calls requests.options(..). To see the response of a request, select it and check the Response or Preview tab.
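
Once the GET request is selected, it can help to decode its URL and see exactly which parameters it carries. The sketch below uses only the standard library and an abbreviated version of the URL from the question; the helper name decode_cmd_curl_url is hypothetical, and the sketch assumes the command was copied in Windows cmd format, where `^` is the shell escape and `\` escapes the square brackets for curl.

```python
from urllib.parse import urlsplit, parse_qsl


def decode_cmd_curl_url(raw):
    """Hypothetical helper: split a URL copied as cURL (cmd) into endpoint and parameters."""
    # Drop the cmd.exe caret escapes and the backslashes curl uses to escape [ and ].
    cleaned = raw.replace("^", "").replace("\\", "")
    parts = urlsplit(cleaned)
    endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # parse_qsl percent-decodes names and values and keeps repeated keys in order.
    return endpoint, parse_qsl(parts.query, keep_blank_values=True)


# Abbreviated from the command in the question.
raw_url = (
    r"https://landscape-api.loft.com.br/listing/search?city=S^%^C3^%^A3o^%^20Paulo"
    r"^&limit=18^&offset=28^&status^\[^\]=FOR_SALE^&status^\[^\]=JUST_LISTED"
)

endpoint, query_params = decode_cmd_curl_url(raw_url)
print(endpoint)
for name, value in query_params:
    print(f"{name} = {value}")
```

Decoded this way, the array-style parameters show up under names ending in `[]` (for example `status[]`), repeated once per value.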

    One of the best HTTP libraries in Python is requests.
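
To see how a parameter dictionary turns into a query string, here is a minimal sketch using the standard library's urlencode. It is an illustration only: the parameter set is shortened, and requests is expected to expand list values into repeated keys in a similar way when given params.

```python
from urllib.parse import urlencode, parse_qsl

# Shortened, illustrative parameter set; a list value stands for a repeated key.
params = {
    "city": "São Paulo",
    "limit": 18,
    "offset": 0,
    "status[]": ["FOR_SALE", "JUST_LISTED"],
}

# doseq=True emits one key=value pair per list element instead of the list's repr.
query = urlencode(params, doseq=True)
print(query)

# Decoding the string again shows the server-side view of the parameters.
decoded = parse_qsl(query)
print(decoded)
```

The brackets and the non-ASCII character are percent-encoded on the wire, so there is no need to escape them by hand in the dictionary keys.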

    And here's complete code to get all search results:

    import requests
    
    headers = {
        'x-user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_0) AppleWebKit/537.36 (KHTML, like Gecko) '
                        'Chrome/88.0.4324.146 Safari/537.36',
        'utm_created_at': '',
        'Accept': 'application/json, text/plain, */*',
    }
    
    with requests.Session() as s:
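        # Send the headers above with every request made through this session.
        s.headers = headers
    
        listings = []  # accumulates results across pages
        limit = 18     # page size, same as in the site's own request
        offset = 0     # index of the first listing on the next page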
        while True:
            # Array-style parameters are named with a trailing "[]", as in the
            # copied curl URL; a list value is sent as one key=value pair per item.
            params = {
                "city": "São Paulo",
                "facetFilters[]": "address.city:São Paulo",
                "limit": limit,
                "limitedColumns": "true",
                # "loftUserId": "a2531ad4-cc3f-49b0-8828-e78fb489def8",
                "offset": offset,
                "orderBy[]": "rankA",
                "orderByStatus": "'FOR_SALE', 'JUST_LISTED', 'DEMOLITION', 'COMING_SOON' , 'SOLD'",
                "originType": "LISTINGS_LOAD_MORE",
                "q": "pin",
                "status[]": ["FOR_SALE", "JUST_LISTED", "DEMOLITION", "COMING_SOON", "SOLD"]
            }
            r = s.get('https://landscape-api.loft.com.br/listing/search', params=params)
            r.raise_for_status()
    
            data = r.json()
            listings.extend(data["listings"])
    
            # Move to the next page; stop on an empty page or once everything is collected.
            offset += limit
            total = data["pagination"]["total"]
            if not data["listings"] or len(listings) >= total:
                break
    
    print(len(listings))
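
The offset/limit loop above can be tried without touching the network. The following is a minimal sketch in which fetch_all and fake_fetch_page are hypothetical names, and the fake page function only mimics the "listings" and "pagination" keys the code above reads from the response; the real API's response may contain more than that.

```python
def fetch_all(fetch_page, limit=18):
    """Collect every item from an offset/limit paginated source."""
    listings = []
    offset = 0
    while True:
        data = fetch_page(offset, limit)
        page = data["listings"]
        listings.extend(page)
        offset += limit
        # Stop on an empty page or once the reported total has been reached.
        if not page or len(listings) >= data["pagination"]["total"]:
            break
    return listings


def fake_fetch_page(offset, limit):
    """Stand-in for the HTTP call: serves slices of 40 made-up items."""
    items = [{"id": i} for i in range(40)]
    return {
        "listings": items[offset:offset + limit],
        "pagination": {"total": len(items)},
    }


result = fetch_all(fake_fetch_page)
print(len(result))  # 40
```

Passing the page-fetching function in as an argument keeps the loop logic separate from the HTTP details, so the same loop could later be given a function that wraps s.get.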