I am having some trouble scraping data from the following website:
https://www.loft.com.br/apartamentos/sao-paulo-sp?q=pin
When the page loads, it shows the first ~30 real estate listings in the city of Sao Paulo. If we scroll down, it loads more listings.
Usually I would use Selenium to get around this, but I want to learn how to do it properly, which I imagine means working with the underlying requests.
Using Chrome's Inspect tool and watching what happens when we scroll down, I can see a request being made, which I presume is what retrieves the new listings.
If I copy its content as curl, I get the following command:
curl "https://landscape-api.loft.com.br/listing/search?city=S^%^C3^%^A3o^%^20Paulo^&facetFilters^\[^\]=address.city^%^3AS^%^C3^%^A3o^%^20Paulo^&limit=18^&limitedColumns=true^&loftUserId=417b37df-19ab-4014-a800-688c5acc039d^&offset=28^&orderBy^\[^\]=rankB^&orderByStatus=^%^27FOR_SALE^%^27^%^2C^%^20^%^27JUST_LISTED^%^27^%^2C^%^20^%^27DEMOLITION^%^27^%^2C^%^20^%^27COMING_SOON^%^27^%^20^%^2C^%^20^%^27SOLD^%^27^&originType=LISTINGS_LOAD_MORE^&q=pin^&status^\[^\]=FOR_SALE^&status^\[^\]=JUST_LISTED^&status^\[^\]=DEMOLITION^&status^\[^\]=COMING_SOON^&status^\[^\]=SOLD" ^
-X "OPTIONS" ^
-H "Connection: keep-alive" ^
-H "Accept: */*" ^
-H "Access-Control-Request-Method: GET" ^
-H "Access-Control-Request-Headers: loft_user_id,loftuserid,utm_campaign,utm_content,utm_created_at,utm_id,utm_medium,utm_source,utm_term,utm_user_agent,x-user-agent,x-utm-source,x-utm-user-id" ^
-H "Origin: https://www.loft.com.br" ^
-H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36" ^
-H "Sec-Fetch-Mode: cors" ^
-H "Sec-Fetch-Site: same-site" ^
-H "Sec-Fetch-Dest: empty" ^
-H "Referer: https://www.loft.com.br/" ^
-H "Accept-Language: en-US,en;q=0.9" ^
--compressed
I am unsure of the proper way to convert this into a call for Python's requests module, so I used this website - https://curl.trillworks.com/ - to do it.
The result is:
import requests
headers = {
    'Connection': 'keep-alive',
    'Accept': '*/*',
    'Access-Control-Request-Method': 'GET',
    'Access-Control-Request-Headers': 'loft_user_id,loftuserid,utm_campaign,utm_content,utm_created_at,utm_id,utm_medium,utm_source,utm_term,utm_user_agent,x-user-agent,x-utm-source,x-utm-user-id',
    'Origin': 'https://www.loft.com.br',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-site',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://www.loft.com.br/',
    'Accept-Language': 'en-US,en;q=0.9',
}
params = (
    ('city', 'S\xE3o Paulo'),
    ('facetFilters/[/]', 'address.city:S\xE3o Paulo'),
    ('limit', '18'),
    ('limitedColumns', 'true'),
    ('loftUserId', '417b37df-19ab-4014-a800-688c5acc039d'),
    ('offset', '28'),
    ('orderBy/[/]', 'rankB'),
    ('orderByStatus', '\'FOR_SALE\', \'JUST_LISTED\', \'DEMOLITION\', \'COMING_SOON\' , \'SOLD\''),
    ('originType', 'LISTINGS_LOAD_MORE'),
    ('q', 'pin'),
    ('status/[/]', ['FOR_SALE', 'JUST_LISTED', 'DEMOLITION', 'COMING_SOON', 'SOLD']),
)
response = requests.options('https://landscape-api.loft.com.br/listing/search', headers=headers, params=params)
However, when I try to run it, I get a 204.
So my questions are:
Your approach to finding the requests is correct, but you need to pick out and analyze the right one.
As for why you get a 204 response code with no results: you are sending an OPTIONS request instead of a GET. A 204 means "No Content", so the server answers the OPTIONS request with an empty body. In Chrome DevTools you can see two similar requests (see the attached picture). One is the OPTIONS request and the second one is a GET with type xhr. For this website you need the second one, but your code uses OPTIONS through requests.options(..).
To see the response of a request, select it and check the Response or Preview tab.
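As a rough illustration of how to tell the two requests apart, the sketch below checks for the markers of a CORS preflight: the OPTIONS method together with the Origin and Access-Control-Request-Method headers, all of which appear in the copied curl command. The helper name is_cors_preflight is hypothetical and not part of any library.

```python
def is_cors_preflight(method, headers):
    """Return True if the request looks like a CORS preflight."""
    # Header names are case-insensitive, so compare them in lowercase.
    names = {name.lower() for name in headers}
    return (
        method.upper() == "OPTIONS"
        and "origin" in names
        and "access-control-request-method" in names
    )


# Headers taken from the curl command copied in the question.
captured_headers = {
    "Origin": "https://www.loft.com.br",
    "Access-Control-Request-Method": "GET",
}

print(is_cors_preflight("OPTIONS", captured_headers))  # True
print(is_cors_preflight("GET", {"Accept": "application/json, text/plain, */*"}))  # False
```

If the request you copied from DevTools matches this pattern, it is the preflight, and the GET listed next to it is the one that carries the data.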
One of the best HTTP libraries in Python is requests.
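Before handing parameters to requests, it can help to decode the captured query string yourself instead of relying on a converter. The sketch below uses the standard library's urllib.parse on a shortened copy of the query from the curl command. It assumes that every ^ and \ in the copied text is Windows cmd or curl escaping rather than part of the URL; the names raw_query and captured_params are illustrative only.

```python
from urllib.parse import parse_qs

# Shortened query string as copied from DevTools ("Copy as cURL (cmd)").
raw_query = (
    r"city=S^%^C3^%^A3o^%^20Paulo"
    r"^&facetFilters^\[^\]=address.city^%^3AS^%^C3^%^A3o^%^20Paulo"
    r"^&limit=18^&offset=28"
    r"^&status^\[^\]=FOR_SALE^&status^\[^\]=JUST_LISTED"
)

# Assumption: carets and backslashes are only shell escaping, so drop them.
clean_query = raw_query.replace("^", "").replace("\\", "")

# parse_qs percent-decodes values and groups repeated keys into lists.
captured_params = parse_qs(clean_query)

for name, values in captured_params.items():
    print(name, values)
```

Decoded this way, the repeated parameters show up under names ending in [] (for example status[]), which is the form the browser actually sends.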
And here's the complete code to get all search results:
import requests

headers = {
    'x-user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_0) AppleWebKit/537.36 (KHTML, like Gecko) '
                    'Chrome/88.0.4324.146 Safari/537.36',
    'utm_created_at': '',
    'Accept': 'application/json, text/plain, */*',
}

with requests.Session() as s:
    s.headers = headers
    listings = []
    limit = 18   # page size used by the site's own "load more" request
    offset = 0
    while True:
        # Array parameters use the "[]" suffix, as in the URL captured in DevTools.
        params = {
            "city": "São Paulo",
            "facetFilters[]": "address.city:São Paulo",
            "limit": limit,
            "limitedColumns": "true",
            # "loftUserId": "a2531ad4-cc3f-49b0-8828-e78fb489def8",
            "offset": offset,
            "orderBy[]": "rankA",
            "orderByStatus": "'FOR_SALE', 'JUST_LISTED', 'DEMOLITION', 'COMING_SOON' , 'SOLD'",
            "originType": "LISTINGS_LOAD_MORE",
            "q": "pin",
            "status[]": ["FOR_SALE", "JUST_LISTED", "DEMOLITION", "COMING_SOON", "SOLD"]
        }
        # GET (not OPTIONS) is the request that returns the listings as JSON.
        r = s.get('https://landscape-api.loft.com.br/listing/search', params=params)
        r.raise_for_status()
        data = r.json()
        listings.extend(data["listings"])
        offset += limit
        total = data["pagination"]["total"]
        # Stop on an empty page or once every listing has been collected.
        if len(data["listings"]) == 0 or len(listings) >= total:
            break

print(len(listings))
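The paging logic above can also be separated from the network call, which makes it easy to check without hitting the API. This is only a sketch: collect_listings and fake_fetch_page are hypothetical names, and the fake page function merely imitates the "listings" and "pagination" keys used in the code above.

```python
def collect_listings(fetch_page, limit=18):
    """Request pages until one comes back empty or the total is reached."""
    listings = []
    offset = 0
    while True:
        data = fetch_page(offset, limit)
        page = data["listings"]
        listings.extend(page)
        offset += limit
        if not page or len(listings) >= data["pagination"]["total"]:
            return listings


def fake_fetch_page(offset, limit):
    """Stand-in for the real GET request: serves slices of 40 dummy items."""
    items = list(range(40))
    return {
        "listings": items[offset:offset + limit],
        "pagination": {"total": len(items)},
    }


print(len(collect_listings(fake_fetch_page)))  # 40
```

In real use, fetch_page would wrap the s.get(...) call and return r.json(), with offset and limit placed into the params dictionary.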