I have this code and I want to iterate over "list_of_urls", but I don't know how to use it for the "url" argument in start_requests. Is there a way to pass this list in and iterate over the pageNumber values?
import scrapy
import json

list_of_urls = []
for i in range(1,3):
    url = 'https://api.yaencontre.com/v3/search?family=FLAT&lang=es&location=albacete-provincia&operation=RENT&pageNumber={}&pageSize=42'.format(i)
    to_append = [url]
    for j in to_append:
        list_of_urls.append(j)
print(list_of_urls)

class TestSpider(scrapy.Spider):
    name = "test"
    headers = {
        'USER_AGENT' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    }

    def start_requests(self):
        yield scrapy.Request(
            url = 'https://api.yaencontre.com/v3/search?family=FLAT&lang=es&location=albacete-provincia&operation=RENT&pageNumber=7&pageSize=42',
            callback= self.parse,
            method= "GET",
            headers= self.headers
        )

    def parse(self, response):
        pass
        json_response = json.loads(response.text)
        res = json_response["result"]["items"]
        for item in res:
            yield {
                'lat': item['realEstate']['address']['geoLocation']['lat'],
                'lon': item['realEstate']['address']['geoLocation']['lon'],
                'price': item['realEstate']['price']
            }
Yes, there are many ways to do this.
One way would be to use a for loop and iterate over the list_of_urls variable inside your start_requests method.
Example:
...
list_of_urls = []
for i in range(1,3):
    url = 'https://api.yaencontre.com/v3/search?family=FLAT&lang=es&location=albacete-provincia&operation=RENT&pageNumber={}&pageSize=42'.format(i)
    list_of_urls.append(url)
print(list_of_urls)
...

...
    def start_requests(self):
        for url in list_of_urls:
            yield scrapy.Request(
                url = url,
                callback= self.parse,
                method= "GET",
                headers= self.headers)
Another would be to simply move your list_of_urls code inside of the start_requests method:
    def start_requests(self):
        for i in range(1,3):
            url = 'https://api.yaencontre.com/v3/search?family=FLAT&lang=es&location=albacete-provincia&operation=RENT&pageNumber={}&pageSize=42'.format(i)
            yield scrapy.Request(url=url, headers=self.headers)
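If the query string gets harder to read as a single format string, the page URLs could also be assembled from a dict of parameters. This is only a sketch using the standard library: build_page_urls and BASE_URL are hypothetical names that do not appear in the question, and the parameter values are copied from the URL above.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL in the question.
BASE_URL = "https://api.yaencontre.com/v3/search"


def build_page_urls(first_page, last_page, page_size=42):
    """Return one search URL per page, first_page..last_page inclusive.

    Hypothetical helper: the name and signature are illustrative only.
    """
    urls = []
    for page in range(first_page, last_page + 1):
        params = {
            "family": "FLAT",
            "lang": "es",
            "location": "albacete-provincia",
            "operation": "RENT",
            "pageNumber": page,
            "pageSize": page_size,
        }
        # urlencode keeps the dict's insertion order and joins pairs with "&".
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls
```

Calling build_page_urls(1, 2) covers the same pages as range(1,3) in the examples above, and the resulting list can be looped over inside start_requests in the same way as list_of_urls.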
Some additional tips:
You can use custom_settings to set the USER_AGENT setting instead of passing it in the headers of every request. Note also that, as an HTTP header, the name would be User-Agent rather than USER_AGENT.
As you can see in my first example, you were unnecessarily adding the url to a list and then iterating over that list to append it to list_of_urls, when you could have appended the url to list_of_urls directly.
The "GET" method is the default for scrapy requests, so there is no need to set it explicitly. The same is true for the callback: when none is given, scrapy uses self.parse by default.
In your parse method you can use response.json() (available since Scrapy 2.2) instead of json_response = json.loads(response.text).
Using all of the above, your code could look something like this:
import scrapy

class TestSpider(scrapy.Spider):
    name = "test"
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
    }

    def start_requests(self):
        for i in range(1, 3):
            yield scrapy.Request('https://api.yaencontre.com/v3/search?family=FLAT&lang=es&location=albacete-provincia&operation=RENT&pageNumber={}&pageSize=42'.format(i))

    def parse(self, response):
        for item in response.json()["result"]["items"]:
            yield {
                'lat': item['realEstate']['address']['geoLocation']['lat'],
                'lon': item['realEstate']['address']['geoLocation']['lon'],
                'price': item['realEstate']['price']
            }
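If you want to check the extraction logic without hitting the API, the body of parse could be pulled out into a plain function that works on an already decoded dict. This is a sketch under assumptions: extract_listings is a hypothetical name, and the sample payload is invented to mirror the key path used in parse, so it is not real API output.

```python
def extract_listings(payload):
    """Yield lat/lon/price dicts from a decoded search response.

    Hypothetical helper: it follows the same key path as parse above.
    """
    for item in payload["result"]["items"]:
        real_estate = item["realEstate"]
        geo = real_estate["address"]["geoLocation"]
        yield {
            "lat": geo["lat"],
            "lon": geo["lon"],
            "price": real_estate["price"],
        }


# Invented payload for illustration only; the values are placeholders.
sample = {
    "result": {
        "items": [
            {
                "realEstate": {
                    "address": {"geoLocation": {"lat": 1.0, "lon": 2.0}},
                    "price": 500,
                }
            }
        ]
    }
}

listings = list(extract_listings(sample))
print(listings)
```

Inside the spider, parse would then reduce to yield from extract_listings(response.json()).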