I have a fairly basic Python script that scrapes a property website and stores the address and price in a CSV file. There are over 5000 listings to go through, but my current code times out after a while (at about 2000 listings) and the console shows 302 and CORS policy errors.
import requests
import itertools
from bs4 import BeautifulSoup
from csv import writer
from random import randint
from time import sleep
from datetime import date

url = "https://www.propertypal.com/property-for-sale/northern-ireland/page-"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}
filename = date.today().strftime("ni-listings-%Y-%m-%d.csv")

with open(filename, 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Address', 'Price']
    thewriter.writerow(header)
    # for page in range(1, 3):
    for page in itertools.count(1):
        req = requests.get(f"{url}{page}", headers=headers)
        soup = BeautifulSoup(req.content, 'html.parser')
        for li in soup.find_all('li', class_="pp-property-box"):
            title = li.find('h2').text
            price = li.find('p', class_="pp-property-price").text
            info = [title, price]
            thewriter.writerow(info)
        sleep(randint(1, 5))
# this script scrapes all pages and records all listings and their prices in daily csv
As you can see, I added sleep(randint(1, 5)) to add random intervals, but I possibly need to do more. Of course I want to scrape the site in its entirety as quickly as possible, but I also want to be respectful to the site being scraped and minimise the burden on it.
Can anyone suggest updates? PS: forgive any rookie errors, I'm very new to Python/scraping!
This is one way of getting that data. Bear in mind there are only 251 pages, with 12 properties on each (3,012 listings in total), not over 5k:
import requests
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup as bs

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'accept': 'application/json',
    'accept-language': 'en-US,en;q=0.9',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin'
}

# Reuse one session so the headers and the connection are shared across requests
s = requests.Session()
s.headers.update(headers)
big_list = []
for x in tqdm(range(1, 252)):
    soup = bs(s.get(f'https://www.propertypal.com/property-for-sale/northern-ireland/page-{x}').text, 'html.parser')
    # print(soup)
    properties = soup.select('li.pp-property-box')
    for p in properties:
        # Fall back to None when an element is missing from a listing
        name = p.select_one('h2').get_text(strip=True) if p.select_one('h2') else None
        url = 'https://www.propertypal.com' + p.select_one('a').get('href') if p.select_one('a') else None
        price = p.select_one('p.pp-property-price').get_text(strip=True) if p.select_one('p.pp-property-price') else None
        big_list.append((name, price, url))
big_df = pd.DataFrame(big_list, columns = ['Property', 'Price', 'Url'])
print(big_df)
Result printed in terminal:
100% 251/251 [03:41<00:00, 1.38it/s]
Property Price Url
0 22 Erinvale Gardens, Belfast, BT10 0FS Asking price£165,000 https://www.propertypal.com/22-erinvale-gardens-belfast/777820
1 Laurel Hill, 37 Station Road, Saintfield, BT24 7DZ Guide price£725,000 https://www.propertypal.com/laurel-hill-37-station-road-saintfield/751274
2 19 Carrick Brae, Burren Warrenpoint, Newry, BT34 3TH Guide price£265,000 https://www.propertypal.com/19-carrick-brae-burren-warrenpoint-newry/775302
3 7b Conway Street, Lisburn, BT27 4AD Offers around£299,950 https://www.propertypal.com/7b-conway-street-lisburn/779833
4 Hartley Hall, Greenisland From£280,000to£397,500 https://www.propertypal.com/hartley-hall-greenisland/d850
... ... ... ...
3007 8 Shimna Close, Newtownards, BT23 4PE Offers around£99,950 https://www.propertypal.com/8-shimna-close-newtownards/756825
3008 7 Barronstown Road, Dromore, BT25 1NT Guide price£380,000 https://www.propertypal.com/7-barronstown-road-dromore/756539
3009 39 Tamlough Road, Randalstown, BT41 3DP Offers around£425,000 https://www.propertypal.com/39-tamlough-road-randalstown/753299
3010 Glengeen House, 17 Carnalea Road, Fintona, BT78 2BY Offers over£180,000 https://www.propertypal.com/glengeen-house-17-carnalea-road-fintona/750105
3011 Walnut Road, Larne, BT40 2WE Offers around£169,950 https://www.propertypal.com/walnut-road-larne/749733
3012 rows × 3 columns
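Note that the Price column comes out with the label and the amount run together (for example 'Asking price£165,000' or 'From£280,000to£397,500'), because get_text(strip=True) joins the nested elements. If you want numbers you can work with, something like the following might help. It is only a minimal sketch: split_price is a hypothetical helper name of my own, and it assumes every amount is written as a '£' followed by digits and commas, as in the output above.

```python
import re

# Matches amounts such as "£165,000" or "£99,950" (assumed price format).
_AMOUNT = re.compile(r"£([\d,]+)")


def split_price(text):
    """Split a scraped price string into (label, list of integer amounts).

    Hypothetical helper: text with no '£' figure yields an empty amounts list,
    and a missing price (None) yields (None, []).
    """
    if not text:
        return None, []
    # Every "£..." figure, with thousands separators removed
    amounts = [int(m.replace(",", "")) for m in _AMOUNT.findall(text)]
    # Everything before the first "£" is treated as the label
    label = text.split("£", 1)[0].strip() or None
    return label, amounts


print(split_price("Asking price£165,000"))
print(split_price("From£280,000to£397,500"))
```

On the dataframe above this could be applied with big_df['Price'].apply(split_price), after which you could keep the first amount as the price or handle the "From ... to ..." ranges separately.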
See relevant documentation for Requests: https://requests.readthedocs.io/en/latest/
For Pandas: https://pandas.pydata.org/docs/
For BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/
And for TQDM: https://pypi.org/project/tqdm/
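One caveat: range(1, 252) hard-codes today's page count, and the loop above does not pause between requests. If you would rather stop automatically at the last page and keep a polite delay, a loop along these lines may be a starting point. This is a rough sketch, not a drop-in replacement: crawl_until_empty, fetch_page and parse_page are hypothetical names of my own, and the 1 to 3 second delay is an arbitrary assumption.

```python
import random
import time


def crawl_until_empty(fetch_page, parse_page, max_pages=1000,
                      delay_range=(1.0, 3.0), sleep=time.sleep):
    """Collect rows page by page, stopping at the first page with no rows.

    fetch_page(n) should return the HTML of page n, or None if the request
    failed or was redirected. parse_page(html) should return a list of rows.
    Both are hypothetical callables supplied by the caller.
    """
    rows = []
    for page in range(1, max_pages + 1):
        html = fetch_page(page)
        if html is None:
            break  # failed or redirected: stop rather than keep hitting the site
        page_rows = parse_page(html)
        if not page_rows:
            break  # no listings found, so we have run past the last page
        rows.extend(page_rows)
        sleep(random.uniform(*delay_range))  # polite pause between requests
    return rows
```

With the session s from the code above, fetch_page could wrap s.get(..., timeout=10, allow_redirects=False) and return response.text only when response.status_code is 200, so that a 302 ends the crawl instead of being followed silently. parse_page would be the soup.select('li.pp-property-box') part, returning the (name, price, url) tuples.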