python · parsing · web-scraping · beautifulsoup · python-requests

Problem extracting text from h1 tag with Beautiful Soup


I'm a complete newbie to parsing websites, but I've had a script that pulls figures from different housing sites and it worked flawlessly for the past year. However, for a reason I can't figure out, it no longer works on daft.ie. I've tried to debug, but nothing I try seems to work. I either get 'list index out of range' or 'None', which I know indicates the list is empty, but it clearly isn't. Below is a snippet of the problematic code.

Would appreciate someone with more knowledge than me having a look, as I'm sure it's going to be something that should be obvious.

Appreciate all the assistance from the site.

import sys
import requests
from bs4 import BeautifulSoup

def get_buy_numbers_dublin_city():
    page = requests.get("https://www.daft.ie/property-for-sale/dublin-city")
    soup = BeautifulSoup(page.content, 'html.parser')

    prop_num = str(soup.find_all(class_="styles__SearchH1-sc-1t5gb6v-3 guZHZl")[0])
    prop_num = prop_num.replace('<h1 class="styles__SearchH1-sc-1t5gb6v-3 guZHZl" data-testid="search-h1">', '')
    prop_num = prop_num.replace(' Properties for Sale in Dublin City</h1>', '')
    prop_num = prop_num.replace(',', '')
    return prop_num

def main(argv):

    print(get_buy_numbers_dublin_city())

if __name__ == "__main__":
    main(sys.argv[1:])

Solution

  • One issue is that this site is protecting its content, so you should always take a closer look at the response text or the soup: in this case, none of the content you would expect is in the HTML.

    You could add a user-agent header to avoid this behavior for a while, or use Selenium and the like to mimic a browser. Be aware that if other scraping behavior of yours is detected, the server may block you again.

    Example
    import requests
    from bs4 import BeautifulSoup
    
    page = requests.get("https://www.daft.ie/property-for-sale/dublin-city", headers={'user-agent':'some-agent'})
    soup = BeautifulSoup(page.content, 'html.parser')
    
    print(soup.h1.text.split()[0])
    

    Will give you:

    2,544
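If you want the count as a clean integer rather than the raw "2,544" string, a more robust approach than chained `replace` calls is to locate the heading by its `data-testid` attribute (visible in the original snippet), take the first word, and strip the thousands separator. A minimal sketch, run here against a hard-coded sample of that markup since the live page's content and class names may change:

```python
from bs4 import BeautifulSoup

# Hypothetical sample of the kind of h1 the page serves; the count and
# wording are illustrative, not taken from the live site.
html = '<h1 data-testid="search-h1">2,544 Properties for Sale in Dublin City</h1>'

soup = BeautifulSoup(html, 'html.parser')

# Look up the heading by data-testid, which is less likely to change
# than the auto-generated CSS class names like "styles__SearchH1-...".
h1 = soup.find('h1', attrs={'data-testid': 'search-h1'})

# First word of the heading is the count; drop the comma and convert.
count = int(h1.text.split()[0].replace(',', ''))
print(count)
```

The same lookup would slot into `get_buy_numbers_dublin_city()` in place of the `find_all`-by-class and string-replace chain, provided the request itself succeeds (e.g. with the user-agent header shown above).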