python, html, selenium-webdriver, web-scraping

Why do HTML elements holding different information all have the same class and sub-class?


I am trying to scrape the house type and EPC rating from the website.

After inspecting the HTML, though, I noticed that the house type (e.g. "freehold") and the EPC rating (e.g. "D") both have the same class name and CSS selector.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
import pandas as pd

# Initialize WebDriver
driver = webdriver.Chrome()

# Open URL
url = "https://www.zoopla.co.uk/house-prices/england/?new_homes=include&q=england+&orig_q=united+kingdom&view_type=list&pn=1"
driver.get(url)

# Wait for the main content to load (adjust time as needed)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "_17smgnt0"))
)

# Initialize result list to store data
result = []

# Find all house elements
houses = driver.find_elements(By.CLASS_NAME, "_1hzil3o0")

# Extract the fields for each house
for house in houses:
    try:
        item = {
            "address": house.find_element(By.XPATH, './/a/h2').text,
            "DateLast_sold": house.find_element(By.CSS_SELECTOR, "._1hzil3o9._1hzil3o8._194zg6t7").text,
            "Number of Rooms": house.find_element(By.CLASS_NAME, "_1pbf8i53").text,
            # find_element returns only the FIRST element with this class,
            # so this may not actually be the EPC rating
            "EPC Rating": house.find_element(By.CLASS_NAME, "_14bi3x30").text,
        }

        result.append(item)  # Append to the result list
    except Exception as e:
        print(f"Error extracting house details: {e}")

# Store the result into a dataframe after the loop
df = pd.DataFrame(result)

# Show the result
print(df)

# Close the driver
driver.quit()

Here is a picture of the HTML. How can I extract the freehold value and the EPC rating so that each one shows the right information?


Solution

  • They all have the same classes because they are all styled the same way, which is why they look identical on the page. I looked through the HTML as well and I don't see anything that indicates which is which. I would grab the list of button-styled info elements and loop through it, looking for known text. The first one seems to always be the style. The second is generally the sqm, which you can identify by checking whether the string contains "sqm". The final one is the EPC rating, which you can identify by checking whether the string contains "EPC rating". That should get you what you need.
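
The approach above can be sketched roughly as follows. This is a minimal illustration, not a tested scraper: the function name classify_tags, the output keys, and the sample strings are made up for the example, and it assumes the first unrecognised string is the style (house type), as suggested above.

```python
def classify_tags(texts):
    """Sort the identically-styled info strings of one house into named fields.

    Assumption: strings containing "EPC rating" or "sqm" identify themselves,
    and the first string that matches neither is the style (house type).
    """
    info = {"House Type": None, "Floor Area": None, "EPC Rating": None}
    for raw in texts:
        text = raw.strip()
        if not text:
            continue  # skip empty elements
        lowered = text.lower()
        if "epc rating" in lowered:
            info["EPC Rating"] = text
        elif "sqm" in lowered:
            info["Floor Area"] = text
        elif info["House Type"] is None:
            info["House Type"] = text
    return info


# Hypothetical sample strings, not taken from the real page
sample = ["Freehold", "85 sqm", "EPC rating: D"]
print(classify_tags(sample))
```

Inside the existing loop, the strings could be collected with something like `texts = [el.text for el in house.find_elements(By.CLASS_NAME, "_14bi3x30")]` (note find_elements, plural), assuming "_14bi3x30" is the class shared by all of those elements; the result of classify_tags(texts) can then be merged into item. Checking the text rather than the position means a missing floor area does not shift the EPC rating into the wrong column.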