
Checking Websites for Updates (Web Automation with Python + Selenium)


I am trying to write a simple script that does the following:

  1. Runs automatically every 6 hrs
  2. Checks real estate website for new listings
  3. Email new listings details if any found, else terminate script until next run

I plan on using crontab to execute (1). Additionally, this is the script I have come up with so far for one specific website:

from selenium import webdriver
import smtplib
import sys

driver = webdriver.Firefox()

#Capital Pacific Website
#Commercial Real Estate

#open text file containing property titles we already know about
properties = open("properties.txt", "r+")
currentList = []
for line in properties:
    currentList.append(line)

#to search for new listings
driver.get("http://cp.capitalpacific.com/Properties")

assert "Capital" in driver.title

#holds any new listings
newProperties = []

#find all listings on page by Property Name
newList = driver.find_elements_by_class_name('overview')

#find elements in pageList not in oldList & add to newList
#add new elements to 
for x in currentList:
    for y in newList:
        if y != x:
            newProperties.append(y)
            properties.write(y)

properties.close()
driver.close()

#if no new properties found, terminate script
#else, email properties
if not newProperties:
    sys.exit()
else: 
    fromaddr = 'someone@gmail.com'
    toaddrs = ['someoneelse@yahoo.com']
    server = smtplib.SMTP('smtp.gmail.com:587')
    server.starttls()

    for item in newProperties:
        msg = item
        server.sendmail(fromaddr, toaddrs, msg)

    server.quit()

The questions I have so far (please bear with me here, as I'm a Python novice):

I'm using a list to store the web elements returned by Selenium's "find by class" method. Is there a better way to read from and write to the text file to ensure I am only getting the newly added properties?

If the script does find a class property that is present on the website but not in the newList, is there a way I can go through that div only in order to get the details about the listing?

Any suggestions/recommendations please! Thank you.


Solution

  • What if you switched to the JSON format, with the listings stored as dictionaries:

    [
        {
            "location": "OREGON CITY, OR",
            "price": 33000000,
            "status": "active",
            "marketing_package_url": "http://www.capitalpacific.com/inquiry/TrailsEndMarketplaceExecSummary.pdf",
            ...
        },
        ...
    ]
    

    You would need something unique about every property in order to identify new listings. You can, for example, use the marketing package URL for this; it looks unique to me.
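
To make the "unique key" idea concrete, here is a minimal sketch of how known listings could be kept in a JSON file and compared against a freshly scraped list. It assumes each listing is a dictionary with a `marketing_package_url` key, as in the JSON above; the helper names `load_known`, `find_new` and `save_known` and the file name in the usage note are hypothetical, not part of any library.

```python
import json
from pathlib import Path


def load_known(path):
    """Return previously saved listings, or an empty list on the first run."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def find_new(scraped, known, key="marketing_package_url"):
    """Return the listings in `scraped` whose key value is not in `known`."""
    seen = {item[key] for item in known}
    new = []
    for item in scraped:
        if item[key] not in seen:
            seen.add(item[key])  # also skips duplicates within the same page
            new.append(item)
    return new


def save_known(path, listings):
    """Overwrite the JSON file with the full list of known listings."""
    with Path(path).open("w", encoding="utf-8") as fh:
        json.dump(listings, fh, indent=2)
```

A run would then look roughly like `known = load_known("listings.json")`, `new = find_new(scraped, known)`, `save_known("listings.json", known + new)`, where `scraped` is the list of dictionaries collected from the page. Comparing on a set of keys avoids the nested loop in the question, which appends a listing every time it differs from any single stored line.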

    Here is example code to get the list of listings from a page:

    # Older Selenium API, as in the question; Selenium 4 replaces these calls
    # with find_element(By.CSS_SELECTOR, ...) / find_elements(...).
    properties = []
    for listing in driver.find_elements_by_css_selector('table.property div.property'):
        title = listing.find_element_by_css_selector('div.title h2')
        location = listing.find_element_by_css_selector('div.title h4')
        marketing_package = listing.find_element_by_partial_link_text('Marketing Package')

        properties.append({
            'title': title.text,
            'location': location.text,
            # the Python bindings use get_attribute(), not getAttribute()
            'marketing_package_url': marketing_package.get_attribute('href')
        })
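
For the email step in the question, one possible approach is to send a single summary message rather than one message per listing. The sketch below only builds the message. It assumes each new listing is a dictionary with `title`, `location` and `marketing_package_url` keys, and `build_email` is a hypothetical helper name. Note that the body must be text: passing a Selenium web element as the message, as the original script does, would not work.

```python
from email.message import EmailMessage


def build_email(new_listings, fromaddr, toaddrs):
    """Build one plain-text message summarizing all new listings."""
    lines = []
    for item in new_listings:
        lines.append("{title} ({location})".format(**item))
        lines.append("  " + item["marketing_package_url"])

    msg = EmailMessage()
    msg["Subject"] = "{} new listing(s) found".format(len(new_listings))
    msg["From"] = fromaddr
    msg["To"] = ", ".join(toaddrs)
    msg.set_content("\n".join(lines))
    return msg


# Sending is left commented out; host, port and password are placeholders.
# import smtplib
# with smtplib.SMTP("smtp.gmail.com", 587) as server:
#     server.starttls()
#     server.login(fromaddr, password)
#     server.send_message(msg)
```

The original script never calls `server.login()`, which most providers will likely require before accepting mail, so the commented-out part includes it as an assumption.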