Search code examples
pythonpandasdataframeseleniumwebdriverwait

Scraping tables from website using Selenium and Python


I'm new to python and was trying to scrape data from a website. So far, i can successfully scrape the data however the output is given in a linear format instead of the table its pulled from. Could someone give me some pointers on how to get the correct output?

import datetime
import pandas as pd
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_experimental_option( "prefs",{'profile.managed_default_content_settings.javascript': 2})
chrome_options.add_argument("-headless")
chrome = webdriver.Chrome('chromedriver', options = chrome_options)

today = datetime.date.today()
first = today.replace(day=1)
lastMonth = first - datetime.timedelta(days=1)

date_range = pd.date_range(start='2019-01-01', end=lastMonth, freq='MS')
date_range.strftime("%B/%Y")

date_list = [i.strftime('%B/%Y') for i in date_range]

base_url = "https://www.horseracing.net/tipsters/sky-sports-racing/alex-hammond/"

length = len(date_list)
i = 41 # set to 41 for testing purposes, I didnt want to loop all the tables until i sorted the output. 


while i < length:
    page_url = str(base_url)+""+str(date_list[i])
    chrome.get(page_url)
    EL2 = chrome.find_element(By.CLASS_NAME, 'main-table-container')
    print(EL2.text)
    i += 1

Current Output

Date
Pos
Tip
Track
Stake
Type
Odds
Result
30th
5th
Clipsham Gold
Yarmouth
1.0 WIN
Nap
9/4
LOST
29th
2nd
Cavendish
Bath
1.0 WIN
Nap
9/4
LOST
28th
2nd
King Of War
Brighton
1.0 WIN
Nap
2/1
LOST
27th
1st
The Vollan
Southwell
1.0 WIN
Nap
11/8
WON

Desired Output (all the sites appended into one dataframe, so I can export to an .xlsx)

screenshot


Solution

  • To create a DataFrame using Pandas from the This Month's Tips table within the website you need to induce WebDriverWait for the visibility_of_all_elements_located() and you can use the following Locator Strategies:

    • Using CSS_SELECTOR:

      driver.get('https://www.horseracing.net/tipsters/sky-sports-racing/alex-hammond/')
      WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div.cookie-accept-button.cookie-button-link"))).click()
      WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.close-modal"))).click()
      date = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='main-table-tiping-row data-type-nap']//section[@class='cell-row-left-section']//span[@class='tiping-row-text' and contains(., '/')]")))]
      pose = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='main-table-tiping-row data-type-nap']//section[@class='cell-row-left-section']//div[@class='tiping-row-cell position-cell']//span")))]
      tip = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='main-table-tiping-row data-type-nap']//section[@class='cell-row-left-section']//div[@class='tiping-row-cell position-cell']//following::div[1]/span")))]
      track = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='main-table-tiping-row data-type-nap']//section[@class='cell-row-right-section']//span[text()='Track']//following::span[1]")))]
      stake = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='main-table-tiping-row data-type-nap']//section[@class='cell-row-right-section']//span[text()='Stake']//following::span[1]")))]
      types = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='main-table-tiping-row data-type-nap']//section[@class='cell-row-right-section']//span[text()='Type']//following::span[1]")))]
      odd = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='main-table-tiping-row data-type-nap']//section[@class='cell-row-right-section']//span[text()='Odds']//following::span[1]")))]
      result = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='main-table-tiping-row data-type-nap']//section[@class='cell-row-right-section']//div[@class='tiping-row-cell mobile-cell-hide']/span")))]
      df = pd.DataFrame(data=list(zip(date, pose, tip, track, stake, types, odd, result)), columns=['Date', 'Pos', 'Tip', 'Track', 'Stake', 'Type', 'Odds', 'Result'])
      print(df)
      
    • Note : You have to add the following imports :

      import pandas as pd          
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
      
    • Console Output:

             Date   Pos               Tip         Track    Stake Type    Odds Result
      0  08/07/22   3rd       Documenting          York  1.0 WIN  Nap    12/1   LOST
      1  07/07/22   4th         Zechariah     Newmarket  1.0 WIN  Nap     9/4   LOST
      2  06/07/22  10th       Lord Cherry      Yarmouth  1.0 WIN  Nap     9/2   LOST
      3  05/07/22   3rd    Landen Calling     Uttoxeter  1.0 WIN  Nap    13/8   LOST
      4  04/07/22    PU  De Barley Basket     Worcester  1.0 WIN  Nap     4/1   LOST
      5  03/07/22   2nd     Caius Marcius  Market Rasen  1.0 WIN  Nap     9/2   LOST
      6  02/07/22   6th          Chairman       Sandown  1.0 WIN  Nap  100/30   LOST
      7  01/07/22   3rd      Rum Cocktail     Doncaster  1.0 WIN  Nap    11/4   LOST