Tags: python, selenium, beautifulsoup, getattribute

How can I scrape an attribute value as a string, rather than as individual letters, using get_attribute?


I'm using Selenium (and possibly BS4) to scrape different parts of the match results pages (e.g. https://cuetracker.net/tournaments/gibraltar-open/2020/3542) for tournaments from the last 4/5 years, whose links I have already scraped.

I am trying to write robust code that scrapes various pieces of data from these match results. As a first step I used a partial XPath to scrape the nationality of each winning player (left-hand side), but when I read the attribute value I get a list of individual letters instead of each nationality as a string.

I'm thinking BS4 may be more suitable for this, as the format of the HTML can change when referee data is added in some tournaments, but a partial XPath seems okay from what little I know.

How can I get get_attribute to give me the values as strings and not individual letters?

Would it be easier to complete this scraping with BS4 as opposed to Selenium?

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.select import Select
from bs4 import BeautifulSoup

import os
import re
import time
import pandas as pd


def wait_for_page_load():
    timer = 15
    start_time = time.time()
    page_state = None
    while page_state != 'complete':
        time.sleep(0.5)
        page_state = browser.execute_script('return document.readyState;')
        if time.time() - start_time > timer:
            raise Exception('Timeout :(')


chrome_path = r"C:\Users\George\Desktop\chromedriver.exe"
browser = webdriver.Chrome(chrome_path)
page_source = browser.page_source

browser.get("https://cuetracker.net/seasons")
links = browser.find_elements_by_css_selector("table.table.table-striped a")
hrefs=[]
for link in links:
    hrefs.append(link.get_attribute("href"))

hrefs = hrefs[1:5]

hrefs2 = []

for href in hrefs:
    browser.get(href)
    wait_for_page_load()
    # find_elements (plural) returns a list that can be iterated over
    links2 = browser.find_elements_by_xpath('.//tr/td[2]/a')
    for link in links2:
        hrefs2.append(link.get_attribute("href"))

Player_1_Nationality = []

for href in hrefs2:
    browser.get(href)
    wait_for_page_load()
    list_1_Nationality = browser.find_elements_by_xpath('.//div/div[2]/div[1]/b/img').get_attribute("alt")
    for lis in list_1_Nationality:
        Player_1_Nationality.append(lis)




['E',
 'n',
 'g',
 'l',
 'a',
 'n',
 'd',
 'E',
 'n',
 'g',
 'l',
 'a',
 'n',
 'd',
 'E',
 'n',
 'g',
 'l',
 'a',
 'n',
 'd',
 'E',
 'n',
 'g',
 'l',
 'a',
 'n',
 'd',
 'A',
 'u',
 's',
 't',
 'r',
 'a',
 'l',
 'i',
 'a',
 'E',
 'n',
 'g',
 'l',
 'a',
...


Solution

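  • find_elements_by_xpath() returns a list of elements, so get_attribute() has to be called on each element while iterating, not on the result of the lookup. Looping over the string that get_attribute("alt") returns yields one character at a time, which produces output like the list of letters above. Inside the loop, use lis.get_attribute("alt"):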

    for href in hrefs2:
        browser.get(href)
        wait_for_page_load()
        list_1_Nationality = browser.find_elements_by_xpath('.//div/div[2]/div[1]/b/img')
        for lis in list_1_Nationality:
            Player_1_Nationality.append(lis.get_attribute("alt"))
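
To see where the letters come from without opening a browser, here is a minimal sketch. `FakeElement` is a made-up stand-in for a Selenium WebElement (it is not part of any library) and only mimics the fact that `get_attribute` returns a plain string; the nationality values are sample data.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement (illustration only)."""

    def __init__(self, alt):
        self._attrs = {"alt": alt}

    def get_attribute(self, name):
        # Like WebElement.get_attribute, this returns a plain string.
        return self._attrs.get(name)


# Pretend this list came back from find_elements_by_xpath(...)
elements = [FakeElement("England"), FakeElement("Australia")]

# Bug: get_attribute() on ONE element returns a string, and looping
# over a string yields one character per iteration.
letters = []
for ch in elements[0].get_attribute("alt"):
    letters.append(ch)

# Fix: loop over the list of elements and read the attribute from each one.
nationalities = [el.get_attribute("alt") for el in elements]

print(letters)        # ['E', 'n', 'g', 'l', 'a', 'n', 'd']
print(nationalities)  # ['England', 'Australia']
```

Note that in Selenium 4.3 and later the find_elements_by_xpath() helpers no longer exist; the equivalent call is browser.find_elements(By.XPATH, '...'), using the By import already present in the question's code.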