
Crawling data with Selenium throws TimeoutException


I am trying to crawl the reviews on a number of websites. For a single website it runs fine; however, when I loop over many websites, it raises:

TimeoutException(message, screen, stacktrace) TimeoutException

I increased the waiting time from 30 to 50 seconds, but it still fails. Here is my code:

import requests
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from datetime import datetime

start_time = datetime.now()

result = pd.DataFrame()
df = pd.read_excel(r'D:\check_bols.xlsx')
ids = df['ids'].values.tolist() 

link = "https://www.bol.com/nl/ajax/dataLayerEndpoint.html?product_id="

for i in ids:
    
    link3 = link + str(i[-17:].replace("/",""))
    op = webdriver.ChromeOptions()
    op.add_argument('--ignore-certificate-errors')
    op.add_argument('--incognito')
    op.add_argument('--headless')
    driver = webdriver.Chrome(executable_path='D:/chromedriver.exe',options=op)
    driver.get(i)
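    # these two waits raise TimeoutException when a page never shows the button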
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-test='consent-modal-confirm-btn']>span"))).click()
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()

    soup = BeautifulSoup(driver.page_source, 'lxml')

    product_attributes = requests.get(link3).json()

    reviewtitle = [r.get_text() for r in soup.find_all("strong", class_="review__title")]

    url = [i] * len(reviewtitle)

    productid = [product_attributes["dmp"]["productId"]] * len(reviewtitle)

    content = [r.get_text().strip() for r in soup.find_all("div", attrs={"class": "review__body"})]

    author = [r.get_text() for r in soup.find_all("li", attrs={"data-test": "review-author-name"})]

    date = [r.get_text() for r in soup.find_all("li", attrs={"data-test": "review-author-date"})]

    output = pd.DataFrame(list(zip(url, productid, reviewtitle, author, content, date)))

    # DataFrame.append returns a new frame, so reassign the result
    result = result.append(output)

    result.to_excel(r'D:\bols.xlsx', index=False)

    # close this iteration's browser before opening the next one
    driver.quit()
    
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

Here are some links that I tried to crawl:

link1 link2


Solution

  • As mentioned in the comments - you're timing out because you're looking for a button that does not exist.

    You need to catch the error(s) and skip those failing lines. You can do this with a try and except.

    I've put together an example for you. It's hard-coded to one url (as I don't have your data sheet), and it uses a fixed loop that deliberately keeps TRYING to click the "show more" button, even after the button is gone.

    With this solution, be careful of your sync time: EACH TIME WebDriverWait is called, it waits the full duration if the element does not exist. In real use you'll want to exit the expand loop as soon as you're done (the first time you trip the error) and keep your sync time tight - or it will be a slow script. (A tightened variant is sketched at the end of this answer.)

    First, add these to your imports:

    from selenium.common.exceptions import TimeoutException
    from selenium.common.exceptions import StaleElementReferenceException
    

    Then this will run and not error:

    # note: a hard-coded url for this example
    driver.get('https://www.bol.com/nl/p/Matras-180x200-7-zones-koudschuim-premium-plus-tijk-15-cm-hard/9200000130825457/')
    
    #accept the cookie once
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-test='consent-modal-confirm-btn']>span"))).click()
       
    for i in range(10):
        try:
            WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()
            print("I pressed load more")
        except (TimeoutException, StaleElementReferenceException):
            print("No more to load - but i didn't fail")
    

    The output to the console is this:

    DevTools listening on ws://127.0.0.1:51223/devtools/browser/4b1a0033-8294-428d-802a-d0d2127c4b6f
    I pressed load more
    I pressed load more
    No more to load - but i didn't fail
    No more to load - but i didn't fail
    No more to load - but i didn't fail
    No more to load - but i didn't fail (and so on).

    This is how my browser looks - note the size of the scroll bar for the link I used; it looks like it's got all the reviews.
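
    Finally, here is a minimal sketch of how the try/except, a break on the first timeout, and a tighter wait could slot back into the per-url loop from your question. It reuses the imports and selectors shown above; the scrape_reviews helper name and the 5- and 15-second waits are illustrative choices I have not tested against your data sheet:

    def scrape_reviews(driver, url):
        driver.get(url)
        # each incognito driver starts fresh, so the cookie modal appears once per page
        WebDriverWait(driver, 15).until(EC.element_to_be_clickable(
            (By.CSS_SELECTOR, "button[data-test='consent-modal-confirm-btn']>span"))).click()
        # keep clicking "load more" with a tight wait; the first timeout
        # (or stale element) means there is nothing left to expand
        while True:
            try:
                WebDriverWait(driver, 5).until(EC.element_to_be_clickable(
                    (By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()
            except (TimeoutException, StaleElementReferenceException):
                break
        return BeautifulSoup(driver.page_source, 'lxml')

    for i in ids:
        driver = webdriver.Chrome(executable_path='D:/chromedriver.exe', options=op)
        try:
            soup = scrape_reviews(driver, i)
            # ... build the output DataFrame from soup exactly as before ...
        finally:
            driver.quit()  # always release the browser, even when a page fails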