python selenium url tabs webdriver

Selenium web scraping: how to prioritize a tab over another


Project: saving all the URLs/titles from https://theuselessweb.com/

Test code (visits only 3 pages and prints the results instead of saving them):

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from time import sleep

PATH = r"C:\Users\XXX\Documents\scraping\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://theuselessweb.com/")
driver.switch_to.window(driver.window_handles[-1])
button = driver.find_element_by_id("button")

for i in range(3):
    button.click()
    sleep(2)
    driver.switch_to.window(driver.window_handles[-1])
    print(driver.current_url)
    print(driver.title)
    driver.close()

Error(s):

DevTools listening on ws://127.0.0.1:60235/devtools/browser/a5ea4ab0-fba6-4a34-b0ee-8926876c554f
[11636:4168:0626/143411.535:ERROR:device_event_log_impl.cc(214)] [14:34:11.535] USB: usb_device_handle_win.cc:1058 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)
[11636:4168:0626/143411.552:ERROR:device_event_log_impl.cc(214)] [14:34:11.552] USB: usb_device_handle_win.cc:1058 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)
[11636:4168:0626/143411.555:ERROR:device_event_log_impl.cc(214)] [14:34:11.555] USB: usb_device_handle_win.cc:1058 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)
https://thatsthefinger.com/           #this is what I want
The finger, deal with it.             #this is what I want
Traceback (most recent call last):
  File "C:\Users\XXX\Documents\scraping\programs\linkscraping.py", line 16, in <module>
    button.click()
  File "C:\Users\XXX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\selenium\webdriver\remote\webelement.py", line 80, in click
    self._execute(Command.CLICK_ELEMENT)
  File "C:\Users\XXX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\selenium\webdriver\remote\webelement.py", line 633, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\XXX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\XXX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=91.0.4472.124)

It prints the URL and title of the first website and then crashes. Also, every time I run driver.get(ANYURL), it opens the link AND a Chrome settings tab (chrome://settings/triggeredResetProfileSettings). Maybe that is what breaks it; either way, it would be really helpful to get rid of this unwanted window too.


Solution

  • Here is a solution to the problem. It still opens every link, but since Chrome runs headless the windows are not visible to the user.

    Here, x is the number of random websites you want to extract.

    The code opens the site, clicks the button x times, then switches to each newly opened tab in turn and prints its URL and title. No tab is closed while the loops run, so the driver never ends up pointing at a closed window. At the end, it closes Chrome.

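    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from webdriver_manager.chrome import ChromeDriverManager

    # Run Chrome headless so the opened tabs are not shown on screen.
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(
        ChromeDriverManager().install(),
        options=options
    )

    x = 10  # number of random websites to open

    driver.get('https://theuselessweb.com/')
    button = driver.find_element(By.ID, "button")

    # Click x times; the driver stays on the first window while new tabs open.
    for i in range(x):
        button.click()

    # Assumes window_handles[0] is the start page, so the new tabs start at index 1.
    for i in range(x):
        driver.switch_to.window(driver.window_handles[i+1])
        print(driver.current_url)
        print(driver.title)

    driver.quit()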
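
The traceback in the question is raised by button.click() on the second pass through the loop: after driver.close(), the driver is still pointed at the tab that was just closed until switch_to.window is called again. If you would rather close each tab as you go, a minimal sketch along those lines could look like the following. The helper names (wait_for_new_handle, collect_new_tabs, save_rows), the settle delay and the CSV layout are illustrative assumptions, not part of Selenium or of the answer above. The helpers only use driver attributes that already appear in the question (window_handles, current_window_handle, switch_to.window, current_url, title, close).

```python
import csv
import time


def wait_for_new_handle(driver, known, timeout=10.0, poll=0.2):
    """Return the first window handle not in `known`, polling until `timeout`."""
    deadline = time.monotonic() + timeout
    while True:
        fresh = [h for h in driver.window_handles if h not in known]
        if fresh:
            return fresh[0]
        if time.monotonic() >= deadline:
            raise TimeoutError("no new tab appeared within %.1f seconds" % timeout)
        time.sleep(poll)


def collect_new_tabs(driver, button, count, settle=2.0):
    """Click `button` `count` times and return one (url, title) pair per opened tab."""
    main = driver.current_window_handle  # the tab that holds the button
    rows = []
    for _ in range(count):
        known = set(driver.window_handles)  # handles that existed before the click
        button.click()
        new_tab = wait_for_new_handle(driver, known)
        driver.switch_to.window(new_tab)
        time.sleep(settle)  # crude page-load wait, like sleep(2) in the question
        rows.append((driver.current_url, driver.title))
        driver.close()                 # closes only the tab that was just opened
        driver.switch_to.window(main)  # go back, so the next click has a live window
    return rows


def save_rows(rows, path):
    """Write the collected (url, title) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "title"])
        writer.writerows(rows)


# Usage with a real driver that is already on https://theuselessweb.com/
# (hypothetical output file name, not executed here):
#   button = driver.find_element(By.ID, "button")
#   save_rows(collect_new_tabs(driver, button, 3), "useless_web.csv")
```

Comparing the handle list before and after each click avoids relying on window_handles[-1], which may point at some other tab, such as the unwanted settings tab mentioned in the question.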