Web Scraping Thesaurus using Selenium


I'm fairly new to web scraping, but I need to scrape the Thesaurus website for a project I'm working on. I have successfully written a program using beautifulsoup4 that asks the user for a word and returns the most likely synonyms according to Thesaurus. However, I would like to get not only those synonyms but also the synonyms for every sense of the word (which Thesaurus shows as a list of buttons above the synonyms). I noticed that the class names also change when a button is clicked, so I did a little digging and decided to go with Selenium instead of beautifulsoup. I now have code that types a word into the search bar and submits it. However, I'm unable to get the synonyms or the buttons, simply because find_element finds nothing, and being new to this, I'm afraid I'm using the wrong syntax.

This is my code at the moment (it looks for synonyms of "good"):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time

PATH = r"C:\Program Files (x86)\chromedriver_win32\chromedriver.exe"  # raw string: backslashes are kept literally
driver = webdriver.Chrome(PATH)

driver.get("https://thesaurus.com")

search = driver.find_element_by_id("searchbar_input")
search.send_keys('good')
search.send_keys(Keys.RETURN)

try:
    headword = WebDriverWait(driver,10).until(
        EC.presence_of_element_located((By.ID, "headword"))
    )
    
    print(headword.text)
    #buttons = headword.find_element_by_class_name("css-bjn8wh e1br8a1p0")
    #print(buttons.text)

    meanings = WebDriverWait(driver,10).until(
        EC.presence_of_element_located((By.ID, "meanings"))
    )
    print(meanings.text)

    #words = meanings.find_elements_by_class_name("css-1kg1yv8 eh475bn0")
    #print(words.text)
    
    

except Exception as exc:
    # Show the actual error instead of hiding it behind a bare except
    print('failed:', exc)
    driver.quit()

For the first part, I want to access the buttons. The headword is the element that contains all the buttons I want to press. This is the headword element according to the inspect tool:

<div id="headword" class="css-bjn8wh e1br8a1p0">
    <div class="css-vw3jp5 e1ibdjtj4">
         *unnecessary stuff*
    <div class="css-bjn8wh e1br8a1p0">
        <div class="postab-container css-cthfds ew5makj3">
            <ul class="css-gap396 ew5makj2">
                <li data-test-pos-tab="true" class="active-postab css-kgfkmr ew5makj4"> 
                    <a class="css-sc11zf ew5makj1">
                        <em class="css-1v93s5a ew5makj0">adj.</em>
                        <strong>pleasant, fine</strong>
                    </a>
                </li>
                <li data-test-pos-tab="true" class=" css-1ha4k0a ew5makj4">
                     *similar stuff*
                <li data-test-pos-tab="true" class=" css-1ha4k0a ew5makj4">
                ...

where each one of these <li data-test-pos-tab="true" class=" css-1ha4k0a ew5makj4"> is a button I want to click. So far I have tried a bunch of things like the one shown in the code above, and also things like:

buttons = headword.find_elements_by_class_name("css-1ha4k0a ew5makj4")
buttons = headword.find_elements_by_css_selector("css-1ha4k0a ew5makj4")
buttons = headword.find_elements_by_class_name("postab-container css-cthfds ew5makj3")
buttons = headword.find_elements_by_css_selector("postab-container css-cthfds ew5makj3")

but in no case can Selenium find these elements.

For the second part, I want the synonyms. Here is the meanings element:

<div id="meanings" class="css-16lv1yi e1qo4u831">
    <div class="css-1f3egm3 efhksxz0">
        *unnecessary stuff*
    <div data-testid="word-grid-container" class="css-ixatld e1cc71bi0">
        <ul class="css-1ngwve3 e1ccqdb60">
            <li>
                <a font-weight="inherit" href="/browse/acceptable" data-linkid="nn1ov4" class="css-1kg1yv8 eh475bn0">
                </a>
            </li>
            <li>
                <a font-weight="inherit" href="/browse/bad" data-linkid="nn1ov4" class="css-1kg1yv8 eh475bn0">
            ...

where each of these elements is a synonym I want to get. As in the previous case, I tried several things, such as:

synGrid = meanings.find_element_by_class_name("css-ixatld e1cc71bi0")
synGrid = meanings.find_element_by_css_selector("css-ixatld e1cc71bi0")
words = meanings.find_elements_by_class_name("css-1kg1yv8 eh475bn0")
words = meanings.find_elements_by_css_selector("css-1kg1yv8 eh475bn0")

And again Selenium cannot find these elements. I would really appreciate some help with this, even if it is just a push in the right direction rather than a full solution. I hope I have included all the needed information; if not, please let me know.


Solution

  • If you use a CSS selector, you have to put a dot before each class name (css_selector(...) here is shorthand for find_element_by_css_selector(...))

    css_selector(".css-ixatld.e1cc71bi0") 
    

    and a hash before an id

    css_selector("#headword") 
    

    just as you would in a .css file.

    In a CSS selector you can also use the other selector syntax available in CSS.
    See CSS selectors on w3schools.com


    Selenium converts class_name to a CSS selector, but class_name() expects a single class name, and Selenium runs into problems when there are two or more. During the conversion it adds a dot only before the first name, whereas a dot is needed before the second and every later name as well. So you have to add the remaining dots manually:

    class_name("css-ixatld.e1cc71bi0")
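
The rule above (one dot before every class name) can be sketched as a small helper. This is only an illustration, not part of Selenium: the names classes_to_css and find_by_classes are hypothetical, and the sketch assumes the element or driver you pass in offers the generic find_elements(by, value) call. The string "css selector" is the value of Selenium's By.CSS_SELECTOR, written out here so the snippet runs without Selenium installed.

```python
# Value of selenium.webdriver.common.by.By.CSS_SELECTOR, spelled out so this
# sketch has no Selenium import.
CSS_SELECTOR = "css selector"


def classes_to_css(class_attr, tag=""):
    """Turn a class attribute value such as "css-ixatld e1cc71bi0"
    into a compound CSS selector such as ".css-ixatld.e1cc71bi0".

    Hypothetical helper; `tag` optionally prefixes an element name.
    """
    names = class_attr.split()  # split on whitespace, ignoring extra spaces
    if not names:
        raise ValueError("class attribute is empty")
    # Every class name needs its own leading dot.
    return tag + "".join("." + name for name in names)


def find_by_classes(root, class_attr, tag=""):
    """Hypothetical helper: `root` is assumed to be a WebDriver or WebElement,
    i.e. anything exposing find_elements(by, value)."""
    return root.find_elements(CSS_SELECTOR, classes_to_css(class_attr, tag))


if __name__ == "__main__":
    print(classes_to_css("css-ixatld e1cc71bi0"))             # .css-ixatld.e1cc71bi0
    print(classes_to_css(" css-1ha4k0a ew5makj4", tag="li"))  # li.css-1ha4k0a.ew5makj4
```

With the meanings element from the question, a call such as find_by_classes(meanings, "css-1kg1yv8 eh475bn0", tag="a") would then search for the synonym links. Class names like css-1kg1yv8 look auto-generated and may change between page versions, so an attribute selector built from the markup in the question, for example li[data-test-pos-tab='true'] for the buttons, may turn out to be more stable.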