
How to get <b> tag without class using Selenium


I am trying to scrape information about a product listed here, using Selenium in Google Colab. I cannot access the text inside the b tag. Other attributes such as name, seller, and price can be scraped without problems.

This is the snippet of the HTML.

<div class="css-1le9c0d pad-bottom">
    <img src="https://assets.tokopedia.net/assets-tokopedia-lite/v2/zeus/kratos/3ac8f50c.svg" alt="">
    <div>Dikirim dari 
      <b>Kota Depok</b>
    </div>
</div>

These are my driver settings.

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
# Create a single driver; a second webdriver.Chrome(...) call would launch another browser.
driver = webdriver.Chrome('chromedriver', options=options)
# Hide the navigator.webdriver flag and override the user agent.
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                                                     'AppleWebKit/537.36 (KHTML, like Gecko) '
                                                                     'Chrome/85.0.4183.102 Safari/537.36'})

This is the code that I have tried.

import time

import numpy as np
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

sample_link = 'https://www.tokopedia.com/naturashop27/bio-oil-original-penghilang-bekas-luka-strecth-mark-isi-125ml?whid=0'
driver.get(sample_link)
time.sleep(1.5)

# Product name
try:
    product = driver.find_elements_by_tag_name('h1')[0].text
except Exception:
    product = np.nan

# Shop URL
try:
    shop_url = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='css-1n8curp']"))).get_attribute("href")
except Exception:
    shop_url = np.nan

# ....

# Shipping location (the text inside the b tag)
try:
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[contains(@class,'pad-bottom')]//b")))
    loc = driver.find_element_by_xpath("//div[contains(@class,'pad-bottom')]//b").text
except Exception:
    loc = np.nan

This is the output from the code above. As you can see, the value scraped for the b tag is nan instead of Kota Depok.

Bio Oil Original Penghilang Bekas Luka & Strecth Mark isi 125ml
https://www.tokopedia.com/naturashop27
nan

Please see the solution below. In short:

  • The element is not fully loaded (or not yet in view) when the code tries to scrape it.
  • In Colab, setting an explicit window size with driver.set_window_size(1124,850) works.
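
One way to apply the window-size fix in a headless Colab session is sketched below. It assumes Chrome and a matching chromedriver are already installed and on the PATH. The names `headless_chrome_args` and `make_driver` are hypothetical helpers, not part of Selenium. The `--window-size` flag is passed in addition to `set_window_size` because `start-maximized` typically has little effect in headless mode.

```python
WINDOW_WIDTH, WINDOW_HEIGHT = 1124, 850  # size quoted in the answer


def headless_chrome_args(width=WINDOW_WIDTH, height=WINDOW_HEIGHT):
    """Return Chrome arguments for a headless session with a fixed window size."""
    return [
        "--headless",
        "--no-sandbox",
        "--disable-dev-shm-usage",
        f"--window-size={width},{height}",
    ]


def make_driver():
    """Build a headless Chrome driver (hypothetical helper; requires Selenium)."""
    # Imported here so the argument helper above works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in headless_chrome_args():
        options.add_argument(arg)
    driver = webdriver.Chrome(options=options)
    # Set the size again after start-up, as suggested in the answer.
    driver.set_window_size(WINDOW_WIDTH, WINDOW_HEIGHT)
    return driver
```

Calling `make_driver()` and then `driver.get(sample_link)` would replace the driver setup from the question; whether this alone is enough for a given page is something to verify in your own session.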

Solution

  • You may want to try this:

    The element is not in the Selenium viewport, so you need to scroll down a little before reading it.

    try:
        # Scroll down slightly so the element enters the viewport.
        driver.execute_script("window.scrollTo(0, 100)")
        loc = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[contains(@class, 'pad-bottom')]"))).text
    except Exception:
        loc = np.nan
    print(loc)
    

    Output:

    Dikirim dari Kota Depok
    
    Process finished with exit code 0
    

    I used the XPath //div[contains(@class, 'pad-bottom')], which matches the outer div and therefore prints Dikirim dari Kota Depok.

    If you use //div[contains(@class,'pad-bottom')]//b instead, you get only the text of the b tag: Kota Depok.
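
The difference between the two XPaths can be checked offline with the standard library, without a browser. This is only an illustration: the snippet from the question is wrapped in a `body` element and the `img` tag is self-closed so that it parses as XML, and because `xml.etree.ElementTree` does not support `contains()`, the full class value is matched instead.

```python
import xml.etree.ElementTree as ET

# The HTML from the question, adapted to well-formed XML for this illustration.
SNIPPET = """
<body>
  <div class="css-1le9c0d pad-bottom">
    <img src="https://assets.tokopedia.net/assets-tokopedia-lite/v2/zeus/kratos/3ac8f50c.svg" alt=""/>
    <div>Dikirim dari
      <b>Kota Depok</b>
    </div>
  </div>
</body>
"""

root = ET.fromstring(SNIPPET)

# Outer div: ElementTree has no contains(), so match the exact class attribute.
outer = root.find(".//div[@class='css-1le9c0d pad-bottom']")

# Text of the outer div and all of its descendants, with whitespace collapsed.
full_text = " ".join("".join(outer.itertext()).split())

# Text of the b descendant only.
bold_text = outer.find(".//b").text

print(full_text)  # Dikirim dari Kota Depok
print(bold_text)  # Kota Depok
```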

    Update 1 :

    driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get("https://www.tokopedia.com/naturashop27/bio-oil-original-penghilang-bekas-luka-strecth-mark-isi-125ml?whid=0")
    # One shared wait object with the 20-second timeout used for every lookup.
    wait = WebDriverWait(driver, 20)

    # Product name
    try:
        product = wait.until(EC.visibility_of_element_located((By.TAG_NAME, "h1"))).text
    except Exception:
        product = np.nan
    print(product)

    # Shop URL
    try:
        shop_url = wait.until(EC.element_to_be_clickable((By.XPATH, "//a[@data-testid='llbPDPFooterShopName']"))).get_attribute("href")
    except Exception:
        shop_url = np.nan
    print(shop_url)

    # Shipping location: scroll first so the element enters the viewport.
    try:
        driver.execute_script("window.scrollTo(0, 100)")
        loc = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[contains(@class, 'pad-bottom')]"))).text
    except Exception:
        loc = np.nan
    print(loc)
    

    This gives me:

    Bio Oil Original Penghilang Bekas Luka & Strecth Mark isi 125ml
    https://www.tokopedia.com/naturashop27
    Dikirim dari Kota Depok
    
    Process finished with exit code 0
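
The repeated try/except blocks above all follow the same pattern: attempt a lookup and fall back to nan if it fails. A minimal sketch of that pattern as a reusable function follows. `value_or_nan` is a hypothetical name, `float("nan")` stands in for `np.nan` so the example needs no third-party packages, and the two small functions are stand-ins for real Selenium lookups.

```python
import math

NAN = float("nan")  # plain float nan, used here in place of np.nan


def value_or_nan(getter, default=NAN):
    """Call getter() and return its result, or default if it raises."""
    try:
        return getter()
    except Exception:
        return default


# Stand-ins for Selenium lookups (assumptions for this illustration).
def found():
    return "Dikirim dari Kota Depok"


def missing():
    raise TimeoutError("element never became visible")


loc = value_or_nan(found)
missing_loc = value_or_nan(missing)

print(loc)
print(math.isnan(missing_loc))
```

With Selenium, the getter could be a lambda wrapping one of the waits above, for example `lambda: wait.until(EC.visibility_of_element_located((By.XPATH, "//div[contains(@class, 'pad-bottom')]"))).text`.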