Link extraction using XPath and BeautifulSoup not working


I want to extract a link that is nested at /html/body/div[1]/div[2]/div[1]/div/div/div/div/div/a in XPath terms (see also the detailed nesting image).

If it helps, these div elements also have classes.

I tried:

from selenium import webdriver
from bs4 import BeautifulSoup

browser=webdriver.Chrome()
browser.get('https://www.visionias.in/resources/daily_current_affairs_programs.php?type=1&m=05&y=2024')

soup=BeautifulSoup(browser.page_source)

element = soup.find_element_by_xpath("./html/body/div[1]/div[2]/div[1]/div/div/div/div/div/a")
href = element.get_attribute('href')
print(href)

This code gave the following error:

 line 9, in <module>
    element = soup.find_element_by_xpath("./html/body/div[1]/div[2]/div[1]/div/div/div/div/div/a")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable

I also tried another approach:

from selenium import webdriver
from bs4 import BeautifulSoup

browser=webdriver.Chrome()
browser.get('https://www.visionias.in/resources/daily_current_affairs_programs.php?type=1&m=05&y=2024')

soup=BeautifulSoup(browser.page_source)

href = soup('a')('div')[1]('div')[2]('div')[1]('div')[0]('div')[0]('div')[0]('div')[0]('div')[0][href]
#href = element.get_attribute('href')
print(href)

This gave the following error:

    href = soup('a')('div')[1]('div')[2]('div')[1]('div')[0]('div')[0]('div')[0]('div')[0]('div')[0][href]
           ^^^^^^^^^^^^^^^^
TypeError: 'ResultSet' object is not callable

The expected output is https://www.visionias.in/resources/material/?id=3731&type=daily_current_affairs or material/?id=3731&type=daily_current_affairs

Some other links have the same kind of nesting as above. Is there any way to filter the links using the text inside /html/body/div[1]/div[2]/div[1]/div/div/p? For example, the text here is 18 May 2024. This p tag also has an id, but it is not consistent and does not follow a pattern, so it is not really usable for me.

I have seen other answers on Stack Overflow, but they are not working for me.

Also, if possible, please elaborate in the answer, as I have to apply the same code to some other sites as well.


Solution

  • Refer to the Selenium code below to extract all the links and print them to the console:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get("https://www.visionias.in/resources/daily_current_affairs_programs.php?type=1&m=05&y=2024")
    wait = WebDriverWait(driver, 10)
    
    links = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='center']//a")))
    
    for link in links:
        print(link.get_attribute("href"))
    

    Console output:

    https://www.visionias.in/resources/material?id=3731&type=daily_current_affairs
    https://www.visionias.in/resources/material?id=3729&type=daily_current_affairs
    https://www.visionias.in/resources/material?id=3727&type=daily_current_affairs
    https://www.visionias.in/resources/material?id=3723&type=daily_current_affairs
    https://www.visionias.in/resources/material?id=3717&type=daily_current_affairs
    https://www.visionias.in/resources/material?id=3715&type=daily_current_affairs
    https://www.visionias.in/resources/material?id=3705&type=daily_current_affairs
    https://www.visionias.in/resources/material?id=3703&type=daily_current_affairs
    https://www.visionias.in/resources/material?id=3701&type=daily_current_affairs
    https://www.visionias.in/resources/material?id=3699&type=daily_current_affairs
    https://www.visionias.in/resources/material?id=3690&type=daily_current_affairs
    https://www.visionias.in/resources/material?id=3688&type=daily_current_affairs
    https://www.visionias.in/resources/material?id=3687&type=daily_current_affairs
    https://www.visionias.in/resources/material?id=3684&type=daily_current_affairs
    https://www.visionias.in/resources/material?id=3682&type=daily_current_affairs
    https://www.visionias.in/resources/material?id=3676&type=daily_current_affairs
    
    Process finished with exit code 0
    

    SUGGESTION: I highly recommend reading about absolute and relative XPaths, and about the advantages of relative XPaths over absolute ones.
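
To make the difference concrete, here is a minimal offline sketch. It assumes a simplified, hypothetical stand-in for the page (the `SAMPLE` markup is invented for illustration and is not the real site's HTML) and uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup. It shows that a position-based path stops matching after a small layout change, while a path anchored on the class name still works.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page structure (not the real markup).
SAMPLE = """
<html>
  <body>
    <div>
      <div class="header">Menu</div>
      <div class="center">
        <p>18 May 2024</p>
        <a href="material?id=3731&amp;type=daily_current_affairs">Open</a>
      </div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# Absolute-style path: every step depends on the exact position of each div.
ABSOLUTE_PATH = "./body/div[1]/div[2]/a"
# Relative path: anchored on the class name, wherever that div happens to sit.
RELATIVE_PATH = ".//div[@class='center']//a"

absolute_before = root.find(ABSOLUTE_PATH)
relative_before = root.find(RELATIVE_PATH)
print(absolute_before.get("href"))
print(relative_before.get("href"))

# Simulate a layout change: a new banner <div> appears at the top of <body>.
body = root.find("body")
body.insert(0, ET.Element("div", {"class": "banner"}))

absolute_after = root.find(ABSOLUTE_PATH)  # no longer matches anything
relative_after = root.find(RELATIVE_PATH)  # still finds the same link
print(absolute_after)
print(relative_after.get("href"))
```

In Selenium the same idea applies: an expression such as `//div[@class='center']//a` depends only on the class name, not on how many wrapper elements come before it.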

    UPDATE: Use the code below if you want to extract the link for a specific date.

    link = wait.until(EC.visibility_of_element_located((By.XPATH, "//p[contains(text(),'18 May 2024')]//following::a[1]")))
    print(link.get_attribute("href"))
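
As for the BeautifulSoup attempts in the question: BeautifulSoup does not evaluate XPath, and `find_element_by_xpath` and `get_attribute` are Selenium methods, not BeautifulSoup ones. If you prefer to parse the page source with BeautifulSoup, a rough equivalent of both lookups could look like the sketch below. It runs on a hypothetical, simplified HTML sample rather than the real page. `SAMPLE_HTML` and `BASE_URL` are illustrative names, and the `center` class is taken from the XPath used above, so the selectors may need adjusting for the real markup.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base URL, used to turn relative hrefs into absolute ones.
BASE_URL = "https://www.visionias.in/resources/daily_current_affairs_programs.php?type=1&m=05&y=2024"

# Hypothetical stand-in for browser.page_source (not the real markup).
SAMPLE_HTML = """
<div class="center">
  <div><p id="x1">18 May 2024</p><a href="material?id=3731&amp;type=daily_current_affairs">Open</a></div>
  <div><p id="q7">17 May 2024</p><a href="material?id=3729&amp;type=daily_current_affairs">Open</a></div>
</div>
"""

# Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# All links inside div.center, selected with a CSS selector instead of XPath.
all_links = [urljoin(BASE_URL, a["href"]) for a in soup.select("div.center a[href]")]

# The link for one date: find the <p> by its text, then the first <a> after it.
date_p = soup.find("p", string=re.compile("18 May 2024"))
link_for_date = urljoin(BASE_URL, date_p.find_next("a")["href"]) if date_p else None

print(all_links)
print(link_for_date)
```

In the real script, `SAMPLE_HTML` would be replaced with `driver.page_source`, read after the explicit wait has succeeded so that the links are already present in the page.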