python selenium web-scraping data-mining data-extraction

How to locate an element within bad HTML using Python Selenium


I want to scrape the Athletic Director's information from this page. The problem is that every person's name and email on the page are labelled with the same <strong> tags, so a generic selector matches everyone. I need an XPath that extracts only the Athletic Director's name and email. Here is the link to the website, for a better understanding of the markup: https://fhsaa.com/sports/2020/1/28/member_directory.aspx

<div id="school_detail"><div class="row" align="center"><div class="bottom-spacing col-md-4"><button class="btn btn-primary btn-md" onclick="showAthleticFaculty(10)">Athletic Faculty</button></div><div class="bottom-spacing col-md-4"><button class="btn btn-primary btn-md" onclick="showCoachesAndSports(10)">Coaches &amp; Sports</button></div><div class="bottom-spacing col-md-4"><button class="btn btn-primary btn-md" onclick="showSchoolDetail(10)">School Information</button></div></div><br><h5 align="center"><u><strong>American (Hialeah) - Athletic Faculty</strong></u></h5><div><h6 class="athletic-faculty-header">Volunteer</h6></div><strong>N/A</strong><br><br><div><h6 class="athletic-faculty-header">Principal/Head Master</h6></div><strong>Name:</strong>  Stephen <br><strong>Email:</strong> <a href="mailto:[email protected]">[email protected]</a><br><br><div><h6 class="athletic-faculty-header">Athletic Director</h6></div><strong>Name:</strong>  Marcus Gabriel<br><strong>Email:</strong> <a href="mailto:[email protected]">[email protected]</a><br><br><div><h6 class="athletic-faculty-header">Assistant/Co AD</h6></div><strong>Name:</strong>  Ginette Torres<br><strong>Email:</strong> <a href="mailto:[email protected]">[email protected]</a><br><br><div><h6 class="athletic-faculty-header">Assistant/Vice Principal</h6></div><strong>Name:</strong>  Alex Gonzalez<br><strong>Email:</strong> <a href="mailto:[email protected]">[email protected]</a><br><br><div><h6 class="athletic-faculty-header">Administrative Assistant/Athletics</h6></div><strong>Name:</strong>  Shanell <br><strong>Email:</strong> <a href="mailto:[email protected]">[email protected]</a><br><br><div><h6 class="athletic-faculty-header">Financial/Bookkeeper Contact</h6></div><strong>Name:</strong>  Christopher Keighley<br><strong>Email:</strong> <a href="mailto:[email protected]">[email protected]</a><br><br><div><h6 class="athletic-faculty-header">Athletic Trainer</h6></div><strong>Name:</strong>  Gorin 
Aaron<br><strong>Email:</strong> <a href="mailto:[email protected]">[email protected]</a><br><br><div><h6 class="athletic-faculty-header">Medical - First Responder</h6></div><strong>N/A</strong><br><br></div>


Solution

  • To get the email address, use this XPath:

    //h6[text()='Athletic Director']/../following-sibling::strong[text()='Email:']/following-sibling::a
    

    Update: in Selenium, pass that XPath to find_element and read the link text:

    from selenium.webdriver.common.by import By
    print(driver.find_element(By.XPATH, "//h6[text()='Athletic Director']/../following-sibling::strong[text()='Email:']/following-sibling::a").text)
    

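    Update 1: the name is a bare text node right after the "Name:" <strong> element, and find_element can only return elements, not text nodes. So locate the <strong> and read its next sibling with JavaScript: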

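    # Locate the "Name:" label that follows the Athletic Director header.
    elem = driver.find_element(By.XPATH, "//h6[text()='Athletic Director']/../following-sibling::strong[text()='Name:']")
    # The name itself is the text node right after the label; strip the padding around it.
    name = driver.execute_script("return arguments[0].nextSibling.textContent;", elem).strip()
    print(name)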
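
As an alternative to the XPath approach above, here is a minimal sketch that parses the markup with Python's standard html.parser instead, and collects the name and email for every role at once. It is not the answerer's method. The FacultyParser class, the parse_faculty helper, and the example.com addresses in SAMPLE are all made up for illustration. The sketch assumes the page keeps the pattern shown in the question: an <h6> role header followed by "Name:" and "Email:" <strong> labels.

```python
from html.parser import HTMLParser


class FacultyParser(HTMLParser):
    """Collect {role: {"Name": ..., "Email": ...}} from the faculty markup."""

    def __init__(self):
        super().__init__()
        self.faculty = {}   # role -> {label: value}
        self._role = None   # role named by the most recent <h6>
        self._label = None  # "Name" or "Email" from the most recent <strong>
        self._tag = None    # "h6" or "strong" while inside one of those tags

    def handle_starttag(self, tag, attrs):
        if tag in ("h6", "strong"):
            self._tag = tag
        elif tag == "br":
            self._label = None  # a <br> ends the current field

    def handle_endtag(self, tag):
        if tag in ("h6", "strong"):
            self._tag = None

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse padding and stray newlines
        if not text:
            return
        if self._tag == "h6":
            # A new role header starts a new record.
            self._role = text
            self.faculty[text] = {}
            self._label = None
        elif self._tag == "strong":
            # "Name:" -> "Name"; text such as "N/A" is not a field label.
            self._label = text[:-1] if text.endswith(":") else None
        elif self._role and self._label:
            # Plain text (or link text) following a label is the field value.
            fields = self.faculty[self._role]
            fields[self._label] = (fields.get(self._label, "") + " " + text).strip()


def parse_faculty(html):
    """Hypothetical helper: return the role -> fields mapping for an HTML string."""
    parser = FacultyParser()
    parser.feed(html)
    parser.close()
    return parser.faculty


# Shortened copy of the markup from the question, with placeholder emails.
SAMPLE = (
    '<div id="school_detail">'
    '<h5 align="center"><u><strong>American (Hialeah) - Athletic Faculty</strong></u></h5>'
    '<div><h6 class="athletic-faculty-header">Volunteer</h6></div><strong>N/A</strong><br><br>'
    '<div><h6 class="athletic-faculty-header">Athletic Director</h6></div>'
    '<strong>Name:</strong>  Marcus Gabriel<br>'
    '<strong>Email:</strong> <a href="mailto:ad@example.com">ad@example.com</a><br><br>'
    '<div><h6 class="athletic-faculty-header">Assistant/Co AD</h6></div>'
    '<strong>Name:</strong>  Ginette Torres<br>'
    '<strong>Email:</strong> <a href="mailto:co-ad@example.com">co-ad@example.com</a><br><br>'
    '</div>'
)

faculty = parse_faculty(SAMPLE)
print(faculty["Athletic Director"])
```

With Selenium, the HTML string could come from driver.find_element(By.ID, "school_detail").get_attribute("innerHTML"). Roles listed as N/A end up with an empty dict, so a missing Athletic Director does not silently pick up the next person's details.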