Tags: python, selenium, selenium-webdriver, web-scraping, text

Selenium .text doesn't get all text from the webpage, and also doesn't omit struck-out text (<strike> elements)


Hi, I'm having issues with Selenium.

My first issue is that when scraping, I also pick up all of the "strike class" (struck-out) text from the link below, and I want to exclude that text.

My second issue is that Selenium won't capture all the text from the webpage: it gets about half of the text from the link below and stops in the middle of a sentence.

Any help would be greatly appreciated!!!

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assuming Chrome; any WebDriver works
driver.get("https://custom.statenet.com/public/resources.cgi?id=ID:bill:KY2022000H740&ciq=ncsl32&client_md=4cf283759f8caf88663f3af9e2707c25&mode=current_text")

# full page text
all_elems = driver.find_element(By.TAG_NAME, 'body')
all_text = all_elems.text

# struck-out text that should be removed
strikedout_elems = driver.find_element(By.TAG_NAME, 'strike')
strikedout_text = strikedout_elems.text

wanted_text = all_text.replace(strikedout_text, "")
wanted_text = wanted_text.replace("\n\n", "")

Solution

  • You do not need the overhead of Selenium to obtain the text from that page; it can be accomplished with Requests and BeautifulSoup, as shown below (which also ignores the struck-out text):

    import requests
    from bs4 import BeautifulSoup as bs
    
    r = requests.get('https://custom.statenet.com/public/resources.cgi?id=ID:bill:KY2022000H740&ciq=ncsl32&client_md=4cf283759f8caf88663f3af9e2707c25&mode=current_text')
    soup = bs(r.text, 'html.parser')
    
    # drop every struck-out (deleted) passage before extracting the text
    bad_texts = soup.select('strike[class="amendmentDeletedText"]')
    for b in bad_texts:
        b.decompose()
    
    # get_text(strip=True) collapses the leftover whitespace
    text = soup.get_text(strip=True)
    print(text)
    

    This will return all the text on that (static) page and also remove the extra whitespace. Of course, there are more elegant ways of getting that text, such as isolating paragraphs and subtitles; one small refinement is sketched below. However, this should answer your question.
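    For example, if you want the output to keep some structure, you can pass a separator to get_text so each text fragment lands on its own line instead of being glued into one unbroken string. This is a minimal sketch reusing the same request and the same strike selector as above:

    import requests
    from bs4 import BeautifulSoup as bs
    
    r = requests.get('https://custom.statenet.com/public/resources.cgi?id=ID:bill:KY2022000H740&ciq=ncsl32&client_md=4cf283759f8caf88663f3af9e2707c25&mode=current_text')
    soup = bs(r.text, 'html.parser')
    
    # remove the struck-out passages first, as before
    for strike in soup.select('strike[class="amendmentDeletedText"]'):
        strike.decompose()
    
    # separator='\n' keeps one text fragment per line; strip=True still
    # discards the extra whitespace around each fragment
    text = soup.get_text(separator='\n', strip=True)
    print(text)
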

    Requests documentation: https://requests.readthedocs.io/en/latest/

    BeautifulSoup documentation: https://beautiful-soup-4.readthedocs.io/en/latest/
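
    If you would rather stay with Selenium (for instance because you already have a driver session open), a minimal sketch of the same idea is to delete the <strike> nodes from the live DOM with execute_script and then read the body text. This assumes the struck-out passages really are <strike> elements and that the page has finished loading; a dynamically loaded page may also need an explicit wait before reading the text:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    
    driver = webdriver.Chrome()  # assuming Chrome; any WebDriver works
    driver.get("https://custom.statenet.com/public/resources.cgi?id=ID:bill:KY2022000H740&ciq=ncsl32&client_md=4cf283759f8caf88663f3af9e2707c25&mode=current_text")
    
    # find_elements (plural) matters here: find_element only returns the first <strike>
    for strike in driver.find_elements(By.TAG_NAME, 'strike'):
        driver.execute_script("arguments[0].remove();", strike)
    
    wanted_text = driver.find_element(By.TAG_NAME, 'body').text
    print(wanted_text)
    driver.quit()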