Selenium python: get all the <li> text of all the <ul> from a <div>


I would like to collect all the word pairs that are listed in the form dutch word = english word on several pages.

From examining the HTML, this means I need the text of every li in every ul under the child div of #mw-content-text.

Here is my code:

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # start Chrome without opening a window
driver = webdriver.Chrome(options=options)

listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]


list_text = []
for url in listURL:
    driver.get(url)
    # every <ul> directly under the child <div> of #mw-content-text
    elem = driver.find_elements(By.XPATH, '//*[@id="mw-content-text"]/div/ul')
    for each_ul in elem:
        all_li = each_ul.find_elements(By.TAG_NAME, "li")
        for li in all_li:
            list_text.append(li.text)

print(list_text)

Here is the output:

['man = man', 'vrouw = woman', 'jongen = boy', 'ik = I', 'ben = am', 'een = a/an', 'en = and', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

I don't understand why the text of some li elements is not retrieved even though their XPath is the same (I double-checked several of them using "Copy XPath" in the developer console).


Solution

  • Try waiting for the page to load fully before parsing it. One way is to call time.sleep():

    from time import sleep
    ...
    
    for url in listURL:
        driver.get(url)
        sleep(5)
        ...
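
A fixed sleep(5) is either longer than needed or, on a slow connection, too short. As a rough sketch of the alternative, the helper below polls a condition until it holds or a timeout expires. The name wait_until and its parameters are hypothetical and only illustrate the idea; they are not part of Selenium.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns a truthy value or the timeout elapses.

    Returns the truthy value; raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

In the scraping loop, the predicate could be a lambda that returns the list of matching elements, so the loop continues as soon as at least one is found. Selenium also ships its own polling wait, WebDriverWait in selenium.webdriver.support.ui, which is probably the better choice in real code.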
    

    EDIT: Alternatively, fetch the pages with requests and parse them with BeautifulSoup, which avoids the browser entirely:

    import requests
    from bs4 import BeautifulSoup
    
    
    listURL = [
        "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
        "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
        "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
        "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
    ]
    
    
    list_text = []
    for url in listURL:
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        print("Link:", url)
        
        for tag in soup.select("[id*=Lesson]:not([id*=Lessons])"):
            print(tag.text)
            print()
            print(tag.find_next("ul").text)
            print("-" * 80)
        print()
    

    Output (truncated):

    Link: https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1
    Lesson 1
    
    man = man
    vrouw = woman
    jongen = boy
    ik = I
    ben = am
    een = a/an
    en = and
    --------------------------------------------------------------------------------
    Lesson 2
    
    meisje = girl
    kind = child/kid
    hij = he
    ze = she (unstressed)
    is = is
    of = or
    --------------------------------------------------------------------------------
    Lesson 3
    
    appel = apple
    
    ... and so on
    

    If you want the output as a list:

    for url in listURL:
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        print("Link:", url)
        print([tag.text for tag in soup.select(".mw-parser-output > ul li")])
        print("-" * 80)
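
Since the goal in the question is a list of dutch word = english word pairs, the collected strings can be split further. The sketch below assumes every useful entry contains the separator " = " (with spaces, as in the output shown above) and skips empty strings. The function name parse_pairs is made up for illustration.

```python
def parse_pairs(entries):
    """Turn strings like 'vrouw = woman' into (dutch, english) tuples.

    Empty strings and entries without ' = ' are skipped.
    """
    pairs = []
    for entry in entries:
        dutch, sep, english = entry.partition(" = ")
        if sep and dutch.strip() and english.strip():
            pairs.append((dutch.strip(), english.strip()))
    return pairs


sample = ["man = man", "vrouw = woman", "", "een = a/an", ""]
print(parse_pairs(sample))
# [('man', 'man'), ('vrouw', 'woman'), ('een', 'a/an')]
```

Passing the list built from soup.select(...) (or list_text from the Selenium version) to parse_pairs would give the pairs for each page. Entries with several translations, such as "een = a/an", are kept as a single string on the English side.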