I would like to get the full list of words that are given as dutch word = english word from several pages.
By examining the HTML, this means I need to get the text of every li of every ul in the child div of #mw-content-text.
Here is my code:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('headless') # start chrome without opening window
driver = webdriver.Chrome(chrome_options=options)
listURL = [
"https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
"https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
"https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
"https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]
list_text = []
for url in listURL:
    driver.get(url)
    elem = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div/ul')
    for each_ul in elem:
        all_li = each_ul.find_elements_by_tag_name("li")
        for li in all_li:
            list_text.append(li.text)
print(list_text)
Here is the output:
['man = man', 'vrouw = woman', 'jongen = boy', 'ik = I', 'ben = am', 'een = a/an', 'en = and', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
I don't understand why the text of some li elements is not retrieved even though their XPath is the same (I double-checked several of them using Copy XPath in the developer console).
Try waiting for the page to load fully before parsing it. One way is to use time.sleep():
from time import sleep
...
for url in listURL:
    driver.get(url)
    sleep(5)  # give the page time to finish loading
...
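A fixed sleep is either longer than needed or too short on a slow connection. A rough alternative is to poll until the list items actually have text. The sketch below is library-free so it runs without a browser; wait_for, fetch and is_ready are hypothetical names, not part of Selenium (Selenium ships its own WebDriverWait for the same purpose). The fake fetch only stands in for reading li.text from the driver.

```python
import time


def wait_for(fetch, is_ready, timeout=10.0, poll=0.5):
    """Call fetch() until is_ready(result) is true or the timeout elapses.

    Returns the last result either way, so the caller can inspect it.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_ready(result) or time.monotonic() >= deadline:
            return result
        time.sleep(poll)


# Stand-in for the browser: the first two reads return empty text.
_reads = {"count": 0}


def fake_fetch():
    _reads["count"] += 1
    return ["man = man"] if _reads["count"] >= 3 else [""]


# Ready means: at least one item, and no item is an empty string.
texts = wait_for(fake_fetch, lambda t: bool(t) and all(t), timeout=2, poll=0.01)
print(texts)  # ['man = man']
```

With the real driver, fetch could be something like a lambda that returns [li.text for li in driver.find_elements_by_xpath(...)], using the XPath from the question. If the strings are still empty when the timeout is reached, loading time is probably not the cause.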
EDIT: Using BeautifulSoup:
import requests
from bs4 import BeautifulSoup
listURL = [
"https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
"https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
"https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
"https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]
list_text = []
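for url in listURL:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print("Link:", url)
    # Match the "Lesson N" headings but skip the "Lessons" section heading
    for tag in soup.select("[id*=Lesson]:not([id*=Lessons])"):
        print(tag.text)
        print()
        # The word list is the first <ul> after each lesson heading
        print(tag.find_next("ul").text)
        print("-" * 80)
    print()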
Output (truncated):
Link: https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1
Lesson 1
man = man
vrouw = woman
jongen = boy
ik = I
ben = am
een = a/an
en = and
--------------------------------------------------------------------------------
Lesson 2
meisje = girl
kind = child/kid
hij = he
ze = she (unstressed)
is = is
of = or
--------------------------------------------------------------------------------
Lesson 3
appel = apple
... and so on
If you want the output as a list:
for url in listURL:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print("Link:", url)
    # Every <li> inside a <ul> that is a direct child of the article body
    print([tag.text for tag in soup.select(".mw-parser-output > ul li")])
    print("-" * 80)
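Since the goal is dutch word = english word pairs, the strings in that list can then be split on the "=" sign. This is a minimal sketch, assuming every entry has the form shown in the output above; parse_pairs is a hypothetical helper name, and entries without an "=" (such as the empty strings from the Selenium attempt) are skipped.

```python
def parse_pairs(lines):
    """Turn 'dutch = english' strings into (dutch, english) tuples."""
    pairs = []
    for line in lines:
        dutch, sep, english = line.partition("=")
        if not sep:
            continue  # no "=" found: empty string or unrelated list item
        pairs.append((dutch.strip(), english.strip()))
    return pairs


sample = ["man = man", "vrouw = woman", "een = a/an", "", ""]
print(parse_pairs(sample))
# [('man', 'man'), ('vrouw', 'woman'), ('een', 'a/an')]
```

Passing the result to dict() gives a Dutch-to-English lookup, at the cost of keeping only the last translation if a word appears twice.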