I'm having a bit of trouble with my Wikipedia table web scraper: it will not read the text in the cells. I have defined the table with no problems, and the rows with no problems either. My code looks like this:
import requests
from bs4 import BeautifulSoup
import re
import dateutil
result = requests.get('https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population')
assert result.status_code==200
print(result.status_code)
src = result.content
document = BeautifulSoup(src, 'lxml')
table = document.find('table')
table
assert table.find('th').get_text() == "Rank"
rows = table.find_all('tr')
rows
for row in rows[1:-1]:
    cells = row.find_all(['th'], ['td'])
    cells_text = [cell.get_text() for cell in cells]
    print(cells_text)
This provides me the following output:
200
[]
[]
[]
... (an empty list for every remaining row)
Process finished with exit code 0
I have been following this video tutorial: https://www.youtube.com/watch?v=rzYeuMAo4Dw&t=641s. As far as I can see, he has done exactly the same thing as me, but his scraper works where mine doesn't.
I'm at a loss as to what the problem is and how to fix it.
Put "th" and "td" together in a single list inside .find_all():
import requests
from bs4 import BeautifulSoup
result = requests.get(
    "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
)
assert result.status_code == 200
src = result.content
document = BeautifulSoup(src, "lxml")
table = document.find("table")
assert table.find("th").get_text() == "Rank"
rows = table.find_all("tr")
for row in rows[1:-1]:
    cells = row.find_all(["th", "td"])  # <--- put th, td in one list
    cells_text = [cell.get_text(strip=True) for cell in cells]
    print(cells_text)
Prints:
['–', 'World', '7,892,391,000', '100%', '31 Aug 2021', 'UN projection[2]', '']
['1', 'China(more)', 'Asia', '1,411,778,724', '17.9%', '1 Nov 2020', '2020 census result[3]', 'The census figure refers tomainland China, excluding itsspecial administrative regionsofHong KongandMacau, the former of which returned to Chinese sovereignty on 1\xa0July 1997 and the latter on 20\xa0December 1999.']
['2', 'India(more)', 'Asia', '1,381,310,652', '17.5%', '31 Aug 2021', 'National population clock[4]', 'The figure includes the population of India-administered Kashmir but not of China- or Pakistan-administered Kashmir.']
['3', 'United States(more)', 'Americas', '332,282,961', '4.21%', '31 Aug 2021', 'National population clock[5]', 'Includes the50 statesand theDistrict of Columbia, but excludes theU.S. territories.']
['4', 'Indonesia(more)', 'Asia', '271,350,000', '3.44%', '31 Dec 2020', 'National annual estimate[6]', '']
...
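For reference, the reason the original call returned empty lists: the second positional parameter of find_all() is attrs, not another tag name, so find_all(['th'], ['td']) looks for <th> tags that match a 'td' attribute filter instead of matching both cell types. A minimal sketch of the difference, using a made-up one-row table for illustration:

```python
from bs4 import BeautifulSoup

# A tiny hypothetical table row, just to illustrate the two calls.
html = "<tr><th>Rank</th><td>1</td></tr>"
row = BeautifulSoup(html, "html.parser").find("tr")

# The second positional argument of find_all() is attrs, so here
# ['td'] acts as an attribute filter on <th> tags -- nothing matches.
print(row.find_all(["th"], ["td"]))   # -> []

# A single list passed as the name argument matches either tag.
print([c.get_text() for c in row.find_all(["th", "td"])])
```

The second call returns both cells ('Rank' and '1'), which is why moving "td" inside the first list fixes the scraper.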