python, web-scraping, wikipedia

How do I fix my Wikipedia table web scraper? It returns no cell values


I'm having a bit of trouble with my Wikipedia table web scraper: it will not read the text in the cells. I have located the table without problems, and I have located the rows without problems. My code looks like this:

import requests
from bs4 import BeautifulSoup
import re
import dateutil

result = requests.get('https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population')
assert result.status_code==200
print(result.status_code)

src = result.content
document = BeautifulSoup(src, 'lxml')

table = document.find('table')
table

assert table.find('th').get_text() == "Rank"

rows = table.find_all('tr')
rows

for row in rows[1:-1]:
    cells = row.find_all(['th'], ['td'])

    cells_text = [cell.get_text() for cell in cells]

    print(cells_text)

This provides me the following output:

200
[]
[]
[]
... (241 identical empty lists in total)

Process finished with exit code 0

I have been following this video tutorial: https://www.youtube.com/watch?v=rzYeuMAo4Dw&t=641s. As far as I can see, the author has done exactly the same thing as me, but his scraper apparently works where mine doesn't.

I'm at a loss as to exactly what the problem is and how to fix it.


Solution

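  • Pass th and td together in a single list to .find_all. In your code, row.find_all(['th'], ['td']) passes ['td'] as the second positional argument (attrs). Beautiful Soup treats a non-dict attrs value as a CSS class filter, so the call searches for th tags with class "td" and finds none: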

    import requests
    from bs4 import BeautifulSoup
    
    result = requests.get(
        "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
    )
    assert result.status_code == 200
    
    src = result.content
    document = BeautifulSoup(src, "lxml")
    
    table = document.find("table")
    assert table.find("th").get_text() == "Rank"
    rows = table.find_all("tr")
    
    for row in rows[1:-1]:
        cells = row.find_all(["th", "td"])       # <--- put th, td in list
        cells_text = [cell.get_text(strip=True) for cell in cells]
        print(cells_text)
    

    Prints:

    ['–', 'World', '7,892,391,000', '100%', '31 Aug 2021', 'UN projection[2]', '']
    ['1', 'China(more)', 'Asia', '1,411,778,724', '17.9%', '1 Nov 2020', '2020 census result[3]', 'The census figure refers tomainland China, excluding itsspecial administrative regionsofHong KongandMacau, the former of which returned to Chinese sovereignty on 1\xa0July 1997 and the latter on 20\xa0December 1999.']
    ['2', 'India(more)', 'Asia', '1,381,310,652', '17.5%', '31 Aug 2021', 'National population clock[4]', 'The figure includes the population of India-administered Kashmir but not of China- or Pakistan-administered Kashmir.']
    ['3', 'United States(more)', 'Americas', '332,282,961', '4.21%', '31 Aug 2021', 'National population clock[5]', 'Includes the50 statesand theDistrict of Columbia, but excludes theU.S. territories.']
    ['4', 'Indonesia(more)', 'Asia', '271,350,000', '3.44%', '31 Dec 2020', 'National annual estimate[6]', '']
    
    ...
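
To see the difference without hitting the network, here is a minimal sketch that runs both calls against a small inline table. The HTML snippet is a hypothetical stand-in for the Wikipedia page (only the two population figures are taken from the output above), and it uses the built-in "html.parser" instead of "lxml" so it has no extra dependency. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table standing in for the Wikipedia page.
html = """
<table>
  <tr><th>Rank</th><th>Country</th><th>Population</th></tr>
  <tr><th>1</th><td>China</td><td>1,411,778,724</td></tr>
  <tr><th>2</th><td>India</td><td>1,381,310,652</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")
data_row = rows[1]  # first row after the header

# Two positional arguments: name=['th'], attrs=['td'].
# A non-dict attrs value is treated as a class filter, so this looks
# for <th> tags whose class is "td" and matches nothing.
broken = data_row.find_all(["th"], ["td"])

# One list argument: match any tag named th or td.
fixed = data_row.find_all(["th", "td"])

# A CSS selector that matches the same cells.
selected = data_row.select("th, td")

broken_text = [cell.get_text(strip=True) for cell in broken]
fixed_text = [cell.get_text(strip=True) for cell in fixed]
selected_text = [cell.get_text(strip=True) for cell in selected]

print(broken_text)    # []
print(fixed_text)     # ['1', 'China', '1,411,778,724']
print(selected_text)  # ['1', 'China', '1,411,778,724']
```

The select("th, td") form is only an alternative spelling of the same query; which one to use is a matter of taste.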