
Python Web Scraping: Extracting the Area of a Region in Wikipedia from the Infobox Geography Vcard


I know this sort of question has been dealt with numerous times, but after combing through answers and guides for hours, I just can't crack this and would be enormously grateful for some help.

Ideally, I want to extract the area in square kilometers as listed in the Infobox on Wikipedia. For example, the code I run on https://en.wikipedia.org/wiki/Sandton should produce something along the lines of "143.54 km".

The code I've put together using numerous guides seems to work only on Wikipedia pages for whole countries, where "Area" is actually a link. Trying this on Spain's Wikipedia page:

from bs4 import BeautifulSoup
import requests

def getAdditionalDetails(URL):
    try:
        soup = BeautifulSoup(requests.get(URL).text, 'lxml')
        table = soup.find('table', {'class': 'infobox geography vcard'})
        additional_details = []
        read_content = False
        for tr in table.find_all('tr'):
            if (tr.get('class') == ['mergedtoprow'] and not read_content):
                link = tr.find('th')
                if (link.get_text().strip() == 'Area'):
                    read_content = True
                if (link.get_text().strip() == 'Population'):
                    read_content = False
            elif ((tr.get('class') == ['mergedrow'] or tr.get('class') == ['mergedbottomrow']) and read_content):
                additional_details.append(tr.find('td').get_text().strip('\n')) 
                if (tr.find('div').get_text().strip() != '•\xa0Total area'):
                    read_content = False
        return additional_details
    except Exception as error:
        print('Error occurred: {}'.format(error))
        return []

URL = "https://en.wikipedia.org/wiki/Spain"
print(getAdditionalDetails(URL))

This outputs the almost usable:

['505,990[6]\xa0km2 (195,360\xa0sq\xa0mi) (51st)']

Can anyone much smarter than I assist?

Thank you.


Solution

  • Not the cleanest way to do this but here goes. If you want a specific row, start with that as the CSS selector.

    Code Example

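    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://en.wikipedia.org/wiki/Sandton'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Row 9 held the "• Total" area figure on this page; the index depends on the page layout.
    area = soup.select('table > tbody > tr')[9].get_text(strip=True)
    # Turn non-breaking spaces into ordinary spaces and keep the part before "(".
    area = area.replace('\xa0', ' ').split('(')[0]
    # Drop the 7-character "• Total" prefix.
    cleaned_area = area[7:]
    print(cleaned_area)
    

    Output

    143.54 km2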
    

    Explanation

    The area variable is built by selecting the table rows with a CSS selector and then taking one specific row by its position.
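Picking a row by position breaks as soon as the infobox gains or loses a row. A minimal sketch of an alternative, assuming the layout the question's code relies on (an "Area" header row followed by a "• Total" row), is to walk the rows and match on the label text instead. The function name find_total_area and the SAMPLE_HTML snippet are made up for illustration; real Wikipedia markup is more involved, so treat this as a starting point rather than a drop-in fix.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a Wikipedia infobox (an assumption, not real page markup).
SAMPLE_HTML = """
<table class="infobox geography vcard">
  <tbody>
    <tr class="mergedtoprow"><th>Area</th></tr>
    <tr class="mergedrow"><th>&#160;&#8226;&#160;Total</th><td>143.54&#160;km<sup>2</sup> (55.42&#160;sq&#160;mi)</td></tr>
    <tr class="mergedtoprow"><th>Population</th></tr>
    <tr class="mergedrow"><th>&#160;&#8226;&#160;Total</th><td>222,415</td></tr>
  </tbody>
</table>
"""

def find_total_area(html):
    """Return the text of the 'Total' cell in the 'Area' section, or None."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table.infobox")
    if table is None:
        return None
    in_area_section = False
    for row in table.find_all("tr"):
        header = row.find("th")
        if header is None:
            continue
        label = header.get_text(" ", strip=True).replace("\xa0", " ")
        cell = row.find("td")
        if cell is None:
            # A header-only row starts a new section such as "Area" or "Population".
            in_area_section = label.startswith("Area")
            continue
        if in_area_section and "Total" in label:
            return cell.get_text().replace("\xa0", " ").strip()
    return None

print(find_total_area(SAMPLE_HTML))
```

Matching on labels means the "Total" row under "Population" is ignored, because the section flag is reset when the "Population" header row is reached.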

    get_text(strip=True) grabs the text and strips whitespace from the start and end of each piece of text it joins together. You should know that \xa0 is the non-breaking space (byte 0xA0 in Latin-1, code point U+00A0 in Unicode). strip=True removes it at the start and end of each piece, but not in the middle.

    The output of the area variable without strip=True looks like this

    '\xa0•\xa0Total143.54\xa0km2 (55.42\xa0sq\xa0mi)'
    

    With strip=True

    '•\xa0Total143.54\xa0km2(55.42\xa0sq\xa0mi)'
    

    So the \xa0 characters inside the string are still there.
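The same behaviour can be checked with plain strings and the standard library. This is only an illustration using the sample string shown above: Python counts \xa0 as whitespace, so str.strip() trims it from the ends but leaves every interior occurrence alone.

```python
import unicodedata

raw = '\xa0•\xa0Total143.54\xa0km2 (55.42\xa0sq\xa0mi)'

# \xa0 is the Unicode NO-BREAK SPACE and counts as whitespace in Python 3.
print(unicodedata.name('\xa0'))   # NO-BREAK SPACE
print('\xa0'.isspace())           # True

# strip() only trims the ends: the leading \xa0 goes, the interior ones stay.
stripped = raw.strip()
print(repr(stripped))
print(stripped.count('\xa0'))     # 4
```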

    Using the string method replace, we can swap each remaining \xa0 for an ordinary space.

    That gives:

    '• Total143.54 km2(55.42 sq mi)'
    

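    The split('(')[0] call keeps only the part before the opening parenthesis, which drops the square-mile figure. Then, because we don't need the first 7 characters ('• Total'), we take everything from the 8th character onwards using string slicing.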
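Slicing off a fixed 7 characters only works while the label stays exactly "• Total". As a hedged alternative, a regular expression can pull out the number in front of "km2" wherever it sits, and it also copes with the footnote marker in the Spain string from the question. The helper name extract_km2 is invented for this sketch.

```python
import re

def extract_km2(text):
    """Return the square-kilometre figure as e.g. '143.54 km2', or None."""
    # Normalise non-breaking spaces, then drop footnote markers such as [6].
    text = text.replace('\xa0', ' ')
    text = re.sub(r'\[[^\]]*\]', '', text)
    # A number (digits, thousands separators, optional decimals) followed by km2 or km².
    match = re.search(r'(\d[\d,]*(?:\.\d+)?)\s*km(?:2|²)', text)
    return f'{match.group(1)} km2' if match else None

print(extract_km2('•\xa0Total143.54\xa0km2(55.42\xa0sq\xa0mi)'))        # 143.54 km2
print(extract_km2('505,990[6]\xa0km2 (195,360\xa0sq\xa0mi) (51st)'))    # 505,990 km2
```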

    Additional Information

    Encoding is a huge topic in Python and in computing in general, and knowing a little about it is important. Encoding exists because everything in a computer is ultimately stored as bytes, whether we like it or not. There has to be a translation between the characters we read and the bytes the hardware stores, and encoding is that step.

    We want to be able to convert characters into bits so that the computer can do something with them when we write code.

    The simplest encoding is ASCII, which you may have already come across. The entire ASCII table has 128 characters, each of which corresponds to a 'code point'.

    ASCII Code Point: 97

    Character: a

    Now you might ask what the point of that is. We can turn characters into code points, which are easily translated to binary, that is, into bits (ones and zeros) for the computer to work with at the hardware level.
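Python's built-ins make the character, code point and bit pattern easy to see. A tiny illustration:

```python
code_point = ord('a')
print(code_point)                  # 97
print(format(code_point, '08b'))   # 01100001
print(chr(97))                     # a

# Encoding to ASCII yields the single byte with that value.
print(list('a'.encode('ascii')))   # [97]
```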

    The problem with ASCII is that human languages have far more than 128 characters. That led to many newer encodings. Today the most common is UTF-8, an encoding of the Unicode character set, and I've provided some resources to learn a little more about it.
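To make that limit concrete, here is a short sketch that encodes the label from the infobox row with different codecs. The variable name text is just for illustration.

```python
text = '•\xa0Total'

# ASCII has no code for the bullet or the non-breaking space.
try:
    text.encode('ascii')
except UnicodeEncodeError as error:
    print('ASCII cannot encode this:', error.reason)

# UTF-8 can encode every Unicode code point, using one to four bytes each.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)                  # b'\xe2\x80\xa2\xc2\xa0Total'

# Latin-1 stores \xa0 in one byte; UTF-8 needs two for the same character.
print('\xa0'.encode('latin-1'))    # b'\xa0'
print('\xa0'.encode('utf-8'))      # b'\xc2\xa0'
```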

    Latin-1 (ISO-8859-1) is the default encoding that HTTP/1.1 (RFC 2616) defined for text responses, and the requests library follows this strictly: when a text response declares no charset, requests decodes it as Latin-1.
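Why the default matters: if bytes that are really UTF-8 get decoded as Latin-1, nothing fails, but the text comes out garbled. The sketch below simulates that with plain bytes and no network call; the sample string is invented. With requests, the encoding in use can be inspected and overridden through response.encoding before reading response.text.

```python
original = '143.54\xa0km²'
wire_bytes = original.encode('utf-8')

# Decoding with the wrong codec does not raise; it silently produces mojibake.
wrong = wire_bytes.decode('latin-1')
right = wire_bytes.decode('utf-8')

print(repr(wrong))
print(repr(right))
print(right == original)   # True
```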

    Some resources:

    The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

    Pragmatic Unicode

    Real Python | Encoding

    What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text