Tags: python, web-scraping, wikipedia

How to scrape data from different Wikipedia pages?


I've scraped the Wikipedia table of Hong Kong districts using Python and BeautifulSoup (https://en.wikipedia.org/wiki/Districts_of_Hong_Kong). But besides the data offered there (i.e. population, area, density and region), I would also like to get the location coordinates for each district. Those have to be fetched from each district's own page (the table contains hyperlinks to them).

Take the first district, 'Central and Western District', as an example: its DMS coordinates (22°17′12″N 114°09′18″E) can be found on its page. By further clicking that coordinate link, I can get the decimal coordinates (22.28666, 114.15497).
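
(For reference, DMS is just a base-60 notation for the same angle, so the two forms are interconvertible; a minimal sketch of that arithmetic, with a hypothetical helper name of my own:)

    def dms_to_decimal(degrees, minutes, seconds, hemisphere):
        # 1 degree = 60 minutes = 3600 seconds
        value = degrees + minutes / 60 + seconds / 3600
        # Southern and western hemispheres carry a negative sign
        return -value if hemisphere in ('S', 'W') else value

    print(dms_to_decimal(22, 17, 12, 'N'))   # 22.2866...  ~ 22.28666
    print(dms_to_decimal(114, 9, 18, 'E'))   # 114.155     ~ 114.15497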

So, is it possible to create a table with Latitude and Longitude for each district?

I'm new to the programming world, so sorry if the question is stupid...

Reference:

DMS coordinates: https://en.wikipedia.org/wiki/Central_and_Western_District

Decimal coordinates: https://tools.wmflabs.org/geohack/geohack.php?pagename=Central_and_Western_District&params=22.28666_N_114.15497_E_type:adm2nd_region:HK


Solution

    import requests
    from bs4 import BeautifulSoup

    res = requests.get('https://en.wikipedia.org/wiki/Districts_of_Hong_Kong')
    soup = BeautifulSoup(res.content, 'lxml')
    result = {}

    # The first wikitable on the page is the overview table of the 18 districts
    table = soup.find_all('table', {'class': 'wikitable'})[0]

    for link in table.find_all('a', href=True):
        # Keep only links whose text matches their title attribute, i.e. the
        # district links themselves rather than footnotes or flag icons
        district = link.attrs.get('title', '')
        if not district:
            continue
        if link.getText() not in district and district not in link.getText():
            continue
        try:
            res = requests.get('https://en.wikipedia.org{}'.format(link.attrs['href']))
        except requests.RequestException:
            continue
        soup = BeautifulSoup(res.content, 'lxml')
        # Each district page carries an infobox with the coordinates
        infobox = soup.find('table', {'class': 'infobox geography vcard'})
        if infobox is None:
            continue
        for row in infobox.find_all('tr', {'class': 'mergedbottomrow'}):
            geo = row.find('span', {'class': 'geo'})
            if geo is None:
                continue
            # The hidden 'geo' span holds "latitude; longitude" in decimal form
            latitude, longitude = geo.getText().split('; ')
            result[district] = {'Latitude': latitude, 'Longitude': longitude}

    print(result)

    

    Result:

    {'Central and Western District': {'Latitude': '22.28666', 'Longitude': '114.15497'},
     'Eastern District, Hong Kong': {'Latitude': '22.28411', 'Longitude': '114.22414'},
     'Southern District, Hong Kong': {'Latitude': '22.24725', 'Longitude': '114.15884'},
     'Wan Chai District': {'Latitude': '22.27968', 'Longitude': '114.17168'},
     'Sham Shui Po District': {'Latitude': '22.33074', 'Longitude': '114.16220'},
     'Kowloon City District': {'Latitude': '22.32820', 'Longitude': '114.19155'},
     'Kwun Tong District': {'Latitude': '22.31326', 'Longitude': '114.22581'},
     'Wong Tai Sin District': {'Latitude': '22.33353', 'Longitude': '114.19686'},
     'Yau Tsim Mong District': {'Latitude': '22.32138', 'Longitude': '114.17260'},
     'Islands District, Hong Kong': {'Latitude': '22.26114', 'Longitude': '113.94608'},
     'Kwai Tsing District': {'Latitude': '22.35488', 'Longitude': '114.08401'},
     'North District, Hong Kong': {'Latitude': '22.49471', 'Longitude': '114.13812'},
     'Sai Kung District': {'Latitude': '22.38143', 'Longitude': '114.27052'},
     'Sha Tin District': {'Latitude': '22.38715', 'Longitude': '114.19534'},
     'Tai Po District': {'Latitude': '22.45085', 'Longitude': '114.16422'},
     'Tsuen Wan District': {'Latitude': '22.36281', 'Longitude': '114.12907'},
     'Tuen Mun District': {'Latitude': '22.39163', 'Longitude': '113.9770885'},
     'Yuen Long District': {'Latitude': '22.44559', 'Longitude': '114.02218'}}
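
    Since the question asked for a table of latitude and longitude per district, the result dict maps straight onto a pandas DataFrame; a minimal sketch, assuming pandas is installed (the result variable is the one built above):

    import pandas as pd

    # Districts become the index; 'Latitude' and 'Longitude' become columns
    df = pd.DataFrame.from_dict(result, orient='index')
    print(df)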