Tags: python, pandas, beautifulsoup, wikipedia

bs4 approach to a Wikipedia page: getting the infobox


I am currently trying to apply a bs4 approach to a Wikipedia page, but the results are not stored in a DataFrame (df).

Scraping Wikipedia is a very common technique, and one suitable approach can be reused for many different jobs. Still, I had some issues getting the results back and storing them in a df.

As an example of a very common Wikipedia/bs4 job, take this one:

This page lists more than 600 results, each on its own sub-page: url = "https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_in_Deutschland#Liste_der_St%C3%A4dte_in_Deutschland"

For a first experimental script I proceed as follows: first I scrape the table from the Wikipedia page, and afterwards I convert it into a pandas DataFrame. The necessary packages are requests, beautifulsoup4, and pandas; they can be installed with pip:

pip install requests beautifulsoup4 pandas

The script then looks like this:

import pandas as pd

# URL of the Wikipedia page
url = "https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_in_Deutschland#Liste_der_St%C3%A4dte_in_Deutschland"
table = pd.read_html(url, extract_links='all')[1]
base_url = 'https://de.wikipedia.org'
table = table.apply(lambda col: [v[0] if v[1] is None else f'{base_url}{v[1]}' for v in  col])


links = list(table.iloc[:,0])

for link in links:
    print('\n',link)
    try:
        df = pd.read_html(link)[0]
        print(df)
    except Exception as e:
        print(e)

Look at what I get back: only two records instead of hundreds. By the way, I guess the best way would be to collect everything in a df and/or store it.

Document is empty

 https://de.wikipedia.org/wiki/Aach_(Hegau)
                                       Wappen  \
0                                         NaN   
1                                         NaN   
2                                  Basisdaten   
3                                Koordinaten:   
4                                 Bundesland:   
5                           Regierungsbezirk:   
6                                  Landkreis:   
7                                       Höhe:   
8                                     Fläche:   
9                                  Einwohner:   
10                        Bevölkerungsdichte:   
11                              Postleitzahl:   
12                                   Vorwahl:   
13                           Kfz-Kennzeichen:   
14                         Gemeindeschlüssel:   
15                                    LOCODE:   
16              Adresse der  Stadtverwaltung:   
17                                   Website:   
18                             Bürgermeister:   
19  Lage der Stadt Aach im Landkreis Konstanz   
20                                      Karte   

                                     Deutschlandkarte  
0                                                 NaN  
1                                                 NaN  
2                                          Basisdaten  
3   47° 51′ N, 8° 51′ OKoordinaten: 47° 51′ N, 8° ...  
4                                   Baden-Württemberg  
5                                            Freiburg  
6                                            Konstanz  
7                                        545 m ü. NHN  
8                                           10,68 km2  
9                             2384 (31. Dez. 2022)[1]  
10                               223 Einwohner je km2  
11                                              78267  
12                                              07774  
13                                            KN, STO  
14                                        08 3 35 001  
15                                             DE AAC  
16                         Hauptstraße 16  78267 Aach  
17                                        www.aach.de  
18                                     Manfred Ossola  
19          Lage der Stadt Aach im Landkreis Konstanz  
20                                              Karte 
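
A frame shaped like the one above has the labels in its first column and the values in its second, so it could be folded into one record per city and the records collected into a single df. The following is only a minimal sketch under that assumption: the helper name infobox_to_record and the small hand-built sample frame are illustrative and not part of the original script, and real pages may need extra handling.

```python
import pandas as pd


def infobox_to_record(box: pd.DataFrame) -> dict:
    """Fold a two-column infobox frame into one dict (hypothetical helper)."""
    record = {}
    for label, value in zip(box.iloc[:, 0], box.iloc[:, 1]):
        # Keep only rows whose first cell looks like 'Label:'.
        # Header rows such as 'Basisdaten' and empty cells are skipped.
        if isinstance(label, str) and label.strip().endswith(':'):
            record[label.strip().rstrip(':')] = value
    return record


# Hand-built stand-in for what pd.read_html(link)[0] returned above
box = pd.DataFrame({
    'Wappen': [None, 'Basisdaten', 'Bundesland:', 'Landkreis:', 'Postleitzahl:', 'Karte'],
    'Deutschlandkarte': [None, 'Basisdaten', 'Baden-Württemberg', 'Konstanz', '78267', 'Karte'],
})

records = [infobox_to_record(box)]   # one dict per city page
df = pd.DataFrame(records)           # one row per city
print(df)
```

Appending one such dict per link inside the loop, and building the DataFrame once at the end, would give one row per city instead of one printed table per page.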

Note: the list contains several hundred records.

The data I want to fetch is in the infobox of each city page:

[image: infobox of a city page]

Update: the aim is to get the full results stored in a df, containing all the data from the infobox (see the image above), including the contact information etc.

Update 2:

The overview page: https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_in_Deutschland#Liste_der_St%C3%A4dte_in_Deutschland

It leads to approximately 1000 sub-pages, such as the following:

Aach (Hegau): https://de.wikipedia.org/wiki/Aach_(Hegau)
Aachen: https://de.wikipedia.org/wiki/Aachen
Aalen: https://de.wikipedia.org/wiki/Aalen

An example of the so-called "infobox", taken from Babenhausen (Hessen): https://de.wikipedia.org/wiki/Babenhausen_(Hessen)

+----------------------+--------------------------------------------------------------+
| Field                | Value                                                        |
+----------------------+--------------------------------------------------------------+
| Koordinaten:         | 49° 58′ N, 8° 57′ O                                          |
| Bundesland:          | Hessen                                                       |
| Regierungsbezirk:    | Darmstadt                                                    |
| Landkreis:           | Darmstadt-Dieburg                                            |
| Höhe:                | 124 m ü. NHN                                                 |
| Fläche:              | 66,85 km2                                                    |
| Einwohner:           | 17.579 (31. Dez. 2023)[1]                                    |
| Bevölkerungsdichte:  | 263 Einwohner je km2                                         |
| Postleitzahl:        | 64832                                                        |
| Vorwahl:             | 06073                                                        |
| Kfz-Kennzeichen:     | DA, DI                                                       |
| Gemeindeschlüssel:   | 06 4 32 002                                                  |
| Stadtgliederung:     | 6 Stadtteile                                                 |
| Adresse der          | Rathaus, Marktplatz 2, 64832 Babenhausen                     |
| Stadtverwaltung:     |                                                              |
| Website:             | www.babenhausen.de                                           |
| Bürgermeister:       | Dominik Stadler (parteilos)                                  |
+----------------------+--------------------------------------------------------------+

https://de.wikipedia.org/wiki/Bacharach
https://de.wikipedia.org/wiki/Backnang

Update 3: if I run this code to fetch 300 records, it works well; if I run it to fetch 2400, it fails:

import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_info(city_url: str) -> dict:
    info_data = {}
    response = requests.get(city_url)
    soup = BeautifulSoup(response.text, 'lxml')
    for x in soup.find('tbody').find_all(
            lambda tag: tag.name == 'tr' and tag.get('class') == ['hintergrundfarbe-basis']):
        if not x.get('style'):
            if 'Koordinaten' in x.get_text():
                info_data['Koordinaten'] = x.findNext('span', class_='coordinates').get_text()
            else:
                info_data[x.get_text(strip=True).split(':')[0]] = x.get_text(strip=True).split(':')[-1]
                info_data['Web site'] = soup.find('a', {'title':'Website'}).findNext('a').get('href')
    return info_data


cities = []
response = requests.get('https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_in_Deutschland#Liste_der_St%C3%A4dte_in_Deutschland')
soup = BeautifulSoup(response.text, 'lxml')
for city in soup.find_all('dd'):  # add a slice such as [:2500] to limit the run
    city_url = 'https://de.wikipedia.org' + city.findNext('a').get('href')
    result = {'City': city.get_text(), 'URL': 'https://de.wikipedia.org' + city.findNext('a').get('href')}
    result |= get_info(city_url)
    cities.append(result)
df = pd.DataFrame(cities)
print(df.to_string())


------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-28-4391c852fd75> in <cell line: 24>()
     25     city_url = 'https://de.wikipedia.org' + city.findNext('a').get('href')
     26     result = {'City': city.get_text(), 'URL': 'https://de.wikipedia.org' + city.findNext('a').get('href')}
---> 27     result |= get_info(city_url)
     28     cities.append(result)
     29 df = pd.DataFrame(cities)

<ipython-input-28-4391c852fd75> in get_info(city_url)
     15             else:
     16                 info_data[x.get_text(strip=True).split(':')[0]] = x.get_text(strip=True).split(':')[-1]
---> 17                 info_data['Web site'] = soup.find('a', {'title':'Website'}).findNext('a').get('href')
     18     return info_data
     19 

AttributeError: 'NoneType' object has no attribute 'findNext'
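
The last frame of this traceback can be reproduced without any network access: soup.find() returns None when no matching tag exists, and calling a method on None raises exactly this kind of AttributeError. The sketch below uses two hand-written HTML snippets as stand-ins for city pages (they are assumptions for illustration, not real Wikipedia markup), and the helper name website_of is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical snippets: one page with a 'Website' label link, one without
with_link = '<a title="Website">Website</a> <a href="https://www.aach.de">www.aach.de</a>'
without_link = '<p>no website row on this page</p>'


def website_of(html: str):
    """Return the href that follows the 'Website' label, or None if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    label = soup.find('a', {'title': 'Website'})
    if label is None:            # nothing matched, so there is nothing to follow
        return None
    target = label.find_next('a')
    return target.get('href') if target is not None else None


print(website_of(with_link))     # https://www.aach.de
print(website_of(without_link))  # None

# The unguarded chain from the traceback fails on the second snippet
try:
    BeautifulSoup(without_link, 'html.parser').find('a', {'title': 'Website'}).findNext('a')
except AttributeError as exc:
    error = str(exc)
    print(error)
```

So a run over a few hundred pages can succeed simply because none of those pages happens to lack the link, while a longer run eventually reaches one that does.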

Solution

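  • The error occurs because not every city page has a link with title="Website": on such pages soup.find('a', {'title': 'Website'}) returns None, and calling findNext on None raises the AttributeError. The version below reads the website only when that link exists, and does so once per page, after the row loop. Every city on the overview page sits in a dd tag, so find_all('dd') yields the name and URL of each one; the script then visits every URL and reads its infobox table. The output shown was produced with the loop limited to the first 5 cities (a [:5] slice on find_all('dd')); without the slice, all cities are fetched.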

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    
    def get_info(city_url: str) -> dict:
        """Collect the infobox fields of one city page into a dict."""
        info_data = {}
        response = requests.get(city_url)
        soup = BeautifulSoup(response.text, 'lxml')
        # Keep only <tr> rows whose class list is exactly ['hintergrundfarbe-basis']
        for x in soup.find('tbody').find_all(
                lambda tag: tag.name == 'tr' and tag.get('class') == ['hintergrundfarbe-basis']):
            if not x.get('style'):
                if 'Koordinaten' in x.get_text():
                    info_data['Koordinaten'] = x.findNext('span', class_='coordinates').get_text()
                else:
                    # Key: text before the first colon; value: text after the last colon
                    info_data[x.get_text(strip=True).split(':')[0]] = x.get_text(strip=True).split(':')[-1]
        # Not every page has a website link, so read it only when it exists
        website = soup.find('a', {'title': 'Website'})
        if website:
            info_data['Web site'] = website.findNext('a').get('href')
        return info_data
    
    
    cities = []
    response = requests.get('https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_in_Deutschland#Liste_der_St%C3%A4dte_in_Deutschland')
    soup = BeautifulSoup(response.text, 'lxml')
    for city in soup.find_all('dd'):
        city_url = 'https://de.wikipedia.org' + city.findNext('a').get('href')
        result = {'City': city.get_text(), 'URL': city_url}
        result |= get_info(city_url)  # dict merge operator, Python 3.9+
        cities.append(result)
    df = pd.DataFrame(cities)
    print(df.to_string())
    

    OUTPUT:

                 City                                         URL           Koordinaten           Bundesland Regierungsbezirk            Landkreis         Höhe      Fläche                  Einwohner     Bevölkerungsdichte Postleitzahl Vorwahl       Kfz-Kennzeichen Gemeindeschlüssel   Adresse derStadtverwaltung   Bürgermeister      Postleitzahlen                  Vorwahlen             Stadtgliederung        Oberbürgermeisterin        Oberbürgermeister      Erste Bürgermeisterin Erster Bürgermeister
    0       Aach (BW)  https://de.wikipedia.org/wiki/Aach_(Hegau)   47° 51′ N, 8° 51′ O    Baden-Württemberg         Freiburg             Konstanz  545 m ü.NHN   10,68 km2     2384(31. Dez. 2022)[1]   223 Einwohner je km2        78267   07774                KN,STO       08 3 35 001     Hauptstraße 1678267 Aach  Manfred Ossola                 NaN                        NaN                         NaN                        NaN                      NaN                        NaN                  NaN
    1     Aachen (NW)        https://de.wikipedia.org/wiki/Aachen    50° 47′ N, 6° 5′ O  Nordrhein-Westfalen             Köln  Städteregion Aachen  175 m ü.NHN  160,85 km2  252.769(31. Dez. 2023)[1]  1571 Einwohner je km2          NaN     NaN               AC, MON       05 3 34 002            Markt52062 Aachen             NaN         52062–52080  0241, 02405, 02407, 02408               7Stadtbezirke  Sibylle Keupen(parteilos)                      NaN                        NaN                  NaN
    2      Aalen (BW)         https://de.wikipedia.org/wiki/Aalen   48° 50′ N, 10° 6′ O    Baden-Württemberg        Stuttgart          Ostalbkreis  430 m ü.NHN  146,58 km2   68.816(31. Dez. 2022)[1]   469 Einwohner je km2          NaN     NaN                AA, GD       08 1 36 088                          NaN             NaN  73430–73434, 73453        07361, 07366, 07367  Kernstadtund 8Stadtbezirke                        NaN  Frederick Brütting(SPD)                        NaN                  NaN
    3   Abenberg (BY)      https://de.wikipedia.org/wiki/Abenberg  49° 15′ N, 10° 58′ O               Bayern    Mittelfranken                 Roth  414 m ü.NHN   48,41 km2     5614(31. Dez. 2023)[1]   116 Einwohner je km2        91183   09178               RH, HIP       09 5 76 111  Stillaplatz 191183 Abenberg             NaN                 NaN                        NaN             14Gemeindeteile                        NaN                      NaN  Susanne König (parteilos)                  NaN
    4  Abensberg (BY)     https://de.wikipedia.org/wiki/Abensberg  48° 49′ N, 11° 51′ O               Bayern     Niederbayern              Kelheim  370 m ü.NHN   60,26 km2   14.685(31. Dez. 2023)[1]   244 Einwohner je km2        93326   09443  KEH,MAI,PAR, RID,ROL       09 2 73 111  Stadtplatz 193326 Abensberg             NaN                 NaN                        NaN             22Gemeindeteile                        NaN                      NaN                        NaN    Bernhard Resch[2]
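
In the output above some values lose their spaces ("Adresse derStadtverwaltung", "Hauptstraße 1678267 Aach"), because get_text(strip=True) joins the text pieces of a row without a separator. A possible refinement, sketched here on a hand-written HTML fragment (an assumption for illustration, not real Wikipedia markup), is to pass a separator to get_text and to split each row at its first colon with str.partition. The function name parse_rows is hypothetical, and the class_ filter used here matches any row that carries the class, which is slightly looser than the exact-match lambda in the script.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like two infobox rows
html = (
    '<table><tbody>'
    '<tr class="hintergrundfarbe-basis"><td>Adresse der<br>Stadtverwaltung:</td>'
    '<td>Hauptstraße 16<br>78267 Aach</td></tr>'
    '<tr class="hintergrundfarbe-basis"><td>Höhe:</td>'
    '<td>545 m ü. <a>NHN</a></td></tr>'
    '</tbody></table>'
)


def parse_rows(markup: str) -> dict:
    """Map 'Label: value' rows to a dict, keeping spaces between text pieces."""
    soup = BeautifulSoup(markup, 'html.parser')
    data = {}
    for row in soup.find_all('tr', class_='hintergrundfarbe-basis'):
        # A one-space separator keeps neighbouring text nodes apart
        text = row.get_text(' ', strip=True)
        # partition splits at the first colon only, so colons in the value survive
        label, sep, value = text.partition(':')
        if sep:  # skip rows that contain no colon at all
            data[label.strip()] = value.strip()
    return data


rows = parse_rows(html)
print(rows)
```

Note that partition keeps everything after the first colon, whereas split(':')[-1] keeps only what follows the last one; which of the two is preferable depends on the rows being parsed.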