python, pandas, web-scraping, beautifulsoup, python-requests

How to scrape links from the summary section / link list of a Wikipedia page?


update: many thanks for the replies, the help and all the efforts! I have added some additional notes below (at the end).

Howdy, I am trying to scrape all the links of a large Wikipedia page, the "Liste der Städte und Gemeinden in Bayern" (list of towns and municipalities in Bavaria), using Python. The trouble is that I cannot figure out how to export all of the links containing "/wiki/" to my CSV file. I am somewhat used to Python, but some things are still kind of foreign to me. Any ideas? Here is what I have so far...

the page: https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern#A

from bs4 import BeautifulSoup as bs
import requests

res = requests.get("https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern#A")
soup = bs(res.text, "html.parser")
gemeinden_in_bayern = {}
for link in soup.find_all("a"):
    url = link.get("href", "")
    if "/wiki/" in url:
        gemeinden_in_bayern[link.text.strip()] = url

print(gemeinden_in_bayern)

The results do not look very specific; the output is cluttered with navigation and meta links:

  nt': 'https://foundation.wikimedia.org/wiki/Special:MyLanguage/Policy:Cookie_statement'}

What is really aimed at is to get the list like so:

https://de.wikipedia.org/wiki/Abenberg
https://de.wikipedia.org/wiki/Abensberg
https://de.wikipedia.org/wiki/Absberg
https://de.wikipedia.org/wiki/Abtswind
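A minimal sketch of how the raw href scan could be narrowed to exactly such article links and exported to CSV; the colon check for excluding namespace pages (Spezial:, Datei:, ...) is an assumption, and links.csv is a made-up filename:

```python
import csv

def article_links(hrefs):
    """Keep only plain article links: they start with /wiki/ and contain
    no colon (which filters Spezial:, Datei:, Hilfe: and other namespaces)."""
    return ["https://de.wikipedia.org" + h
            for h in hrefs
            if h.startswith("/wiki/") and ":" not in h]

# stand-in hrefs; in real use these come from link.get("href", "")
hrefs = ["/wiki/Abenberg", "/wiki/Spezial:Suche", "#A", "/wiki/Abensberg"]
links = article_links(hrefs)

# write one URL per row into the CSV
with open("links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for url in links:
        writer.writerow([url])
```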

BTW, on a side note: the above-mentioned subpages have information in the infobox, which I am able to gather. See an example:

import pandas

urlpage = 'https://de.wikipedia.org/wiki/Abenberg'
data = pandas.read_html(urlpage)[0]   # the first table on the page is the infobox
null = data.isnull()

for x in range(len(data)):
    first = data.iloc[x, 0]                                  # label column
    second = data.iloc[x, 1] if not null.iloc[x, 1] else ""  # value column, "" if empty
    print(first, second, "\n")

which runs perfectly; see the output:

Basisdaten Basisdaten 
Koordinaten: 49° 15′ N, 10° 58′ OKoordinaten: 49° 15′ N, 10° 58′ O 
Bundesland: Bayern 
Regierungsbezirk: Mittelfranken 
Landkreis: Roth 
Höhe: 414 m ü. NHN 
Fläche: 48,41 km2 
Einwohner: 5607 (31. Dez. 2022)[1] 
Bevölkerungsdichte: 116 Einwohner je km2 
Postleitzahl: 91183 
Vorwahl: 09178 
Kfz-Kennzeichen: RH, HIP 
Gemeindeschlüssel: 09 5 76 111 
LOCODE: ABR 
Stadtgliederung: 14 Gemeindeteile 
Adresse der  Stadtverwaltung: Stillaplatz 1  91183 Abenberg 
Website: www.abenberg.de 
Erste Bürgermeisterin: Susanne König (parteilos) 
Lage der Stadt Abenberg im Landkreis Roth Lage der Stadt Abenberg im Landkreis Roth 

And that said, I found out that the infobox is a typical wiki component. So if I get familiar with this part, I will have learned a lot for future tasks, not only for me but for the many others diving into the topic of scraping wiki pages. So this might be a general task, helpful and packed with lots of information for many others too.
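For example, once read_html has returned the infobox as a two-column table, the label/value rows can be collapsed into a plain dict. A small sketch with hard-coded stand-in rows (the real data would come from pandas.read_html as above):

```python
import pandas as pd

# stand-in for pandas.read_html(urlpage)[0]: a label column and a value column
data = pd.DataFrame([
    ["Bundesland:", "Bayern"],
    ["Landkreis:", "Roth"],
    ["Postleitzahl:", "91183"],
])

# map labels to values, dropping the trailing colon and rows without a value
infobox = {
    str(row[0]).rstrip(":"): row[1]
    for row in data.itertuples(index=False)
    if pd.notna(row[1])
}
print(infobox)
```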

So far so good: I have a list page that leads to quite a lot of infoboxes: https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern#A

I think it's worth traversing them and fetching each infobox. The information I am looking for could be gathered with Python code that iterates over all the findings:

https://de.wikipedia.org/wiki/Abenberg
https://de.wikipedia.org/wiki/Abensberg
https://de.wikipedia.org/wiki/Absberg
https://de.wikipedia.org/wiki/Abtswind

...and so on and so forth. Note: with that list I would be able to run my above-mentioned scraper, which fetches the data of one infobox, over every page.
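That traversal could be sketched like so; the fetch function is injected so any one-page scraper (for example lambda u: pandas.read_html(u)[0]) can be plugged in, and the delay keeps the crawl polite. The helper name scrape_all is made up:

```python
import time

def scrape_all(urls, fetch, delay=1.0):
    """Run a one-page scraper `fetch` over every URL, pausing
    between requests so the server is not hammered."""
    results = {}
    for url in urls:
        results[url] = fetch(url)
        time.sleep(delay)
    return results

# real use (network): scrape_all(links, lambda u: pandas.read_html(u)[0])
# demo with a stand-in fetcher, no network involved:
demo = scrape_all(["https://de.wikipedia.org/wiki/Abenberg"],
                  lambda u: {"url": u}, delay=0.0)
print(demo)
```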

update

again hello dear HedgeHog, hello dear Salman Khan,

first of all, many many thanks for the quick help and your awesome support. Glad that you set me straight; I am very, very glad. BTW: now we have all the links of the large Wikipedia page "Liste der Städte und Gemeinden in Bayern".

I would love to go ahead and work on the extraction of the infobox, which BTW would be a general task that might be interesting for many users on Stack Overflow. Conclusio: see the main page https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern and a subpage with an infobox: https://de.wikipedia.org/wiki/Abenberg

And how I gather the data: with the same pandas snippet and the same output as already shown above.

What is aimed at is to gather all the data of the infoboxes from all the pages:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def fetch_city_links(list_url):
    response = requests.get(list_url, timeout=30)
    if response.status_code != 200:
        print(f"Failed to retrieve the page: {list_url}")
        return []

    soup = BeautifulSoup(response.content, 'html.parser')
    # the town names sit in <a> tags inside <li> items under div.column-multiple
    divs = soup.find_all('div', class_='column-multiple')
    href_list = []

    for div in divs:
        li_items = div.find_all('li')
        for li in li_items:
            a_tags = li.find_all('a', href=True)
            href_list.extend(['https://de.wikipedia.org' + a['href'] for a in a_tags])

    return href_list

def scrape_infobox(url):
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.content, 'html.parser')
    infobox = soup.find('table', {'class': 'infobox'})

    if not infobox:
        print(f"No infobox found on this page: {url}")
        return None

    # each infobox row is a <tr> with a <th> label and a <td> value
    data = {}
    for row in infobox.find_all('tr'):
        header = row.find('th')
        value = row.find('td')
        if header and value:
            data[header.get_text(" ", strip=True)] = value.get_text(" ", strip=True)

    return data

def main():
    list_url = 'https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern'
    city_links = fetch_city_links(list_url)

    all_data = []
    for link in city_links:
        print(f"Scraping {link}")
        infobox_data = scrape_infobox(link)
        if infobox_data:
            infobox_data['URL'] = link
            all_data.append(infobox_data)

    df = pd.DataFrame(all_data)
    df.to_csv('wikipedia_infoboxes.csv', index=False)

if __name__ == "__main__":
    main()
    
    
    

Well, I thought that this function orchestrates the process: it fetches the city links, scrapes the infobox data for each city, and stores the collected data in a pandas DataFrame. Finally, it saves the DataFrame to a CSV file.
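Assuming main() ran through, the resulting CSV can be loaded back for a sanity check. A sketch with an inline stand-in for the file (the real call would simply be pd.read_csv('wikipedia_infoboxes.csv'); the column names depend on which infobox headers were actually found):

```python
import io
import pandas as pd

# stand-in for the file written by main(); in real use:
# df = pd.read_csv("wikipedia_infoboxes.csv")
csv_text = "Bundesland:,Landkreis:,URL\nBayern,Roth,https://de.wikipedia.org/wiki/Abenberg\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (rows = towns with an infobox, cols = distinct fields + URL)
print(df["URL"].head())  # the source page of each row
```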

BTW: I hope that this does not derail the thread. I hope this extended question is okay here; if not, I can open a new thread! Thanks for everything.


Solution

  • Your selector is wrong.

    The names of the towns are in a tags inside li tags, which in turn sit under a div with class column-multiple.

    First, get all divs with class column-multiple, then get all the li items from those divs, and then read the href attribute of every a tag inside.

    import requests
    from bs4 import BeautifulSoup

    url = "https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern"
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # find all the div elements with class column-multiple
        divs = soup.find_all('div', class_='column-multiple')
        href_list = []
        for div in divs:
            # find all li elements within the div.column-multiple
            li_items = div.find_all('li')
            for li in li_items:
                # now get the href of all <a> tags in the li items
                a_tags = li.find_all('a', href=True)
                href_list.extend([a['href'] for a in a_tags])
        for href in href_list:
            print(f"https://de.wikipedia.org{href}")
    

    It will print what you want:

    https://de.wikipedia.org/wiki/Amberg
    https://de.wikipedia.org/wiki/Ansbach
    https://de.wikipedia.org/wiki/Aschaffenburg
    https://de.wikipedia.org/wiki/Augsburg
    https://de.wikipedia.org/wiki/Bamberg
    .
    .
    .
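As a side note, the same selection can be written more compactly with a CSS selector via soup.select; a sketch on a small inline HTML snippet standing in for the real page:

```python
from bs4 import BeautifulSoup

html = """
<div class="column-multiple">
  <ul>
    <li><a href="/wiki/Abenberg">Abenberg</a></li>
    <li><a href="/wiki/Abensberg">Abensberg</a></li>
  </ul>
</div>
<div class="navbox"><a href="/wiki/Portal:Bayern">Portal</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# one CSS selector replaces the nested find_all loops:
# only <a> tags with an href, inside <li> items, under div.column-multiple
links = ["https://de.wikipedia.org" + a["href"]
         for a in soup.select("div.column-multiple li a[href]")]
print(links)
```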