How to scrape links from the summary section / link list of a Wikipedia page?

Update: many thanks for the replies, the help and all the efforts! I have added some additional notes below (at the end).

Howdy, I am trying to scrape all the links of a large Wikipedia page, the "List of Towns and Gemeinden in Bayern", using Python. The trouble is that I cannot figure out how to export all of the links containing "/wiki/" to my CSV file. I am somewhat used to Python, but some things are still kind of foreign to me. Any ideas? Here is what I have so far...

the page:

from bs4 import BeautifulSoup as bs
import requests

res = requests.get("")
soup = bs(res.text, "html.parser")
gemeinden_in_bayern = {}
for link in soup.find_all("a"):
    url = link.get("href", "")
    if "/wiki/" in url:
        gemeinden_in_bayern[link.text.strip()] = url
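
For the CSV export itself, a minimal sketch using the standard-library csv module may help. It assumes a dict shaped like gemeinden_in_bayern above (link text mapped to href); the sample entries and the helper name links_to_csv are made up for illustration, and the output goes to an in-memory buffer so the snippet runs without touching the network or the disk.

```python
import csv
import io


def links_to_csv(links, out):
    """Write a {link text: href} mapping as two-column CSV rows to `out`."""
    writer = csv.writer(out)
    writer.writerow(["name", "url"])  # header row
    for name, url in links.items():
        writer.writerow([name, url])


# Hypothetical sample data standing in for the scraped dict.
sample_links = {
    "Abenberg": "/wiki/Abenberg",
    "Roth": "/wiki/Roth",
}

buffer = io.StringIO()
links_to_csv(sample_links, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, a file object opened with newline="" and encoding="utf-8" can be passed in place of the buffer.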


The results do not look very specific:

  nt': ''}

What I am really aiming for is to get the list like so:

BTW, on a side note: the subpages mentioned above have information in the infobox, which I am able to gather. See an example:

import pandas
urlpage =  ''
data = pandas.read_html(urlpage)[0]
null = data.isnull()

for x in range(len(data)):
    # Positional access to the first and second column of row x
    first = data.iloc[x, 0]
    second = data.iloc[x, 1] if not null.iloc[x, 1] else ""
    print(first, second)

This runs perfectly; see the output:

Basisdaten Basisdaten 
Koordinaten: 49° 15′ N, 10° 58′ OKoordinaten: 49° 15′ N, 10° 58′ O 
Bundesland: Bayern 
Regierungsbezirk: Mittelfranken 
Landkreis: Roth 
Höhe: 414 m ü. NHN 
Fläche: 48,41 km2 
Einwohner: 5607 (31. Dez. 2022)[1] 
Bevölkerungsdichte: 116 Einwohner je km2 
Postleitzahl: 91183 
Vorwahl: 09178 
Kfz-Kennzeichen: RH, HIP 
Gemeindeschlüssel: 09 5 76 111 
Stadtgliederung: 14 Gemeindeteile 
Adresse der  Stadtverwaltung: Stillaplatz 1  91183 Abenberg 
Erste Bürgermeisterin: Susanne König (parteilos) 
Lage der Stadt Abenberg im Landkreis Roth Lage der Stadt Abenberg im Landkreis Roth 
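
In the output above, some rows repeat the same text in both columns (for example "Basisdaten Basisdaten"); these look like header or caption cells that span the table. A rough sketch of how the (first, second) pairs could be turned into a dict while skipping such rows follows. It uses plain string tuples copied from the output instead of the DataFrame, and the helper name rows_to_dict is an assumption for illustration.

```python
def rows_to_dict(rows):
    """Build a {label: value} dict from (first, second) string pairs.

    Rows whose two cells are identical are treated as spanning
    header/caption rows and skipped; a trailing colon on the label
    is removed.
    """
    result = {}
    for first, second in rows:
        if first == second:
            continue  # e.g. ("Basisdaten", "Basisdaten")
        result[first.rstrip(":").strip()] = second.strip()
    return result


# A few pairs taken from the output shown above.
sample_rows = [
    ("Basisdaten", "Basisdaten"),
    ("Bundesland:", "Bayern"),
    ("Landkreis:", "Roth"),
    ("Postleitzahl:", "91183"),
]

infobox = rows_to_dict(sample_rows)
print(infobox)
```

With real DataFrame rows, empty cells would need to be converted to strings first, as the loop above already does with the isnull check.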

That said, I found out that the infobox is a typical part of wiki pages. So if I get familiar with it, I will have learned a lot for future tasks, not only for me but for the many others who are diving into the topic of scraping wiki pages. So this might be a general task, helpful and packed with information for others too.

So far so good: I have a list of pages that lead to quite a lot of infoboxes.

I think it is worth traversing them and fetching each infobox. The information could be collected with Python code that iterates over all the links found.

...and so on and so forth. Note: with that, I would be able to apply my scraper mentioned above, which can fetch the data of one infobox, to every page.


Hello again, dear HedgeHog, hello dear Salman Khan,

First of all, many thanks for the quick help and your awesome support. I am very glad that you set me straight. Now we have all the links of the large Wikipedia page, the "List of Towns and Gemeinden in Bayern".

I would love to go ahead and work on the extraction of the infobox, which would be a general task that might be interesting for many users on Stack Overflow. To recap: there is the main page with the list and the subpages with the infobox, and the pandas snippet shown earlier gathers the data of one infobox and runs perfectly.

The aim is to gather all the data of the infoboxes from all the pages.

import requests
from bs4 import BeautifulSoup
import pandas as pd

def fetch_city_links(list_url):
    response = requests.get(list_url)
    if response.status_code != 200:
        print(f"Failed to retrieve the page: {list_url}")
        return []

    soup = BeautifulSoup(response.content, 'html.parser')
    divs = soup.find_all('div', class_='column-multiple')
    href_list = []

    for div in divs:
        li_items = div.find_all('li')
        for li in li_items:
            a_tags = li.find_all('a', href=True)
            href_list.extend(['' + a['href'] for a in a_tags])

    return href_list

def scrape_infobox(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    infobox = soup.find('table', {'class': 'infobox'})

    if not infobox:
        print(f"No infobox found on this page: {url}")
        return None

    data = {}
    for row in infobox.find_all('tr'):
        header = row.find('th')
        value = row.find('td')
        if header and value:
            data[header.get_text(" ", strip=True)] = value.get_text(" ", strip=True)

    return data

def main():
    list_url = ''
    city_links = fetch_city_links(list_url)

    all_data = []
    for link in city_links:
        print(f"Scraping {link}")
        infobox_data = scrape_infobox(link)
        if infobox_data:
            infobox_data['URL'] = link
            # Keep the row; without this the DataFrame and CSV stay empty
            all_data.append(infobox_data)

    df = pd.DataFrame(all_data)
    df.to_csv('wikipedia_infoboxes.csv', index=False)

if __name__ == "__main__":
    main()

I thought that the main() function orchestrates the process: it fetches the city links, scrapes the infobox data for each city, and stores the collected data in a pandas DataFrame. Finally, it saves the DataFrame to a CSV file.
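
One detail worth checking in fetch_city_links: the href values on a wiki list page are site-relative (they start with /wiki/), so they need a base before requests.get can fetch them. A small sketch with urllib.parse.urljoin from the standard library follows; the base URL https://de.wikipedia.org/ and the sample hrefs are assumptions for illustration, not taken from the code above.

```python
from urllib.parse import urljoin

# Assumed base; replace with the URL of the list page actually scraped.
BASE_URL = "https://de.wikipedia.org/"


def to_absolute(hrefs, base=BASE_URL):
    """Resolve hrefs against `base`, keeping only article-style /wiki/ links."""
    return [urljoin(base, href) for href in hrefs if href.startswith("/wiki/")]


# Hypothetical hrefs as they might come out of a['href'].
sample_hrefs = ["/wiki/Abenberg", "#cite_note-1", "/wiki/Roth"]
absolute = to_absolute(sample_hrefs)
print(absolute)
```

Passing the list page URL itself as the base also works, because urljoin resolves a leading slash against the host.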

BTW: I hope this does not derail the thread. I hope this extended question is okay here; if not, I can open a new thread. Thanks for everything!


  • Your selector is wrong.

    The names of the towns are in an a tag, which is inside an li tag, which in turn is under a div with class column-multiple.

    First, get all divs with class column-multiple, then get all the li items from those divs, and then get the href attribute of all the a tags inside.

    import requests
    from bs4 import BeautifulSoup

    url = ""
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Find all the div elements with class column-multiple
        divs = soup.find_all('div', class_='column-multiple')
        href_list = []
        for div in divs:
            # Find all li elements within the div.column-multiple
            li_items = div.find_all('li')
            for li in li_items:
                # Now get the href of all <a> tags in the li items
                a_tags = li.find_all('a', href=True)
                href_list.extend([a['href'] for a in a_tags])
        # Print every collected href
        for href in href_list:
            print(href)

    It will print what you want: