Update: many thanks for the replies and for all the effort! I have added some additional notes below (at the end).
Howdy, I am trying to scrape all the links of a large Wikipedia page, the "Liste der Städte und Gemeinden in Bayern" (list of towns and municipalities in Bavaria), using Python. The trouble is that I cannot figure out how to export all of the links containing "/wiki/" to my CSV file. I am somewhat used to Python, but some things are still foreign to me. Any ideas? Here is what I have so far...
the page: https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern#A
from bs4 import BeautifulSoup as bs
import requests

# note: the list page lives on de.wikipedia.org, not en.wikipedia.org
res = requests.get("https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern")
soup = bs(res.text, "html.parser")

gemeinden_in_bayern = {}
for link in soup.find_all("a"):
    url = link.get("href", "")
    if "/wiki/" in url:
        gemeinden_in_bayern[link.text.strip()] = url

print(gemeinden_in_bayern)
The results are not very specific - the dict also contains maintenance and policy links, e.g.:
nt': 'https://foundation.wikimedia.org/wiki/Special:MyLanguage/Policy:Cookie_statement'}
What I am really aiming for is a list like this:
https://de.wikipedia.org/wiki/Abenberg
https://de.wikipedia.org/wiki/Abensberg
https://de.wikipedia.org/wiki/Absberg
https://de.wikipedia.org/wiki/Abtswind
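The filtering can be tightened by keeping only hrefs that start with "/wiki/" and contain no ":" (which drops namespace pages such as Spezial: or Datei:), and the CSV export handled by the stdlib csv module. A minimal sketch - the inline HTML sample and the file name gemeinden_links.csv are my own stand-ins, not taken from the real page:

```python
import csv
from bs4 import BeautifulSoup

BASE = "https://de.wikipedia.org"

# Small HTML sample standing in for the real list page (assumption: the real
# page mixes article links with namespace links such as /wiki/Spezial:...).
html = """
<ul>
  <li><a href="/wiki/Abenberg">Abenberg</a></li>
  <li><a href="/wiki/Abensberg">Abensberg</a></li>
  <li><a href="/wiki/Spezial:Meine_Diskussionsseite">Diskussion</a></li>
  <li><a href="https://foundation.wikimedia.org/wiki/Policy">Policy</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for link in soup.find_all("a", href=True):
    href = link["href"]
    # keep only local article links; ':' filters out namespaces (Spezial:, Datei:, ...)
    if href.startswith("/wiki/") and ":" not in href:
        rows.append([link.get_text(strip=True), BASE + href])

with open("gemeinden_links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "url"])
    writer.writerows(rows)

print(rows)
```

On the real page you would feed res.text into BeautifulSoup instead of the sample string; the ':' filter is a common heuristic rather than an official rule, so check it against your data.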
BTW, on a side note: the subpages mentioned above contain information in the infobox, which I am able to gather. See an example:
import pandas

urlpage = 'https://de.wikipedia.org/wiki/Abenberg'
data = pandas.read_html(urlpage)[0]
null = data.isnull()
for x in range(len(data)):
    first = data.iloc[x, 0]
    second = data.iloc[x, 1] if not null.iloc[x, 1] else ""
    print(first, second, "\n")
which runs perfectly; see the output:
Basisdaten Basisdaten
Koordinaten: 49° 15′ N, 10° 58′ OKoordinaten: 49° 15′ N, 10° 58′ O
Bundesland: Bayern
Regierungsbezirk: Mittelfranken
Landkreis: Roth
Höhe: 414 m ü. NHN
Fläche: 48,41 km2
Einwohner: 5607 (31. Dez. 2022)[1]
Bevölkerungsdichte: 116 Einwohner je km2
Postleitzahl: 91183
Vorwahl: 09178
Kfz-Kennzeichen: RH, HIP
Gemeindeschlüssel: 09 5 76 111
LOCODE: ABR
Stadtgliederung: 14 Gemeindeteile
Adresse der Stadtverwaltung: Stillaplatz 1 91183 Abenberg
Website: www.abenberg.de
Erste Bürgermeisterin: Susanne König (parteilos)
Lage der Stadt Abenberg im Landkreis Roth Lage der Stadt Abenberg im Landkreis Roth
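Instead of printing each row, the same two-column structure can be collected into a dict, which later maps directly to one CSV row per town. A sketch, under the assumption that read_html returns the infobox as the first table; to keep it self-contained I build a small stand-in DataFrame instead of fetching the page:

```python
import pandas as pd

# Stand-in for pandas.read_html(urlpage)[0]: two columns, key and value,
# as the infobox parses on the real page.
data = pd.DataFrame([
    ["Bundesland:", "Bayern"],
    ["Landkreis:", "Roth"],
    ["Postleitzahl:", "91183"],
])

infobox = {}
for _, row in data.iterrows():
    key = str(row.iloc[0]).rstrip(":")          # drop the trailing colon
    value = "" if pd.isna(row.iloc[1]) else str(row.iloc[1])
    infobox[key] = value

print(infobox)
# {'Bundesland': 'Bayern', 'Landkreis': 'Roth', 'Postleitzahl': '91183'}
```

With one such dict per town, pandas.DataFrame(list_of_dicts).to_csv(...) gives the final table.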
That said, I found out that the infobox is a typical wiki component. So if I get familiar with this part, I will have learned a lot for future tasks - not only for me, but for many others diving into the topic of scraping wiki pages. So this might be a general task, helpful and packed with lots of information for many others too.
So far so good: I have a list page that leads to quite a lot of infoboxes: https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern#A
I think it is worth traversing them and fetching each infobox. The information could be gathered with Python code that traverses all of the findings:
https://de.wikipedia.org/wiki/Abenberg
https://de.wikipedia.org/wiki/Abensberg
https://de.wikipedia.org/wiki/Absberg
https://de.wikipedia.org/wiki/Abtswind
...and so on and so forth. Note: with that, I would be able to run my above-mentioned scraper, which can fetch the data of one infobox, over all of these pages.
Update

Hello again dear HedgeHog, hello dear Salman Khan,

First of all, many many thanks for the quick help and your awesome support. Glad that you set me straight. Now that we have all the links of the large Wikipedia page "Liste der Städte und Gemeinden in Bayern", I would love to go ahead and work on the extraction of the infobox - which, BTW, would be a general task that might be interesting for many users on Stack Overflow. In conclusion: see the main page: https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern and a subpage with an infobox: https://de.wikipedia.org/wiki/Abenberg
I gather the data with the same snippet shown above, which again prints the full infobox.
What I am aiming for is to gather the data of the infoboxes from all of the pages.
import requests
from bs4 import BeautifulSoup
import pandas as pd

def fetch_city_links(list_url):
    response = requests.get(list_url)
    if response.status_code != 200:
        print(f"Failed to retrieve the page: {list_url}")
        return []
    soup = BeautifulSoup(response.content, 'html.parser')
    divs = soup.find_all('div', class_='column-multiple')
    href_list = []
    for div in divs:
        li_items = div.find_all('li')
        for li in li_items:
            a_tags = li.find_all('a', href=True)
            href_list.extend(['https://de.wikipedia.org' + a['href'] for a in a_tags])
    return href_list

def scrape_infobox(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    infobox = soup.find('table', {'class': 'infobox'})
    if not infobox:
        print(f"No infobox found on this page: {url}")
        return None
    data = {}
    for row in infobox.find_all('tr'):
        header = row.find('th')
        value = row.find('td')
        if header and value:
            data[header.get_text(" ", strip=True)] = value.get_text(" ", strip=True)
    return data

def main():
    list_url = 'https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern'
    city_links = fetch_city_links(list_url)
    all_data = []
    for link in city_links:
        print(f"Scraping {link}")
        infobox_data = scrape_infobox(link)
        if infobox_data:
            infobox_data['URL'] = link
            all_data.append(infobox_data)
    df = pd.DataFrame(all_data)
    df.to_csv('wikipedia_infoboxes.csv', index=False)

if __name__ == "__main__":
    main()
Well, I thought that the main() function orchestrates the process: it fetches the city links, scrapes the infobox data for each city, and stores the collected data in a pandas DataFrame. Finally, it saves the DataFrame to a CSV file.
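One thing the script does not yet do is pace its requests: the Bavarian list links to a lot of pages, and fetching them in a tight loop is impolite to Wikipedia and can get you throttled. A hedged sketch of a generic wrapper - polite_fetch is a name I made up, and it would wrap a per-page function like scrape_infobox from above:

```python
import time

def polite_fetch(fetch, urls, delay=0.5, retries=2):
    """Call fetch(url) for each url, pausing between requests and
    retrying failures a few times. fetch is assumed to raise an
    exception on errors (e.g. a requests call using raise_for_status)."""
    results = {}
    for url in urls:
        for attempt in range(retries + 1):
            try:
                results[url] = fetch(url)
                break
            except Exception:
                if attempt == retries:
                    results[url] = None  # give up on this page
                else:
                    time.sleep(delay)    # brief pause before retrying
        time.sleep(delay)                # be polite between pages
    return results
```

Usage would then be something like results = polite_fetch(scrape_infobox, city_links), keeping the scraping logic itself unchanged.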
BTW: I hope this does not derail the thread. I hope this extended question is okay here - if not, I can open a new thread! Thanks for everything.
Your selector is wrong. The names of the towns are in a tags, which sit inside li tags, which in turn are under a div with class column-multiple.
First, get all divs with class column-multiple, then get all the li items from the gathered divs, and finally get the href attribute of all the a tags inside.
url = "https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_und_Gemeinden_in_Bayern"
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    # find all the div elements with class column-multiple
    divs = soup.find_all('div', class_='column-multiple')
    href_list = []
    for div in divs:
        # find all li elements within the div.column-multiple
        li_items = div.find_all('li')
        for li in li_items:
            # now get the href of all <a> tags in the li items
            a_tags = li.find_all('a', href=True)
            href_list.extend([a['href'] for a in a_tags])
    for href in href_list:
        print(f"https://de.wikipedia.org{href}")
It will print what you want:
https://de.wikipedia.org/wiki/Amberg
https://de.wikipedia.org/wiki/Ansbach
https://de.wikipedia.org/wiki/Aschaffenburg
https://de.wikipedia.org/wiki/Augsburg
https://de.wikipedia.org/wiki/Bamberg
...
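As a side note, the nested find_all loops can be collapsed into a single CSS selector via soup.select - same result, just more compact. A small sketch with an inline HTML sample (my own stand-in, not the real page):

```python
from bs4 import BeautifulSoup

html = """
<div class="column-multiple">
  <ul>
    <li><a href="/wiki/Abenberg">Abenberg</a></li>
    <li><a href="/wiki/Abensberg">Abensberg</a></li>
  </ul>
</div>
<div class="navbox"><a href="/wiki/Spezial:Foo">skip me</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
# one CSS selector replaces the nested find_all loops:
# only <a href=...> inside <li> inside div.column-multiple
hrefs = [a["href"] for a in soup.select("div.column-multiple li a[href]")]
print(hrefs)  # ['/wiki/Abenberg', '/wiki/Abensberg']
```

The selector matches only anchors under div.column-multiple, so links elsewhere on the page (navboxes, footer) are skipped automatically.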