I'm fairly new to Beautiful Soup/Python/web scraping. I've been able to scrape data from a site, but I can only export the very first row to a CSV file (I want to export all of the scraped data).
I'm stumped on how to make this code export all of the scraped data as individual rows:
r = requests.get("https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses")
data = r.content # Content of response
soup = BeautifulSoup(data, "html.parser")
for span in soup.find_all("span", {"class": "article"}):
    for link in span.select("a"):
        name_and_date = link.text.split('(')
        name = name_and_date[0].strip()
        date = name_and_date[1].replace(')', '').strip()

        base_url = "https://www.infoplease.com"
        links = link['href']
        links = urljoin(base_url, links)

        pres_data = {'Name': [name],
                     'Date': [date],
                     'Link': [links]}

df = pd.DataFrame(pres_data, columns=['Name', 'Date', 'Link'])
df.to_csv(r'C:\Users\ThinkPad\Documents\data_file.csv', index=False, header=True)
print(df)
Any ideas? I believe I need to loop through the parsed data, grab each row, and push it into the file. Am I going about this the right way?
Thanks for any insight.
The way it is currently set up, you are not adding each link as a new entry; the DataFrame ends up holding only the last link. If you initialize a list before the loops and append a dictionary to it on each iteration of the inner loop over the links, you will keep every row instead of just the last one.
import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

r = requests.get("https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses")
data = r.content  # Content of response
soup = BeautifulSoup(data, "html.parser")

pres_data = []  # one dict per link gets appended here
for span in soup.find_all("span", {"class": "article"}):
    for link in span.select("a"):
        name_and_date = link.text.split('(')
        name = name_and_date[0].strip()
        date = name_and_date[1].replace(')', '').strip()

        base_url = "https://www.infoplease.com"
        links = link['href']
        links = urljoin(base_url, links)

        this_data = {'Name': name,
                     'Date': date,
                     'Link': links}
        pres_data.append(this_data)

# Build the DataFrame once, after all rows have been collected
df = pd.DataFrame(pres_data, columns=['Name', 'Date', 'Link'])
df.to_csv(r'C:\Users\ThinkPad\Documents\data_file.csv', index=False, header=True)
print(df)
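If you don't otherwise need pandas, here's a minimal sketch of the same collect-then-write pattern using only the standard library's csv module. It swaps split for str.partition on the assumption that some link texts might lack the "(date)" suffix, in which case indexing name_and_date[1] would raise an IndexError:

import csv
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = "https://www.infoplease.com"
r = requests.get(base_url + "/primary-sources/government/presidential-speeches/state-union-addresses")
soup = BeautifulSoup(r.content, "html.parser")

rows = []
for span in soup.find_all("span", {"class": "article"}):
    for link in span.select("a"):
        # partition never raises: if there is no '(', date is simply ''
        name, _, date = link.text.partition('(')
        rows.append({'Name': name.strip(),
                     'Date': date.replace(')', '').strip(),
                     'Link': urljoin(base_url, link['href'])})

# newline='' prevents blank lines between rows on Windows
with open(r'C:\Users\ThinkPad\Documents\data_file.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['Name', 'Date', 'Link'])
    writer.writeheader()
    writer.writerows(rows)

Either way, the key change is the same: build the full list of rows first, then write the file once at the end.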