I would like to scrape https://www.rotowire.com/baseball/news.php, which contains news about MLB players, and save the data in a table like this:
date | player | headline | news
---|---|---|---
4/17 | Abner Uribe | Picks up second win | Uribe (2-1) earned the win Wednesday against the Padres after he allowed a hit and no walks in a scoreless eighth inning. He had one strikeout.
4/17 | Richie Palacios | Gets day off vs. lefty | Palacios is out of the lineup for Wednesday's game against the Angels.
I'm having difficulty understanding how to isolate each news item into its own row of a dataframe. Any help to get this going would be appreciated. Ideally I'd scrape every 5 minutes and keep appending to the table.
To get all the info from that page into a dataframe, you can use the following example:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.rotowire.com/baseball/news.php"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
# every news item on the page is an element with the class "news-update"
for n in soup.select(".news-update"):
    name = n.a.text  # text of the first link inside the item
    h = n.select_one(".news-update__headline").text
    dt = n.select_one(".news-update__timestamp").text
    news = n.select_one(".news-update__news").text
    all_data.append({"Name": name, "Headline": h, "Date": dt, "News": news})

df = pd.DataFrame(all_data)
print(df.head())
Prints:
Name Headline Date News
0 Joe Jacques Recalled from Triple-A April 17, 2024 Jacques was recalled from Triple-A Worcester by the Red Sox on Wednesday, Mac Cerullo of the Boston Herald reports.
1 Cedric Mullins Walks off Twins April 17, 2024 Mullins went 1-for-4 with a walk-off, two-run home run during Wednesday's 4-2 win against the Twins.
2 Garrett Whitlock Lands on injured list April 17, 2024 Whitlock was placed on the 15-day injured list by the Red Sox on Wednesday with a left oblique strain, Mac Cerullo of the Boston Herald reports.
3 Eli Morgan Shelved with shoulder inflammation April 17, 2024 The Guardians placed Morgan on the 15-day injured list Wednesday with right shoulder inflammation, Joe Noga of The Cleveland Plain Dealer reports.
4 Craig Kimbrel Earns third win April 17, 2024 Kimbel (3-0) earned the win Wednesday against the Twins after he retired all three batters he faced in the ninth inning. He had one strikeout.
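The question asks for a short date such as 4/17 and the column order date, player, headline, news, while the output above carries the date as "April 17, 2024". A minimal sketch of that conversion, assuming the timestamp text always has the form shown in the output above (the helper names `short_date` and `to_row` are hypothetical, not part of the original answer):

```python
from datetime import datetime


def short_date(text):
    """Turn 'April 17, 2024' into '4/17'.

    Assumes an English month name followed by day and year, as in the
    output above; any other format raises ValueError.
    """
    d = datetime.strptime(text.strip(), "%B %d, %Y")
    return f"{d.month}/{d.day}"


def to_row(item):
    """Map one scraped dict to the column names used in the question."""
    return {
        "date": short_date(item["Date"]),
        "player": item["Name"],
        "headline": item["Headline"],
        "news": item["News"],
    }


sample = {
    "Name": "Cedric Mullins",
    "Headline": "Walks off Twins",
    "Date": "April 17, 2024",
    "News": "Mullins went 1-for-4 with a walk-off, two-run home run.",
}
print(to_row(sample)["date"])  # 4/17
```

With the dataframe from the answer, the same helper could be applied as `df["Date"].map(short_date)`. Note that dropping the year makes rows from different seasons indistinguishable, so it may be preferable to store the full date and shorten it only for display.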
NOTE: I suggest storing this data in a SQL database (e.g. SQLite, which is included with Python) so that duplicates are not inserted, and setting up a cron job that runs the script every 5 minutes.
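The SQLite suggestion could look roughly like the sketch below. It uses only the standard-library `sqlite3` module and takes the `all_data` list of dicts built by the scraping loop. The function name `save_news`, the file name `news.db`, the table layout, and the choice of (date, player, headline) as the key that identifies a duplicate are all assumptions made for illustration.

```python
import sqlite3


def save_news(rows, db_path="news.db"):
    """Store scraped rows in SQLite, skipping rows that are already present.

    `rows` is a list of dicts with the keys Name, Headline, Date and News.
    Returns the number of newly inserted rows.
    """
    con = sqlite3.connect(db_path)
    try:
        with con:  # commits on success, rolls back on error
            con.execute(
                """
                CREATE TABLE IF NOT EXISTS news (
                    date     TEXT,
                    player   TEXT,
                    headline TEXT,
                    news     TEXT,
                    UNIQUE (date, player, headline)
                )
                """
            )
            before = con.total_changes
            # OR IGNORE silently drops rows that violate the UNIQUE constraint
            con.executemany(
                "INSERT OR IGNORE INTO news (date, player, headline, news) "
                "VALUES (:Date, :Name, :Headline, :News)",
                rows,
            )
            return con.total_changes - before
    finally:
        con.close()
```

Calling `save_news(all_data)` at the end of the script makes repeated runs safe, since only items not seen before are added. A crontab entry along the lines of `*/5 * * * * python3 /path/to/scraper.py` (the path is a placeholder) would then run it every 5 minutes, and the growing table can be loaded back with `pd.read_sql_query("SELECT * FROM news", con)`.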