I am trying to scrape the MLB daily lineup information from here: https://www.rotowire.com/baseball/daily-lineups.php
I am using Python with requests, BeautifulSoup, and pandas.
My ultimate goal is to end up with two pandas data frames.
First is a starting pitching data frame:
date | game_time | pitcher_name | team | lineup_throws |
---|---|---|---|---|
2024-03-29 | 1:40 PM ET | Spencer Strider | ATL | R |
2024-03-29 | 1:40 PM ET | Zack Wheeler | PHI | R |
Second is a starting batter data frame:
date | game_time | batter_name | team | pos | batting_order | lineup_bats |
---|---|---|---|---|---|---|
2024-03-29 | 1:40 PM ET | Ronald Acuna | ATL | RF | 1 | R |
2024-03-29 | 1:40 PM ET | Ozzie Albies | ATL | 2B | 2 | S |
2024-03-29 | 1:40 PM ET | Austin Riley | ATL | 3B | 3 | R |
2024-03-29 | 1:40 PM ET | Kyle Schwarber | PHI | DH | 1 | L |
2024-03-29 | 1:40 PM ET | Trea Turner | PHI | SS | 2 | R |
2024-03-29 | 1:40 PM ET | Bryce Harper | PHI | 1B | 3 | L |
This would be for all games on a given day.
I've tried adapting this answer to my needs but can't quite get it to work: Scraping Web data using BeautifulSoup
Any help or guidance is greatly appreciated.
Here is the code from the linked answer that I am trying to adapt, so far without progress:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.rotowire.com/baseball/daily-lineups.php"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

weather = []
for tag in soup.select(".lineup__bottom"):
    header = tag.find_previous(class_="lineup__teams").get_text(
        strip=True, separator=" vs "
    )
    rain = tag.select_one(".lineup__weather-text > b")
    forecast_info = rain.next_sibling.split()
    temp = forecast_info[0]
    wind = forecast_info[2]
    weather.append(
        {"Header": header, "Rain": rain.text.split()[0], "Temp": temp, "Wind": wind}
    )

df = pd.DataFrame(weather)
print(df)
The information I want seems to be contained in lineup__main and not in lineup__bottom.
You have to iterate over the list items in each lineup box and pull out every field you need.
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.rotowire.com/baseball/daily-lineups.php"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

data_pitching = []
data_batter = []
team_type = ''

for e in soup.select('.lineup__box ul li'):
    # The last class on the parent <ul> identifies which team's list this is.
    # When it changes we have moved on to another team, so restart the batting order.
    if team_type != e.parent.get('class')[-1]:
        order_count = 1
        team_type = e.parent.get('class')[-1]

    if e.get('class') and 'lineup__player-highlight' in e.get('class'):
        # Highlighted entry: the starting pitcher.
        data_pitching.append({
            'date': e.find_previous('main').get('data-gamedate'),
            'game_time': e.find_previous('div', attrs={'class': 'lineup__time'}).get_text(strip=True),
            'pitcher_name': e.a.get_text(strip=True),
            'team': e.find_previous('div', attrs={'class': team_type}).next_element.strip(),
            'lineup_throws': e.span.get_text(strip=True)
        })
    elif e.get('class') and 'lineup__player' in e.get('class'):
        # Regular entry: a batter, in lineup order.
        data_batter.append({
            'date': e.find_previous('main').get('data-gamedate'),
            'game_time': e.find_previous('div', attrs={'class': 'lineup__time'}).get_text(strip=True),
            'batter_name': e.a.get_text(strip=True),
            'team': e.find_previous('div', attrs={'class': team_type}).next_element.strip(),
            'pos': e.div.get_text(strip=True),
            'batting_order': order_count,
            'lineup_bats': e.span.get_text(strip=True)
        })
        order_count += 1

df_pitching = pd.DataFrame(data_pitching)
df_batter = pd.DataFrame(data_batter)

df_pitching:

| | date | game_time | pitcher_name | team | lineup_throws |
|---|---|---|---|---|---|
| 0 | 2024-03-29 | 1:40 PM ET | Freddy Peralta | Brewers | R |
| 1 | 2024-03-29 | 1:40 PM ET | Jose Quintana | Mets | L |
| ... | | | | | |
| 19 | 2024-03-29 | 10:10 PM ET | Bobby Miller | Dodgers | R |

df_batter:

| | date | game_time | batter_name | team | pos | batting_order | lineup_bats |
|---|---|---|---|---|---|---|---|
| 0 | 2024-03-29 | 1:40 PM ET | J. Chourio | Brewers | RF | 1 | R |
| 1 | 2024-03-29 | 1:40 PM ET | W. Contreras | Brewers | C | 2 | R |
| ... | | | | | | | |
| 178 | 2024-03-29 | 10:10 PM ET | E. Hernandez | Dodgers | CF | 8 | R |
| 179 | 2024-03-29 | 10:10 PM ET | Gavin Lux | Dodgers | 2B | 9 | L |
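The tables above carry team nicknames (Brewers, Mets, Dodgers), while the question asks for abbreviations such as ATL and PHI. One way to bridge that gap is a hand-maintained lookup applied to the scraped rows. This is only a minimal sketch: `TEAM_ABBR` and `add_team_abbr` are hypothetical names introduced here, not something the site or any library provides, and the mapping is deliberately partial.

```python
# Hypothetical, partial lookup from scraped nickname to the abbreviation
# format requested in the question; extend it to cover all 30 clubs.
TEAM_ABBR = {
    "Braves": "ATL",
    "Phillies": "PHI",
    "Brewers": "MIL",
    "Mets": "NYM",
    "Dodgers": "LAD",
}


def add_team_abbr(rows, mapping=TEAM_ABBR):
    """Return copies of the row dicts with 'team' replaced by its abbreviation.

    Names missing from the mapping are left unchanged so gaps are easy to spot.
    """
    return [{**row, "team": mapping.get(row["team"], row["team"])} for row in rows]


# Sample rows shaped like the pitcher dicts built in the scraping loop.
sample = [
    {"pitcher_name": "Freddy Peralta", "team": "Brewers", "lineup_throws": "R"},
    {"pitcher_name": "Jose Quintana", "team": "Mets", "lineup_throws": "L"},
]
converted = add_team_abbr(sample)
print(converted)
```

The same helper could be applied to the pitcher and batter lists built in the loop, just before they are passed to `pd.DataFrame(...)`. Leaving unknown names untouched is a design choice: a nickname that survives in the `team` column signals an entry still missing from the mapping.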