
Scraping MLB daily lineups from rotowire using python


I am trying to scrape the MLB daily lineup information from here: https://www.rotowire.com/baseball/daily-lineups.php

I am trying to use python with requests, BeautifulSoup and pandas.

My ultimate goal is to end up with two pandas data frames.

First is a starting pitching data frame:

date        game_time   pitcher_name     team  lineup_throws
2024-03-29  1:40 PM ET  Spencer Strider  ATL   R
2024-03-29  1:40 PM ET  Zack Wheeler     PHI   R

Second is a starting batter data frame:

date        game_time   batter_name     team  pos  batting_order  lineup_bats
2024-03-29  1:40 PM ET  Ronald Acuna    ATL   RF   1              R
2024-03-29  1:40 PM ET  Ozzie Albies    ATL   2B   2              S
2024-03-29  1:40 PM ET  Austin Riley    ATL   3B   3              R
2024-03-29  1:40 PM ET  Kyle Schwarber  PHI   DH   1              L
2024-03-29  1:40 PM ET  Trea Turner     PHI   SS   2              R
2024-03-29  1:40 PM ET  Bryce Harper    PHI   1B   3              L

This would cover all games on a given day.

I've tried adapting the answer to "Scraping Web data using BeautifulSoup" to my needs, but I can't quite get it to work.

Any help or guidance is greatly appreciated.

Here is the code from that answer, which I have been trying to adapt without making progress:

import pandas as pd
import requests
from bs4 import BeautifulSoup


url = "https://www.rotowire.com/baseball/daily-lineups.php"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

weather = []

for tag in soup.select(".lineup__bottom"):
    header = tag.find_previous(class_="lineup__teams").get_text(
        strip=True, separator=" vs "
    )
    rain = tag.select_one(".lineup__weather-text > b")
    forecast_info = rain.next_sibling.split()
    temp = forecast_info[0]
    wind = forecast_info[2]

    weather.append(
        {"Header": header, "Rain": rain.text.split()[0], "Temp": temp, "Wind": wind}
    )


df = pd.DataFrame(weather)
print(df)

The information I want seems to be contained in lineup__main and not in lineup__bottom.


Solution

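  • Iterate over the list items in each lineup box and pick out the fields you need. The starting pitcher is the item with the class lineup__player-highlight; the batters carry the class lineup__player.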

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    
    url = "https://www.rotowire.com/baseball/daily-lineups.php"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    
    data_pitching = []
    data_batter = []
    team_type = ''
    
    for e in soup.select('.lineup__box ul li'):
        # The last class of the parent <ul> tells which team's list this is.
        # When it changes, a new lineup starts, so reset the batting order.
        if team_type != e.parent.get('class')[-1]:
            order_count = 1
            team_type = e.parent.get('class')[-1]
    
        # The highlighted list item is the starting pitcher.
        if e.get('class') and 'lineup__player-highlight' in e.get('class'):
            data_pitching.append({
                'date': e.find_previous('main').get('data-gamedate'),
                'game_time': e.find_previous('div', attrs={'class':'lineup__time'}).get_text(strip=True),
                'pitcher_name':e.a.get_text(strip=True),
                'team':e.find_previous('div', attrs={'class':team_type}).next.strip(),
                'lineup_throws':e.span.get_text(strip=True)
            })
        # Every other player item is a batter, listed in batting order.
        elif e.get('class') and 'lineup__player' in e.get('class'):
            data_batter.append({
                'date': e.find_previous('main').get('data-gamedate'),
                'game_time': e.find_previous('div', attrs={'class':'lineup__time'}).get_text(strip=True),
                'batter_name':e.a.get_text(strip=True),
                'team':e.find_previous('div', attrs={'class':team_type}).next.strip(),
                'pos': e.div.get_text(strip=True),
                'batting_order':order_count,
                'lineup_bats':e.span.get_text(strip=True)
            })
            order_count+=1
    
    df_pitching = pd.DataFrame(data_pitching)
    df_batter = pd.DataFrame(data_batter)
    
    print(df_pitching)
    print(df_batter)
    
    Output:
    
        date        game_time    pitcher_name    team     lineup_throws
    0   2024-03-29  1:40 PM ET   Freddy Peralta  Brewers  R
    1   2024-03-29  1:40 PM ET   Jose Quintana   Mets     L
    ..
    19  2024-03-29  10:10 PM ET  Bobby Miller    Dodgers  R
    
         date        game_time    batter_name   team     pos  batting_order  lineup_bats
    0    2024-03-29  1:40 PM ET   J. Chourio    Brewers  RF   1              R
    1    2024-03-29  1:40 PM ET   W. Contreras  Brewers  C    2              R
    ...
    178  2024-03-29  10:10 PM ET  E. Hernandez  Dodgers  CF   8              R
    179  2024-03-29  10:10 PM ET  Gavin Lux     Dodgers  2B   9              L
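
The same idea can also be written per game box instead of tracking the current team in a loop variable, which makes it easy to try out offline. The following is only a sketch under assumptions: `parse_lineups` and `SAMPLE_HTML` are hypothetical names, the HTML fragment is hand-written to imitate the class names used in the code above (it is not copied from the live page), the `is-visit` / `is-home` classes and the "Placeholder Batter" row are made up for illustration, and the real markup may differ or change.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written fragment imitating the assumed page structure (not the real page).
SAMPLE_HTML = """
<main data-gamedate="2024-03-29">
  <div class="lineup__time">1:40 PM ET</div>
  <div class="lineup__box">
    <div class="lineup__mteam is-visit">Brewers <span>(0-0)</span></div>
    <div class="lineup__mteam is-home">Mets <span>(0-0)</span></div>
    <ul class="lineup__list is-visit">
      <li class="lineup__player-highlight"><a>Freddy Peralta</a> <span>R</span></li>
      <li class="lineup__player"><div>RF</div><a>J. Chourio</a> <span>R</span></li>
      <li class="lineup__player"><div>C</div><a>W. Contreras</a> <span>R</span></li>
    </ul>
    <ul class="lineup__list is-home">
      <li class="lineup__player-highlight"><a>Jose Quintana</a> <span>L</span></li>
      <li class="lineup__player"><div>CF</div><a>Placeholder Batter</a> <span>L</span></li>
    </ul>
  </div>
</main>
"""

PITCHER_COLS = ["date", "game_time", "pitcher_name", "team", "lineup_throws"]
BATTER_COLS = ["date", "game_time", "batter_name", "team", "pos",
               "batting_order", "lineup_bats"]


def parse_lineups(html):
    """Return (pitchers_df, batters_df) parsed from lineup HTML."""
    soup = BeautifulSoup(html, "html.parser")
    pitchers, batters = [], []

    for box in soup.select(".lineup__box"):
        date = box.find_previous("main").get("data-gamedate")
        game_time = box.find_previous("div", class_="lineup__time").get_text(strip=True)

        # Assumed side classes; each side has a team <div> and a player <ul>.
        for side in ("is-visit", "is-home"):
            team_div = box.select_one(f"div.{side}")
            team = next(team_div.stripped_strings)  # first text node, e.g. the team name
            common = {"date": date, "game_time": game_time, "team": team}

            order = 0  # batting order restarts for each team
            for li in box.select(f"ul.{side} > li"):
                classes = li.get("class") or []
                if "lineup__player-highlight" in classes:
                    pitchers.append({**common,
                                     "pitcher_name": li.a.get_text(strip=True),
                                     "lineup_throws": li.span.get_text(strip=True)})
                elif "lineup__player" in classes:
                    order += 1
                    batters.append({**common,
                                    "batter_name": li.a.get_text(strip=True),
                                    "pos": li.div.get_text(strip=True),
                                    "batting_order": order,
                                    "lineup_bats": li.span.get_text(strip=True)})

    # columns= fixes the column order regardless of dict key order.
    return (pd.DataFrame(pitchers, columns=PITCHER_COLS),
            pd.DataFrame(batters, columns=BATTER_COLS))


df_pitching, df_batter = parse_lineups(SAMPLE_HTML)
```

To run it against the live page, pass `requests.get(url).text` to `parse_lineups` instead of `SAMPLE_HTML`; if a selector returns `None` there, the page structure does not match the assumptions made here and the selectors need adjusting.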