Scrape Rotowire MLB player news and form it into a table using Python


I would like to scrape https://www.rotowire.com/baseball/news.php, which contains news about MLB players, and save the data in a table format like so:

date  player           headline                news
4/17  Abner Uribe      Picks up second win     Uribe (2-1) earned the win Wednesday against the Padres after he allowed a hit and no walks in a scoreless eighth inning. He had one strikeout.
4/17  Richie Palacios  Gets day off vs. lefty  Palacios is out of the lineup for Wednesday's game against the Angels.

I'm having difficulty understanding how to isolate each piece of content so that every news item becomes its own row in a dataframe. Any help getting this going would be appreciated. Ideally I'd scrape every 5 minutes and keep the table growing.


Solution

  • To get all the info from that page into a dataframe, you can use the following example:

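    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.rotowire.com/baseball/news.php"
    
    # Download the page and parse the HTML; fail early on an HTTP error
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    
    all_data = []
    # Each news item sits in its own ".news-update" container
    for n in soup.select(".news-update"):
        name = n.a.text  # text of the first link in the item, used as the player name
        h = n.select_one(".news-update__headline").text
        dt = n.select_one(".news-update__timestamp").text
        news = n.select_one(".news-update__news").text
        # One dict per news item becomes one row of the dataframe
        all_data.append({"Name": name, "Headline": h, "Date": dt, "News": news})
    
    df = pd.DataFrame(all_data)
    print(df.head())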
    

    Prints:

                   Name                            Headline            Date                                                                                                                                                News
    0       Joe Jacques              Recalled from Triple-A  April 17, 2024                                 Jacques was recalled from Triple-A Worcester by the Red Sox on Wednesday, Mac Cerullo of the Boston Herald reports.
    1    Cedric Mullins                     Walks off Twins  April 17, 2024                                                Mullins went 1-for-4 with a walk-off, two-run home run during Wednesday's 4-2 win against the Twins.
    2  Garrett Whitlock               Lands on injured list  April 17, 2024    Whitlock was placed on the 15-day injured list by the Red Sox on Wednesday with a left oblique strain, Mac Cerullo of the Boston Herald reports.
    3        Eli Morgan  Shelved with shoulder inflammation  April 17, 2024  The Guardians placed Morgan on the 15-day injured list Wednesday with right shoulder inflammation, Joe Noga of The Cleveland Plain Dealer reports.
    4     Craig Kimbrel                     Earns third win  April 17, 2024      Kimbel (3-0) earned the win Wednesday against the Twins after he retired all three batters he faced in the ninth inning. He had one strikeout.
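
The question asks for dates such as 4/17, while the scraped timestamp looks like "April 17, 2024". A minimal sketch of the conversion, assuming every timestamp has exactly the format shown in the output above (the live page may use others) and an English locale for month names; `short_date` is a hypothetical helper name:

```python
from datetime import datetime


def short_date(timestamp: str) -> str:
    """Convert a timestamp like 'April 17, 2024' to 'M/D' form.

    Assumes the '%B %d, %Y' layout seen in the sample output;
    any other layout raises ValueError.
    """
    parsed = datetime.strptime(timestamp.strip(), "%B %d, %Y")
    # Build the string by hand to avoid platform-specific strftime flags
    return f"{parsed.month}/{parsed.day}"


print(short_date("April 17, 2024"))  # 4/17
```

It could be applied to the whole column with `df["Date"].map(short_date)`.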
    

    NOTE: I suggest putting all this info into a SQL database (e.g. SQLite, which is included with Python), skipping any duplicates on insert, and setting up a cron job that runs this script every 5 minutes.
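
The SQLite suggestion in the note can be sketched roughly as follows. This is only one way to do it: the table name `news`, the helper `save_news`, and the choice of (date, player, headline) as the uniqueness key are assumptions, not part of the answer. The rows use the same dict keys as `all_data` above.

```python
import sqlite3


def save_news(conn, rows):
    """Insert scraped rows and return how many were actually new."""
    # The UNIQUE constraint is an assumed definition of "duplicate"
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS news (
            date     TEXT NOT NULL,
            player   TEXT NOT NULL,
            headline TEXT NOT NULL,
            news     TEXT NOT NULL,
            UNIQUE (date, player, headline)
        )
        """
    )
    before = conn.total_changes
    # INSERT OR IGNORE silently skips rows that violate the UNIQUE constraint
    conn.executemany(
        "INSERT OR IGNORE INTO news (date, player, headline, news) "
        "VALUES (:Date, :Name, :Headline, :News)",
        rows,
    )
    conn.commit()
    return conn.total_changes - before


# Demo with an in-memory database and one row taken from the output above
conn = sqlite3.connect(":memory:")
sample = [
    {
        "Name": "Cedric Mullins",
        "Headline": "Walks off Twins",
        "Date": "April 17, 2024",
        "News": "Mullins went 1-for-4 with a walk-off, two-run home run "
                "during Wednesday's 4-2 win against the Twins.",
    }
]
first = save_news(conn, sample)   # row is new
second = save_news(conn, sample)  # same row again, ignored
print(first, second)  # 1 0
```

In the real script you would connect to a file on disk instead of `":memory:"` and call `save_news(conn, all_data)` after each scrape, so that repeated runs only add items that have not been stored yet.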