python, pandas, dataframe, nba-api

How to use nba_api to find all player seasons in which a player has averaged x steals per game, and then record total games with n steals?


I'm trying to find out whether a player's steals in a single game are Poisson distributed. The idea was to create buckets for each decimal value of steals per game -- i.e. for all players who averaged 1.2 steals per game for a season over the past 10 or so years, what is the distribution of their single-game steal numbers in those seasons? How many total games with 0 steals, how many with 1 steal, etc. I was going to look at the variance and histogram of this data to see if it resembles a Poisson distribution with lambda = 1.2.
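To make the check concrete, here is a minimal sketch of the comparison described above, using only the standard library. The function names and the ten-game sample are made up for illustration; real game logs would replace the sample. It computes the sample mean and variance (which should be close to each other for Poisson data) and puts the observed share of games at each steal count next to the Poisson probability.

```python
import math
from collections import Counter


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events under Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def compare_to_poisson(steals: list, lam: float) -> dict:
    """Summarise single-game steal counts against a Poisson(lam) model."""
    n = len(steals)
    mean = sum(steals) / n
    # Population variance; for Poisson data it should be close to the mean
    variance = sum((s - mean) ** 2 for s in steals) / n
    counts = Counter(steals)
    # For each steal count k: (observed share of games, Poisson probability)
    rows = {
        k: (counts.get(k, 0) / n, poisson_pmf(k, lam))
        for k in range(max(steals) + 1)
    }
    return {"mean": mean, "variance": variance, "rows": rows}


# Made-up sample of ten games, for illustration only
sample = [0, 1, 2, 1, 0, 3, 1, 2, 1, 1]
summary = compare_to_poisson(sample, lam=1.2)

for k, (observed, expected) in summary["rows"].items():
    print(f"{k} steals: observed {observed:.2f}, Poisson {expected:.2f}")
```

A variance well below or above the mean would be a first hint that the Poisson model does not fit; a formal test (for example a chi-square goodness-of-fit test) would be the next step.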

After several hours sifting through the nba_api documentation and then resorting to ChatGPT, I've produced the following monstrosity of code.

The code doesn't work. It just runs forever, and then gives me some disconnection/runtime error or the following error:

ValueError: No objects to concatenate.

I tried the simplified instance below, where I just tried to create a list of players with 1.2 steals per game and then find the total number of games with 0 steals in those respective seasons. filtered_seasons should be a dataframe of qualifying seasons.

from nba_api.stats.static import players
from nba_api.stats.endpoints import playercareerstats, playergamelog
import pandas as pd
import datetime

# Get a list of all NBA players
nba_players = players.get_players()

# Initialize a list to store player seasons
player_seasons = []

# Initialize a list to store game logs for players with 1.2 steals per game
player_game_logs = []

# Iterate through the list of players
for player in nba_players:
    player_id = player['id']
    
    # Retrieve player career statistics
    career_stats = playercareerstats.PlayerCareerStats(player_id=player_id)
    
    # Get the DataFrame of player career stats
    career_stats_df = career_stats.get_data_frames()[0]
    
    # Filter for seasons with exactly 1.2 steals per game
    filtered_seasons = career_stats_df[career_stats_df['STL'] == 1.2]
    
    if not filtered_seasons.empty:
        player_seasons.append(filtered_seasons)
        
        # Iterate through the filtered seasons and fetch game logs
        for season in filtered_seasons['SEASON_ID']:
            game_log = playergamelog.PlayerGameLog(player_id=player_id, season=season)
            game_log_df = game_log.get_data_frames()[0]
            player_game_logs.append(game_log_df)

# Concatenate the filtered DataFrames
result_seasons_df = pd.concat(player_seasons, ignore_index=True)
result_game_logs_df = pd.concat(player_game_logs, ignore_index=True)

# Find games where players recorded 0 steals
zero_steals_games = result_game_logs_df[result_game_logs_df['STL'] == 0]

# Count the total number of games with 0 steals
total_zero_steals_games = len(zero_steals_games)

# Print the result
print(f"Total games with 0 steals: {total_zero_steals_games}")

Solution

  • Your code appears to loop "infinitely" most likely because it iterates over every player who has ever played in the NBA (or at least every player retrievable through the API) and makes at least one request per player. Making that many API calls in a short period of time can trigger rate limiting, which shows up as slow response times, timeouts, or even a temporary suspension of access.
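One way to be gentler on the API is to pause before every request and retry with a longer wait when a request fails. The helper below is only a sketch: call_with_retry is a hypothetical name, and the 0.6 second pause and the backoff factor are guesses, not documented limits of the NBA stats service.

```python
import time


def call_with_retry(fetch, retries=3, pause=0.6, backoff=2.0, sleep=time.sleep):
    """Call fetch() with a pause before each attempt, retrying on failure.

    fetch   -- zero-argument callable that performs one API request
    retries -- maximum number of attempts (must be at least 1)
    pause   -- seconds to wait before the first attempt (assumed value)
    backoff -- multiplier applied to the wait after each failed attempt
    sleep   -- injectable sleep function, handy for testing
    """
    wait = pause
    last_error = None
    for _ in range(retries):
        sleep(wait)
        try:
            return fetch()
        except Exception as error:  # broad on purpose: timeouts, dropped connections
            last_error = error
            wait *= backoff
    raise last_error


# Hypothetical usage with nba_api (not executed here):
# career_stats_df = call_with_retry(
#     lambda: playercareerstats.PlayerCareerStats(player_id=player_id).get_data_frames()[0]
# )
```

Wrapping each request in a lambda keeps the helper independent of any particular endpoint, so the same function can be used for the career stats and game log calls.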

    Here is some code I wrote that plots the top 10 players by PPG for the 2022-23 season (again, it has to go through every player in the database). It also prints each player's name as it goes, so you can keep track of progress (and witness how slow it can be).

    from nba_api.stats.static import players
    from nba_api.stats.endpoints import playercareerstats, playergamelog
    import pandas as pd
    import matplotlib.pyplot as plt
    
    # Define the desired season
    desired_season = "2022-23"  # Replace with the desired season
    
    # Get a list of all NBA players
    nba_players = players.get_players()
    
    # Initialize a list to store player data
    player_data = []
    
    # Loop through the list of players
    for player in nba_players:
        player_name = player['full_name']
        player_id = player['id']
    
        # Retrieve player career statistics
        career_stats = playercareerstats.PlayerCareerStats(player_id=player_id)
    
        # Get the DataFrame of player career stats
        career_stats_df = career_stats.get_data_frames()[0]
    
        # Filter for the desired season
        filtered_season = career_stats_df[career_stats_df['SEASON_ID'] == desired_season]
    
        if not filtered_season.empty:
            print(f"Processing {player_name}...")
    
            # Get the player's game log for the desired season
            game_log = playergamelog.PlayerGameLog(player_id=player_id, season=desired_season)
            game_log_df = game_log.get_data_frames()[0]
    
            # Calculate the average points per game (PPG) for this player in the season
            avg_ppg = game_log_df['PTS'].mean()
    
            player_data.append({
                'Player Name': player_name,
                'PPG': avg_ppg
            })
    
    # Create a DataFrame from the collected player data
    player_data_df = pd.DataFrame(player_data)
    
    # Sort the players by PPG in descending order and select the top 10
    top_10_players_df = player_data_df.sort_values(by='PPG', ascending=False).head(10)
    
    # Plot a bar chart of the top 10 players by PPG
    plt.barh(top_10_players_df['Player Name'], top_10_players_df['PPG'])
    plt.xlabel('Points Per Game (PPG)')
    plt.ylabel('Player Name')
    plt.title(f'Top 10 Players by PPG in {desired_season}')
    plt.gca().invert_yaxis()  # Invert the y-axis to show the top player at the top
    plt.show()
    

    As for your code specifically, it runs, but it takes a long time because of the sheer number of API calls. The ValueError comes from pd.concat being handed an empty list: no season ever passed the STL == 1.2 filter. One likely reason, as far as I can tell, is that PlayerCareerStats returns season totals by default, so STL holds the total number of steals in a season rather than a per-game average. Here is an updated version of your code that avoids the error and adds some print lines for tracing:

    from nba_api.stats.static import players
    from nba_api.stats.endpoints import playercareerstats, playergamelog
    import pandas as pd
    
    # Get a list of all NBA players
    nba_players = players.get_players()
    
    # Initialize a list to store player seasons
    player_seasons = []
    
    # Initialize a list to store game logs for players with 1.2 steals per game
    player_game_logs = []
    
    # Iterate through the list of players
    for player in nba_players:
        player_id = player['id']
        player_name = player['full_name']
    
        print(f"Processing {player_name}...")  # Print the player name
    
        # Retrieve player career statistics as per-game averages
        # (by default the endpoint returns season totals, so STL would be total steals)
        career_stats = playercareerstats.PlayerCareerStats(player_id=player_id, per_mode36='PerGame')
    
        # Get the DataFrame of player career stats
        career_stats_df = career_stats.get_data_frames()[0]
    
        # Set the steals per game
        SPG = 1.2
    
        # Filter seasons with the desired steals per game (compared at one decimal place)
        filtered_seasons = career_stats_df[career_stats_df['STL'].round(1) == SPG]
    
        if not filtered_seasons.empty:
            player_seasons.append(filtered_seasons)
            print(f"{player_name} has {len(filtered_seasons)} season(s) with {SPG} steal(s) per game.")
    
            # Iterate through the filtered seasons and fetch game logs
            for season in filtered_seasons['SEASON_ID']:
                game_log = playergamelog.PlayerGameLog(player_id=player_id, season=season)
                game_log_df = game_log.get_data_frames()[0]
                player_game_logs.append(game_log_df)
    
    # Check if there is data to concatenate
    if player_seasons:
        # Concatenate the filtered DataFrames
        result_seasons_df = pd.concat(player_seasons, ignore_index=True)
        result_game_logs_df = pd.concat(player_game_logs, ignore_index=True)
    
        # Find games where players recorded 0 steals
        zero_steals_games = result_game_logs_df[result_game_logs_df['STL'] == 0]
    
        # Count the total number of games with 0 steals
        total_zero_steals_games = len(zero_steals_games)
    
        # Print the result
        print(f"Total games with 0 steals: {total_zero_steals_games}")
    else:
        print("No data to concatenate. No players meet the criteria.")
    

    I hope this answers your question.
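As a follow-up, here is a rough sketch of the two pandas steps the question is really after: bucketing seasons by steals per game and tallying games by steal count. It assumes the career table has GP and STL columns holding season totals, and that the game logs have one row per game with an STL column. The function names and the sample rows are made up; the real DataFrames would come from the API calls above.

```python
import pandas as pd


def per_game_bucket(career_df: pd.DataFrame) -> pd.Series:
    """Steals per game for each season row, rounded to one decimal.

    Assumes STL holds season totals and GP the number of games played.
    """
    return (career_df["STL"] / career_df["GP"]).round(1)


def steal_histogram(game_logs_df: pd.DataFrame) -> pd.Series:
    """Number of games at each single-game steal total, including empty bins.

    Assumes the DataFrame contains at least one game.
    """
    counts = game_logs_df["STL"].value_counts()
    full_range = range(int(game_logs_df["STL"].max()) + 1)
    return counts.reindex(full_range, fill_value=0)


# Made-up rows standing in for the API output
career = pd.DataFrame(
    {"SEASON_ID": ["2020-21", "2021-22"], "GP": [70, 50], "STL": [84, 45]}
)
logs = pd.DataFrame({"STL": [0, 1, 1, 3, 0, 1]})

buckets = per_game_bucket(career)
# Compare with a tolerance rather than ==, since these are floats
qualifying = career[(buckets - 1.2).abs() < 0.05]
histogram = steal_histogram(logs)

print(qualifying["SEASON_ID"].tolist())
print(histogram)
```

The histogram gives the count of games with 0 steals, 1 steal, and so on in a single pass, instead of filtering the game logs once per steal value.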