Search code examples
pythonpandasmappingtransformdefaultdict

Reassign pandas series values using nested defaultdict


I'm working on an NFL dataset and want to do the following mapping for every play in the df:

  • I'm trying to populate a column (DistToRusher) with the distance of each player to the rusher of that play.
  • The DistToRusher column is currently populated with player ids.
  • I'm trying to map these player ids to the ones in the inner dictionary keys and replace them with the inner dictionary values.
  • I have a defaultdict-of-dictionaries dist_dict that looks like this:
    dist_dict = {play_id1: {player_id1: distance, player_id2: distance ...}, 
                 play_id2: {player_id1: distance, player_id2: distance ...}...}

Here is my code:

def populate_DistToRusher_column(df):
    for play_id, players_dict in dist_dict.items():
        df[df.PlayId == play_id].replace({'DistToRusher': players_dict}, inplace=True)
    return df

This code runs, is slow (20-30s), and doesn't change DistToRusher column; when I inspect the df, DistToRusher still contains the player id numbers and not the distances.

Here is a toy version of the actual data:

from collections import defaultdict 
import pandas as pd
df = pd.DataFrame.from_dict({'PlayId': {
  0: 20170907000118, 1: 20170907000118, 2: 20170907000118,
  22: 20170907000139, 23: 20170907000139, 24: 20170907000139},
 'NflId': {0: 496723, 1: 2495116, 2: 2495493,
  22: 496723, 23: 2495116, 24: 2495493},
 'NflIdRusher': {0: 2543773, 1: 2543773, 2: 2543773,
  22: 2543773, 23: 2543773, 24: 2543773},
 'DistToRusher': {0: 496723, 1: 2495116, 2: 2495493,
  22: 496723, 23: 2495116, 24: 2495493}})

dist_dict = {20170907000118: defaultdict(float,
             {496723: 6.480871854928166,
              2495116: 4.593310353111358,
              2495493: 5.44898155621764}),
 20170907000139: defaultdict(float,
             {496723: 8.583355987025117,
              2495116: 5.821151088917024,
              2495493: 6.658686056573021})}

Solution

  • I think this is right, IIUC:

    temp = pd.DataFrame(dist_dict)
    df['DistToRusher2'] = df.apply(lambda x: temp[x.PlayId][x.NflId], axis=1)
    
    or
    
    df['DistToRusher2'] = df.apply(lambda x: dist_dict[x.PlayId][x.NflId], axis=1)
    
    

    output:

                PlayId    NflId  NflIdRusher  DistToRusher  DistToRusher2
    0   20170907000118   496723      2543773        496723       6.480872
    1   20170907000118  2495116      2543773       2495116       4.593310
    2   20170907000118  2495493      2543773       2495493       5.448982
    22  20170907000139   496723      2543773        496723       8.583356
    23  20170907000139  2495116      2543773       2495116       5.821151
    24  20170907000139  2495493      2543773       2495493       6.658686