Search code examples
python-3.xpandasdataframefeature-engineering

Create one new column in pandas dataframe comprised of previous year stats for each player in the dataframe


(python) I currently have a pandas dataframe that looks something like this:

player        |     year     |     points     |
-----------------------------------------------
LeSean McCoy  |     2012     |     199.3      |
-----------------------------------------------
LeSean McCoy  |     2013     |     332.6      |
-----------------------------------------------
LeSean McCoy  |     2014     |     200.4      |
-----------------------------------------------

I'm trying to add a new column to the dataframe that holds a player's previous year points.

I can do a groupby that transforms the dataframe into one row in this example, with each year being its own column. However, I only want one added column, for example:

player        |     year     |     points     |     prev_year_pts     |
-----------------------------------------------------------------------
LeSean McCoy  |     2012     |     199.3      |        0              |
-----------------------------------------------------------------------
LeSean McCoy  |     2013     |     332.6      |        199.3          |
-----------------------------------------------------------------------
LeSean McCoy  |     2014     |     200.4      |        332.6          |
-----------------------------------------------------------------------

The true dataframe I'm working with has more than 300 unique player names, so I've been trying to get a solution on this example that would be able to also work with a different player name in the sample, with a desired output like this:

player               |     year     |     points     |     prev_year_pts     |
------------------------------------------------------------------------------
LeSean McCoy         |     2012     |     199.3      |        0              |
------------------------------------------------------------------------------
LeSean McCoy         |     2013     |     332.6      |        199.3          |
------------------------------------------------------------------------------
LeSean McCoy         |     2014     |     200.4      |        332.6          |
------------------------------------------------------------------------------
Christian McCaffrey  |     2017     |     228.6      |        0              |
------------------------------------------------------------------------------
Christian McCaffrey  |     2018     |     385.5      |        228.6          |
------------------------------------------------------------------------------
Christian McCaffrey  |     2019     |     471.2      |        385.5          |
------------------------------------------------------------------------------

I've been able to add a prev_year column with the following code:

example["prev_year"] = [x-1 for x in example.groupby(["player"])["year"].get_group("LeSean McCoy")]

But I'm stuck on how to get the prev_year_points from that, and how to implement in a way that could calculate that for each player observation ...


Solution

  • You can try to sort the values by player and year at first, and then do groupby + shift:

    df=df.sort_values(['player','year'])
    df['prev_year_pts']=df.groupby('player')['points'].shift(fill_value=0)
    

    So a little example with the sample you give:

    #create the dataframe
    d={'player': {0: 'LeSean McCoy', 1: 'LeSean McCoy', 2: 'LeSean McCoy', 3: 'Christian McCaffrey', 4: 'Christian McCaffrey', 5: 'Christian McCaffrey'},
        'year': {0: 2013, 1: 2012, 2: 2014, 3: 2019, 4: 2018, 5: 2017}, 'points': {0: 199.3, 1: 332.6, 2: 200.4, 3: 228.6, 4: 385.5, 5: 471.2}}
    
    df=pd.DataFrame(d)
    df
    #                player  year  points
    #0         LeSean McCoy  2013   199.3
    #1         LeSean McCoy  2012   332.6
    #2         LeSean McCoy  2014   200.4
    #3  Christian McCaffrey  2019   228.6
    #4  Christian McCaffrey  2018   385.5
    #5  Christian McCaffrey  2017   471.2
    
    
    df=df.sort_values(['player','year'])
    df
    #                player  year  points
    #5  Christian McCaffrey  2017   471.2
    #4  Christian McCaffrey  2018   385.5
    #3  Christian McCaffrey  2019   228.6
    #1         LeSean McCoy  2012   332.6
    #0         LeSean McCoy  2013   199.3
    #2         LeSean McCoy  2014   200.4
    
    df['prev_year_pts']=df.groupby('player')['points'].shift(fill_value=0)
    df
    #                player  year  points  prev_year_pts
    #5  Christian McCaffrey  2017   471.2            0.0
    #4  Christian McCaffrey  2018   385.5          471.2
    #3  Christian McCaffrey  2019   228.6          385.5
    #1         LeSean McCoy  2012   332.6            0.0
    #0         LeSean McCoy  2013   199.3          332.6
    #2         LeSean McCoy  2014   200.4          199.3