
Multiplying a pandas column by a yearly coefficient in a fast way


I have a dataframe with datetime index:

import pandas as pd

df = pd.DataFrame(
    {'test': [1, 1, 1, 1, 1, 1]},
    index=[
        '2018-01-01', '2018-01-02', '2018-01-03',
        '2019-01-03', '2019-01-02', '2020-01-02'
    ]
)
df.index = pd.to_datetime(df.index)

I have a yearly parameter:

yearly_parameter = [1, 2, 3]

I would like to multiply the column 'test' by its corresponding yearly parameter from the list yearly_parameter (the first value is for 2018, the second for 2019 and the third for 2020). How can I do that efficiently, ideally in a vectorized way? Is a list a good way to store those yearly parameters for this calculation?

I expect the following result in a column, say 'answer':

df['answer'] = [1, 1, 1, 2, 2, 3]

print(df)

              test  answer
2018-01-01     1       1
2018-01-02     1       1
2018-01-03     1       1
2019-01-03     1       2
2019-01-02     1       2
2020-01-02     1       3

Thank you very much for your help,

Pierre


Solution

  • pd.factorize

    Use factorize to assign each row an integer code for its year, numbered in order of first appearance. These codes should correspond to positions in yearly_parameter, so we can multiply efficiently by indexing yearly_parameter with the array of codes.

    This requires yearly_parameter to be at least as long as the number of unique years in df.index.

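    import numpy as np

    # f: integer code per row, y: unique years in order of first appearance
    f, y = pd.factorize(df.index.year)
    
    yearly_parameter = np.array([1, 2, 3])
    
    # yearly_parameter[f] picks one parameter per row
    df.assign(answer=df.test.values * yearly_parameter[f])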
    
                test  answer
    2018-01-01     1       1
    2018-01-02     1       1
    2018-01-03     1       1
    2019-01-03     1       2
    2019-01-02     1       2
    2020-01-02     1       3
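
To make the indexing step concrete, here is a minimal sketch of what factorize returns for the question's data. The names codes, years and per_row are illustrative, and the length check is an added guard that is not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same data as in the question
df = pd.DataFrame(
    {'test': [1, 1, 1, 1, 1, 1]},
    index=pd.to_datetime([
        '2018-01-01', '2018-01-02', '2018-01-03',
        '2019-01-03', '2019-01-02', '2020-01-02',
    ]),
)
yearly_parameter = np.array([1, 2, 3])

# codes: one integer per row, numbering the years in order of first appearance
# years: the distinct years, in that same order
codes, years = pd.factorize(df.index.year)

# Added guard (an assumption about desired behaviour, not from the answer):
# fail early if there are more distinct years than parameters
if len(yearly_parameter) < len(years):
    raise ValueError("yearly_parameter has fewer entries than distinct years")

# Indexing with an integer array selects one parameter per row
per_row = yearly_parameter[codes]
answer = df['test'].to_numpy() * per_row
```

Here codes is [0, 0, 0, 1, 1, 2], so per_row repeats the first parameter three times, the second twice and the third once.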
    

  • np.unique

    Note that the factorize approach assumes the first element of yearly_parameter corresponds to the first year observed in the data, which is not necessarily the earliest one. If you intend the first element to correspond to the minimum year observed, use pd.factorize(df.index.year, sort=True). Or better yet, if you are going to sort anyway, use the equivalent calculation in NumPy:

    # y: sorted unique years, f: position of each row's year within y
    y, f = np.unique(df.index.year, return_inverse=True)
    
    yearly_parameter = np.array([1, 2, 3])
    
    df.assign(answer=df.test.values * yearly_parameter[f])
    
                test  answer
    2018-01-01     1       1
    2018-01-02     1       1
    2018-01-03     1       1
    2019-01-03     1       2
    2019-01-02     1       2
    2020-01-02     1       3
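
The two approaches give the same result on the question's data because the years already appear in ascending order. The sketch below uses a hypothetical, unsorted index (not from the question) to show where they differ; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical index in which 2020 appears before 2018
idx = pd.to_datetime(['2020-01-02', '2018-01-01', '2019-01-03', '2018-01-02'])
yearly_parameter = np.array([1, 2, 3])  # intended for 2018, 2019, 2020

# Codes numbered by order of first appearance: 2020 -> 0, 2018 -> 1, 2019 -> 2
by_appearance, _ = pd.factorize(idx.year)

# Codes numbered by sorted year: 2018 -> 0, 2019 -> 1, 2020 -> 2
by_sorted, _ = pd.factorize(idx.year, sort=True)
years, by_unique = np.unique(idx.year, return_inverse=True)

wrong = yearly_parameter[by_appearance]  # 2020 rows get the 2018 parameter
right = yearly_parameter[by_sorted]      # each row gets its own year's parameter
```

With this index, wrong is [1, 2, 3, 2] while right is [3, 1, 2, 1], and by_unique matches by_sorted. If the years in the data may have gaps or start later than the parameter list, neither positional scheme lines up by itself, so the pairing of years to parameters would need to be checked separately.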