Tags: pandas, function, list-comprehension, f-string

Why does concatenating a list of string items generated from the name of a DataFrame column fail inside this function?


I ran into a problem I can't explain: building a list of strings from a DataFrame column name works outside a function but raises an error inside one. See the example below.

Example: Suppose I want to write a function that returns multiple lags of a column and assigns new names to the lagged variables based on the original column name. I could do something like this:

import pandas as pd
import numpy as np
rng = np.random.default_rng(22222222)

df = pd.DataFrame({'X':rng.random(10)})

print(df)
          X
0  0.279384
1  0.838032
2  0.298536
3  0.056188
4  0.532023
5  0.560038
6  0.127512
7  0.322774
8  0.813949
9  0.245242

Make the function:

def lagger(column, lags): #Takes as input a DataFrame column in the form df[colname]
    lags = [column.shift(i) for i in range(1, lags+1)]

    df = pd.concat(lags, axis=1) #a DataFrame with a column for each lag.

    names = [f"{column.name}_L{i}" for i in range(1,lags+1)] #generate new names

    df.rename(names, axis='columns', inplace=True)

    return df

Testing the function, I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[27], line 1
----> 1 lagger(df['X'], 3)

Cell In[25], line 6
      2 lags = [column.shift(i) for i in range(1, lags+1)]
      4 df = pd.concat(lags, axis=1) #a DataFrame with a column for each lag.
----> 6 names = [f"{column.name}_L{i}" for i in range(1,lags+1)] #generate new names
      8 df.rename(names, axis='columns', inplace=True)
     10 return df

TypeError: can only concatenate list (not "int") to list

Something's going on with the list comprehension. Let's try doing a similar comprehension outside the function:

[f"{df['X'].name}_L{i}" for i in range(1,4)]

['X_L1', 'X_L2', 'X_L3']

Works just fine!

So what's going on here? Why does this work outside the function but not inside?

To be clear, I've already found this answer that clears up how to generate multiple lags. I'm not asking about that. I'm asking what's the difference between doing [f"{column.name}_L{i}" for i in range(1,lags+1)] inside the function versus [f"{df['X'].name}_L{i}" for i in range(1,4)] outside the function? Why does it fail inside the function?


Solution

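  • The first line of your function rebinds the parameter lags to a list. By the time the names comprehension runs, lags+1 is list + int, which raises the TypeError. The comprehension outside the function works because it uses the literal 4 instead of lags+1. Use a different name for the list. Also, your use of rename is incorrect: it expects a mapping or a function, not a list of new labels. To replace all column labels at once, use set_axis: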

    def lagger(column, lags): #Takes as input a DataFrame column in the form df[colname]
        shifted = [column.shift(i) for i in range(1, lags+1)] #new name, so lags stays an int
        df = pd.concat(shifted, axis=1) #a DataFrame with a column for each lag.
        names = [f"{column.name}_L{i}" for i in range(1, lags+1)] #generate new names
        return df.set_axis(names, axis='columns')
    
    lagger(df['X'], 3)
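
The name shadowing can be reproduced without pandas. The following is a minimal sketch, not the original code: the function names broken_names and fixed_names and the parameter n_lags are hypothetical, and plain strings stand in for the shifted columns.

```python
def broken_names(prefix, lags):
    # Same mistake as in the question: the int parameter `lags`
    # is rebound to a list on this line.
    lags = [f"{prefix}_shift{i}" for i in range(1, lags + 1)]
    # `lags` is now a list, so `lags + 1` is list + int -> TypeError.
    return [f"{prefix}_L{i}" for i in range(1, lags + 1)]


def fixed_names(prefix, n_lags):
    # The list gets its own name, so `n_lags` is still an int below.
    shifted = [f"{prefix}_shift{i}" for i in range(1, n_lags + 1)]
    names = [f"{prefix}_L{i}" for i in range(1, n_lags + 1)]
    return shifted, names


try:
    broken_names("X", 3)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

shifted, names = fixed_names("X", 3)
print(error_message)
print(names)
```

The comprehension itself is the same in both functions. The only difference is what the name inside range() refers to at the moment the comprehension runs.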