I encountered a problem I can't explain when performing list concatenation inside vs. outside a function. Specifically, I tried to concatenate a list from a DataFrame column name inside a function and outside the function. Outside seems to work fine, but inside throws an error. See example below.
Example: Suppose I want to write a function that returns multiple lags of a column and assigns new names to the lagged variables based on the original column name. I could do something like this:
import pandas as pd
import numpy as np
rng = np.random.default_rng(22222222)
df = pd.DataFrame({'X':rng.random(10)})
print(df)
          X
0  0.279384
1  0.838032
2  0.298536
3  0.056188
4  0.532023
5  0.560038
6  0.127512
7  0.322774
8  0.813949
9  0.245242
Make the function:
def lagger(column, lags):  # Takes as input a DataFrame column in the form df[colname]
    lags = [column.shift(i) for i in range(1, lags+1)]
    df = pd.concat(lags, axis=1)  # a DataFrame with a column for each lag.
    names = [f"{column.name}_L{i}" for i in range(1,lags+1)]  # generate new names
    df.rename(names, axis='columns', inplace=True)
    return df
Testing the function, I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[27], line 1
----> 1 lagger(df['X'], 3)
Cell In[25], line 6
2 lags = [column.shift(i) for i in range(1, lags+1)]
4 df = pd.concat(lags, axis=1) #a DataFrame with a column for each lag.
----> 6 names = [f"{column.name}_L{i}" for i in range(1,lags+1)] #generate new names
8 df.rename(names, axis='columns', inplace=True)
10 return df
TypeError: can only concatenate list (not "int") to list
Something's going on with the list comprehension. Let's try doing a similar comprehension outside the function:
[f"{df['X'].name}_L{i}" for i in range(1,4)]
['X_L1', 'X_L2', 'X_L3']
Works just fine!
So what's going on here? Why does this work outside the function but not inside?
To be clear, I've already found this answer that clears up how to generate multiple lags. I'm not asking about that. I'm asking what the difference is between doing [f"{column.name}_L{i}" for i in range(1,lags+1)] inside the function versus [f"{df['X'].name}_L{i}" for i in range(1,4)] outside the function. Why does it fail inside the function?
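You overwrite your parameter lags with a list on the first line of your function. By the time the names comprehension runs, lags + 1 in range(1, lags+1) is list + int, which raises the TypeError. Use another name for the list. Also, your use of rename is incorrect: it expects a mapping or a function, not a list of new labels. You should use set_axis: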
def lagger(column, lags):  # Takes as input a DataFrame column in the form df[colname]
    shifted = [column.shift(i) for i in range(1, lags+1)]  # separate name, so lags stays an int
    df = pd.concat(shifted, axis=1)  # a DataFrame with a column for each lag.
    names = [f"{column.name}_L{i}" for i in range(1, lags+1)]  # generate new names
    return df.set_axis(names, axis='columns')
lagger(df['X'], 3)
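To see the rebinding in isolation, here is a minimal sketch without pandas. The function names make_names_broken and make_names_fixed and the variable steps are made up for illustration and are not part of the original code; the first function reuses the parameter name the way the question does, the second keeps the int and the list under separate names.

```python
def make_names_broken(prefix, lags):
    # Rebinding the parameter: from here on, `lags` is a list, not an int.
    lags = [i for i in range(1, lags + 1)]
    # `lags + 1` is now list + int, which raises TypeError.
    return [f"{prefix}_L{i}" for i in range(1, lags + 1)]


def make_names_fixed(prefix, lags):
    # A separate (hypothetical) name keeps `lags` bound to the original int.
    steps = list(range(1, lags + 1))
    return [f"{prefix}_L{i}" for i in steps]


try:
    make_names_broken("X", 3)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

fixed_names = make_names_fixed("X", 3)
print(fixed_names)
```

The comprehension outside the function works for the same reason the fixed version does: the 4 in range(1,4) is still an int, whereas inside the original function lags has already been replaced by a list when the second comprehension runs.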