I have a large pandas dataframe with many time series of different lengths, which looks something like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.date_range(start='2013-01-01', periods=30, freq='Q'),
                   'series1': range(30),
                   'series2': range(30),
                   'series3': range(30)}).set_index('date')
df.iloc[:11, df.columns.get_loc('series1')] = np.nan  # series1 starts 11 quarters late
df.iloc[:6, df.columns.get_loc('series2')] = np.nan   # series2 starts 6 quarters late
I want to apply a statistical filter from statsmodels to each time series, but the filter can't handle missing values and returns a dataframe full of NaNs:
import statsmodels.api as sm
df_cycles, df_trend = sm.tsa.filters.cffilter(df, low=6, high=40)
Is there a clean way to apply the function to each column individually, starting from its first non-NaN observation? I don't want to drop the rows with missing values from the whole dataframe.
You can apply the filter to each column separately and store the results in some container. Here I build a dictionary with the column name as the key and a dataframe holding that column's cycles and trend as the value.
filtered = {}
for c in df.columns:
    cycles, trend = sm.tsa.filters.cffilter(
        df[c].astype(float).dropna(),
        low=6,
        high=40
    )
    filtered[c] = pd.DataFrame({'cycles': cycles, 'trend': trend})
print(filtered['series1'])
#                   cycles  trend
# date
# 2015-12-31  8.881784e-16   11.0
# 2016-03-31  6.661338e-16   11.0
# ...
print(filtered['series3'])
#             cycles  trend
# date
# 2013-03-31     0.0    0.0
# 2013-06-30     0.0    0.0
# ...
As you can see, the NaN rows were dropped for each column individually, without affecting the other columns.
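Note that dropna() also removes any interior gaps, not just the leading NaNs; if that matters, you could instead slice each series from its first valid observation, e.g. df[c].loc[df[c].first_valid_index():].

If you then want two wide dataframes shaped like cffilter's usual output, with NaNs restored wherever a series has no data, you can reassemble the dictionary. A minimal sketch, reusing the df_cycles/df_trend names from your question:

# Align each per-column result on the original index;
# rows where a series was NaN come back as NaN.
df_cycles = pd.DataFrame({c: f['cycles'] for c, f in filtered.items()}).reindex(df.index)
df_trend = pd.DataFrame({c: f['trend'] for c, f in filtered.items()}).reindex(df.index)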