Use case: I have time series data for multiple assets (eg. AAPL, MSFT) and multiple features (eg. MACD, Volatility etc). I am building a ML model to make classification predictions on a subset of this data.
Problem: For each asset & feature - I want to fit and apply a transformation. For example: for volatility, I want to fit a transformer for AAPL, MSFT... etc - and then apply that transformation to that partition of the data.
Current status: I currently use compose.make_column_transformer, but this only applies a single transformer to the entire volatility column. It does not allow the data to be partitioned so that individual transformers are fit and applied to each partition.
Research: I've done some research and come across sklearn.preprocessing.FunctionTransformer, which seems to be a building block I could use, but I haven't figured out how.
Main question: What is the best way to build a sklearn pipeline that can fit a transformer to a partition (ie. groupby) within a single column? Any code pointers would be great. TY
Example dataset:
Date | Ticker | Volatility | transformed_vol |
---|---|---|---|
01/01/18 | AAPL | X | A(X) |
01/02/18 | AAPL | X | A(X) |
... | AAPL | X | A(X) |
12/30/22 | AAPL | X | A(X) |
12/31/22 | AAPL | X | A(X) |
01/01/18 | GOOG | X | B(X) |
01/02/18 | GOOG | X | B(X) |
... | GOOG | X | B(X) |
12/30/22 | GOOG | X | B(X) |
12/31/22 | GOOG | X | B(X) |
I don't think this is doable in an "elegant" way using scikit-learn's built-in functionality, simply because transformers are applied to the whole column. However, you could use FunctionTransformer (as you correctly point out) to work around this limitation.
I am using the following example:
print(df)
Ticker Volatility OtherCol
0 AAPL 0 1
1 AAPL 1 1
2 AAPL 2 1
3 AAPL 3 1
4 AAPL 4 1
5 GOOG 5 1
6 GOOG 6 1
7 GOOG 7 1
8 GOOG 8 1
9 GOOG 9 1
I added another column just to demonstrate.
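The post only shows the printed frame, so here is a minimal sketch of how it could be rebuilt. The column names and values are taken from the output above; the construction itself is an assumption, not the original code.

```python
import pandas as pd

# Rebuild the example frame shown above: two tickers, five rows each.
df = pd.DataFrame({
    "Ticker": ["AAPL"] * 5 + ["GOOG"] * 5,
    "Volatility": range(10),
    "OtherCol": [1] * 10,
})
print(df)
```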
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
# The index should dictate the groups along the column.
df = df.set_index('Ticker')
def A(x):
    return x*x
def B(x):
    return 2*x
def C(x):
    return 10*x
# Map groups to function. A dict for each column and each group in the index.
f_dict = {'Volatility': {'AAPL': A, 'GOOG': B}, 'OtherCol': {'AAPL': A, 'GOOG': C}}
def pick_transform(df):
    # ColumnTransformer passes one column at a time. Look up the function by
    # column name and by the group's index label, then apply it to that group.
    return df.groupby(df.index).apply(lambda g: f_dict[g.columns[0]][g.index[0]](g))
ct = ColumnTransformer(
    [(f'transformed_{col}', FunctionTransformer(func=pick_transform), [col])
     for col in f_dict]
)
df[[f'transformed_{col}' for col in f_dict]] = ct.fit_transform(df)
print(df)
print(df)
Which results in:
Volatility OtherCol transformed_Volatility transformed_OtherCol
Ticker
AAPL 0 1 0 1
AAPL 1 1 1 1
AAPL 2 1 4 1
AAPL 3 1 9 1
AAPL 4 1 16 1
GOOG 5 1 10 10
GOOG 6 1 12 10
GOOG 7 1 14 10
GOOG 8 1 16 10
GOOG 9 1 18 10
You can add other columns to f_dict, and the list comprehension will create a transformer for each of them.
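The functions A, B and C above are stateless, whereas the question asks for a transformer that is fit per ticker. One way to sketch that is a small custom estimator that keeps one fitted clone per index label. GroupwiseTransformer is a hypothetical name, not part of scikit-learn, and StandardScaler is only a stand-in for whatever transformer you want per group. It assumes, as above, that the group label is in the index. It writes results back by row position, so rows do not need to be sorted by ticker.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import StandardScaler


class GroupwiseTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: fit one clone of `transformer` per index label."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        # One independently fitted clone per group (here: per ticker).
        self.transformers_ = {
            key: clone(self.transformer).fit(X.loc[X.index == key])
            for key in X.index.unique()
        }
        return self

    def transform(self, X):
        out = np.empty(X.shape, dtype=float)
        for key in X.index.unique():
            # Row positions of this group, so the original row order is kept.
            rows = np.flatnonzero(X.index == key)
            out[rows] = self.transformers_[key].transform(X.iloc[rows])
        return out


# Small demo with interleaved tickers (made-up values).
demo = pd.DataFrame(
    {"Volatility": [0.0, 1.0, 2.0, 5.0, 7.0, 9.0]},
    index=pd.Index(["AAPL", "GOOG", "AAPL", "GOOG", "AAPL", "GOOG"], name="Ticker"),
)
scaled = GroupwiseTransformer(StandardScaler()).fit_transform(demo)
print(scaled)
```

Because ColumnTransformer hands each transformer a DataFrame slice with the index intact, an instance of this class should be usable in place of the FunctionTransformer above. A ticker that was not seen during fit raises a KeyError in transform, which you may want to handle explicitly.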