python scikit-learn finance quantitative-finance

Fit/transform separate sklearn transformers to partitions of single column

Use case: I have time series data for multiple assets (e.g. AAPL, MSFT) and multiple features (e.g. MACD, volatility). I am building an ML model to make classification predictions on a subset of this data.

Problem: For each asset and feature, I want to fit and apply a separate transformation. For example, for volatility I want to fit one transformer on the AAPL rows, another on the MSFT rows, and so on, and then apply each transformer to its own partition of the data.

Current status: I currently use compose.make_column_transformer, but it applies a single transformer to the entire volatility column. It does not let me partition the data and fit/apply individual transformers to those partitions.

Research: I've come across sklearn.preprocessing.FunctionTransformer, which seems to be a building block I could use, but I haven't figured out how.

Main question: What is the best way to build a sklearn pipeline that can fit a transformer to each partition (i.e. groupby) within a single column? Any code pointers would be great. TY

Example dataset:

Date	Ticker	Volatility	transformed_vol
01/01/18	AAPL	X	A(X)
01/02/18	AAPL	X	A(X)
...	AAPL	X	A(X)
12/30/22	AAPL	X	A(X)
12/31/22	AAPL	X	A(X)
01/01/18	GOOG	X	B(X)
01/02/18	GOOG	X	B(X)
...	GOOG	X	B(X)
12/30/22	GOOG	X	B(X)
12/31/22	GOOG	X	B(X)

Solution

I don't think this is doable in an "elegant" way using scikit-learn's built-in functionality, simply because a transformer is applied to the whole column. However, one can use FunctionTransformer (as you correctly point out) to work around this limitation.

I am using the following example:

print(df)

  Ticker  Volatility  OtherCol
0   AAPL           0         1
1   AAPL           1         1
2   AAPL           2         1
3   AAPL           3         1
4   AAPL           4         1
5   GOOG           5         1
6   GOOG           6         1
7   GOOG           7         1
8   GOOG           8         1
9   GOOG           9         1

I added another column just to demonstrate.
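For anyone who wants to reproduce the steps, the frame printed above can be rebuilt with something like the following. This is only a sketch of the demo data; the values are the ones shown in the printout, not real market data.

```python
import pandas as pd

# Rebuild the demo frame shown above: two tickers, five rows each.
df = pd.DataFrame({
    "Ticker": ["AAPL"] * 5 + ["GOOG"] * 5,
    "Volatility": list(range(10)),
    "OtherCol": [1] * 10,
})
print(df)
```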

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# The index should dictate the groups along the column.
df = df.set_index('Ticker')


def A(x):
    return x*x


def B(x):
    return 2*x


def C(x):
    return 10*x


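# Map each column to a dict of {group: function}; groups are the index values.
f_dict = {
    'Volatility': {'AAPL': A, 'GOOG': B},
    'OtherCol': {'AAPL': A, 'GOOG': C},
}


def pick_transform(col_df):
    # col_df is the single-column frame that ColumnTransformer passes in.
    # Look up the function by column name and by the group's index value.
    # group_keys=False keeps the original index instead of prepending the group key.
    return col_df.groupby(col_df.index, group_keys=False).apply(
        lambda g: f_dict[g.columns[0]][g.index[0]](g)
    )


ct = ColumnTransformer(
    [(f'transformed_{col}', FunctionTransformer(func=pick_transform), [col])
     for col in f_dict]
)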

df[[f'transformed_{col}' for col in f_dict]] = ct.fit_transform(df)

print(df)

Which results in:

        Volatility  OtherCol  transformed_Volatility  transformed_OtherCol
Ticker                                                                    
AAPL             0         1                       0                     1
AAPL             1         1                       1                     1
AAPL             2         1                       4                     1
AAPL             3         1                       9                     1
AAPL             4         1                      16                     1
GOOG             5         1                      10                    10
GOOG             6         1                      12                    10
GOOG             7         1                      14                    10
GOOG             8         1                      16                    10
GOOG             9         1                      18                    10

To handle more columns, add them to f_dict; the list comprehension then creates one FunctionTransformer per column.
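Note that A, B and C are fixed functions, so nothing is actually fitted per group. If each ticker needs its own fitted transformer (say, a StandardScaler), one possible approach is a small custom estimator that clones a base transformer for every group. The sketch below is an assumption on my part, not built-in scikit-learn functionality: the class name GroupwiseTransformer and its group_col parameter are hypothetical, and it expects the grouping column to be passed in alongside the value column(s).

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler


class GroupwiseTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical helper: fits one clone of `transformer` per group."""

    def __init__(self, transformer, group_col):
        self.transformer = transformer
        self.group_col = group_col

    def fit(self, X, y=None):
        # Every column except the grouping column gets transformed.
        self.value_cols_ = [c for c in X.columns if c != self.group_col]
        # .indices maps each group key to the row positions of that group.
        groups = X.groupby(self.group_col).indices
        self.transformers_ = {
            key: clone(self.transformer).fit(X.iloc[rows][self.value_cols_])
            for key, rows in groups.items()
        }
        return self

    def transform(self, X):
        out = np.empty((len(X), len(self.value_cols_)), dtype=float)
        for key, rows in X.groupby(self.group_col).indices.items():
            # A group that was not seen during fit raises KeyError here.
            part = X.iloc[rows][self.value_cols_]
            out[rows] = self.transformers_[key].transform(part)
        return out


# Demo data: Ticker is kept as a regular column here, not as the index.
df = pd.DataFrame({
    "Ticker": ["AAPL"] * 5 + ["GOOG"] * 5,
    "Volatility": np.arange(10, dtype=float),
})

ct = ColumnTransformer([
    ("vol",
     GroupwiseTransformer(StandardScaler(), group_col="Ticker"),
     ["Ticker", "Volatility"]),
])

# Each ticker is scaled with its own mean and standard deviation.
scaled = ct.fit_transform(df)
print(scaled.ravel())
```

Because the rows are written back by position, the output keeps the input row order even when tickers are interleaved. Calling ct.transform on new data reuses the per-ticker transformers learned in fit.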