Tags: python, scikit-learn, finance, quantitative-finance

Fit/transform separate sklearn transformers to partitions of single column


Use case: I have time series data for multiple assets (e.g. AAPL, MSFT) and multiple features (e.g. MACD, volatility). I am building an ML model to make classification predictions on a subset of this data.

Problem: For each asset & feature - I want to fit and apply a transformation. For example: for volatility, I want to fit a transformer for AAPL, MSFT... etc - and then apply that transformation to that partition of the data.

Current status: I currently use compose.make_column_transformer but this only applies a single transformer to the entire column volatility and does not allow partitioning of the data & individual transformers to be fit/applied to these partitions.

Research: I've done some research and come across sklearn.preprocessing.FunctionTransformer, which seems to be a building block I could use, but I haven't figured out how.

Main question: What is the best way to build a sklearn pipeline that can fit a transformer to a partition (i.e. a groupby group) within a single column? Any code pointers would be great. TY

Example dataset:

Date      Ticker  Volatility  transformed_vol
01/01/18  AAPL    X           A(X)
01/02/18  AAPL    X           A(X)
...       AAPL    X           A(X)
12/30/22  AAPL    X           A(X)
12/31/22  AAPL    X           A(X)
01/01/18  GOOG    X           B(X)
01/02/18  GOOG    X           B(X)
...       GOOG    X           B(X)
12/30/22  GOOG    X           B(X)
12/31/22  GOOG    X           B(X)

Solution

  • I don't think this is doable in an "elegant" way using scikit-learn's built-in functionality, simply because transformers are applied to the whole column. However, one can use FunctionTransformer (as you correctly point out) to work around this limitation:

    I am using the following example:

    print(df)
    
      Ticker  Volatility  OtherCol
    0   AAPL           0         1
    1   AAPL           1         1
    2   AAPL           2         1
    3   AAPL           3         1
    4   AAPL           4         1
    5   GOOG           5         1
    6   GOOG           6         1
    7   GOOG           7         1
    8   GOOG           8         1
    9   GOOG           9         1
    

    I added another column just to demonstrate.
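To reproduce the printed frame, something like the following should work. The construction itself is an assumption, since only the printed output is shown above; the values are copied from that printout.

```python
import pandas as pd

# Rebuild the example frame shown above (values taken from the printout).
df = pd.DataFrame({
    "Ticker": ["AAPL"] * 5 + ["GOOG"] * 5,
    "Volatility": range(10),
    "OtherCol": [1] * 10,
})

print(df)
```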

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import FunctionTransformer
    
    # The index should dictate the groups along the column.
    df = df.set_index('Ticker')
    
    
    def A(x):
        return x*x
    
    
    def B(x):
        return 2*x
    
    
    def C(x):
        return 10*x
    
    
    # Map groups to function. A dict for each column and each group in the index.
    f_dict = {'Volatility': {'AAPL':A, 'GOOG':B}, 'OtherCol': {'AAPL':A, 'GOOG':C}}
    
    
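    def pick_transform(df):
        # df holds a single column. For each ticker group, look up the function
        # registered for that (column, ticker) pair in f_dict and apply it.
        return df.groupby(df.index).apply(lambda g: f_dict[g.columns[0]][g.index[0]](g))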
                       
    
    ct = ColumnTransformer(
                           [(f'transformed_{col}', FunctionTransformer(func=pick_transform), [col])
                            for col in f_dict]
                          )
    
    df[[f'transformed_{col}' for col in f_dict]] = ct.fit_transform(df)
    
    print(df)
    

    Which results in:

            Volatility  OtherCol  transformed_Volatility  transformed_OtherCol
    Ticker
    AAPL             0         1                       0                     1
    AAPL             1         1                       1                     1
    AAPL             2         1                       4                     1
    AAPL             3         1                       9                     1
    AAPL             4         1                      16                     1
    GOOG             5         1                      10                    10
    GOOG             6         1                      12                    10
    GOOG             7         1                      14                    10
    GOOG             8         1                      16                    10
    GOOG             9         1                      18                    10
    

    To handle more columns, add them to f_dict; the list comprehension then creates a transformer for each key.
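The functions A, B and C above are stateless, whereas the question asks for a transformer that is *fit* per ticker. One way to get that is a small custom estimator that keeps one fitted clone per group. The following is a minimal sketch, not a built-in scikit-learn feature: the class name `GroupwiseTransformer` and its `group_col` parameter are hypothetical, and it assumes the input is a DataFrame containing the group column plus numeric feature columns. StandardScaler is used only as an example of a stateful transformer.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import StandardScaler


class GroupwiseTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: fit one clone of `transformer` per group.

    Assumes X is a DataFrame that contains `group_col`; every other
    column is treated as a feature to transform.
    """

    def __init__(self, transformer, group_col):
        self.transformer = transformer
        self.group_col = group_col

    def fit(self, X, y=None):
        self.feature_cols_ = [c for c in X.columns if c != self.group_col]
        # One independently fitted clone per group value (e.g. per ticker).
        self.transformers_ = {
            key: clone(self.transformer).fit(part[self.feature_cols_])
            for key, part in X.groupby(self.group_col)
        }
        return self

    def transform(self, X):
        out = np.empty((len(X), len(self.feature_cols_)), dtype=float)
        # .indices maps each group value to the positional row numbers of
        # that group, so the output keeps the original row order.
        for key, rows in X.groupby(self.group_col).indices.items():
            # A group that was not seen during fit raises KeyError here.
            fitted = self.transformers_[key]
            out[rows] = fitted.transform(X.iloc[rows][self.feature_cols_])
        return out


df = pd.DataFrame({
    "Ticker": ["AAPL"] * 5 + ["GOOG"] * 5,
    "Volatility": np.arange(10, dtype=float),
})

gt = GroupwiseTransformer(StandardScaler(), group_col="Ticker")
df["transformed_vol"] = gt.fit_transform(df[["Ticker", "Volatility"]]).ravel()
print(df)
```

Because each ticker gets its own scaler, both groups end up standardized against their own mean and standard deviation. Since the class follows the usual fit/transform interface, it should also be usable as a step inside a ColumnTransformer, provided the group column is passed alongside the feature column, e.g. `["Ticker", "Volatility"]`.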