python, time-series, sktime

How to know from which interval of the input the features used in sktime's TimeSeriesForestClassifier are calculated


I used the sktime library's TimeSeriesForestClassifier class to perform multivariate time series classification.

The code is as follows:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from sktime.classification.compose import ColumnEnsembleClassifier
from sktime.classification.interval_based import TimeSeriesForestClassifier
from sktime.datasets import load_basic_motions
from sktime.transformations.panel.compose import ColumnConcatenator

X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

steps = [
    ("concatenate", ColumnConcatenator()),
    ("classify", TimeSeriesForestClassifier(n_estimators=100)),
]
clf = Pipeline(steps)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

I would like to inspect feature_importances_, but it is an array whose length equals the number of features, not the length of the input.

clf.steps[1][1].feature_importances_

I would like to know which part of the input each importance corresponds to. Is there any way to get information about which section of the input the TimeSeriesForestClassifier is calculating features from?


Solution

  • You can get the intervals (start and end index) for each tree of the ensemble from:

    clf.steps[1][1].intervals_
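
As a rough sketch of how such intervals could be tied back to importances, the snippet below spreads one tree's importances over the time steps its intervals cover. It rests on assumptions that should be checked against the installed sktime version: that each tree has a flat importance array with three features per interval in the order mean, standard deviation, slope, and that each interval row is a (start, end) pair with an exclusive end. The helper name `importance_per_time_step` and the toy numbers are made up for illustration.

```python
import numpy as np


def importance_per_time_step(intervals, importances, series_length, n_stats=3):
    """Spread one tree's feature importances over the time steps it used.

    intervals: shape (n_intervals, 2), rows are (start, end), end exclusive (assumed).
    importances: flat array of length n_stats * n_intervals, grouped per
        interval as (mean, std, slope) (assumed ordering).
    Returns an array of shape (series_length, n_stats).
    """
    intervals = np.asarray(intervals)
    per_interval = np.asarray(importances, dtype=float).reshape(len(intervals), n_stats)
    curve = np.zeros((series_length, n_stats))
    for (start, end), imp in zip(intervals, per_interval):
        # Every time step inside the interval receives that interval's importance.
        curve[start:end] += imp
    return curve


# Toy data: two overlapping intervals on a series of length 8.
toy_intervals = np.array([[0, 4], [2, 6]])
toy_importances = np.array([0.1, 0.0, 0.2, 0.3, 0.4, 0.0])
curve = importance_per_time_step(toy_intervals, toy_importances, series_length=8)
print(curve[:, 0])  # importance attributed to the "mean" feature at each time step
```

If the fitted forest stores one interval array and one fitted tree per estimator (an assumption about the sktime internals, not something confirmed here), summing this curve over all trees would give a per-time-step view of the whole ensemble.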
    

    sktime now also has an implementation of the newer Canonical Interval Forest.

    When we first implemented the Time Series Forest algorithm, we ended up with two versions. The one that you're using is the recommended one, but the older version provides its own functionality for the feature importance graph (see below).

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    
    from sktime.classification.compose import ColumnEnsembleClassifier
    from sktime.classification.compose import ComposableTimeSeriesForestClassifier
    from sktime.datasets import load_basic_motions
    from sktime.transformations.panel.compose import ColumnConcatenator
    
    X, y = load_basic_motions(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    
    steps = [
        ("concatenate", ColumnConcatenator()),
        ("classify", ComposableTimeSeriesForestClassifier(n_estimators=100)),
    ]
    clf = Pipeline(steps)
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)
    
    clf.steps[-1][-1].feature_importances_.rename(columns={"_slope": "slope"}).plot(xlabel="time", ylabel="feature importance")
    

    [Figure: feature importance plotted over time]
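
Because the pipeline concatenates the columns before classifying, a position on the time axis refers to the concatenated series rather than to a single input dimension. A minimal sketch of mapping a position back, assuming every dimension has the same length and the columns are joined end to end in column order (the helper name `locate_in_original` is hypothetical):

```python
def locate_in_original(position, series_length):
    """Map an index in the concatenated series to (dimension, time_step).

    Assumes all dimensions have `series_length` points and were joined
    one after another in column order.
    """
    dimension, time_step = divmod(position, series_length)
    return dimension, time_step


# Example: with 100 time steps per dimension, position 250 of the
# concatenated series falls in the third dimension (index 2) at time step 50.
print(locate_in_original(250, 100))  # (2, 50)
```

The same mapping can be applied to the start and end of an interval, keeping in mind that an interval on the concatenated series may straddle the boundary between two dimensions.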

    Be aware of some subtle issues in the calculation and interpretation of the feature importances. The relevant issues are here: