python scikit-learn feature-selection

FeatureAgglomeration: feature_names_in and get_feature_names_out


I used FeatureAgglomeration to cluster my 105x105 dataframe into 40 clusters based on Spearman correlation. Now I want to get the output feature names using feature_names_in_ and get_feature_names_out, but it does not seem to work, and I cannot find a solution. This is my code:

    import pandas as pd
    import numpy as np
    from sklearn.cluster import FeatureAgglomeration
    features = np.array([...])
    print(features.shape)
    >>> (105,)
    Class1_rank=pd.read_excel(r'H:\PycharmProjects\RadiomicsPipeline\Class1_rank.xlsx')
    print(Class1_rank)
    >>>                         original_shape_Elongation  ...  original_ngtdm_Strength
    original_shape_Elongation        1.000000  ...                -0.054310
    original_shape_Flatness          0.616327  ...                -0.019544
    original_shape_LeastAxisLength   0.271645  ...                -0.293157
    >>> [105 rows x 105 columns]
    print(agglo.n_features_in_)
    >>> 105
    print(agglo.feature_names_in_(Class1_rank))
    print(agglo.get_feature_names_out())
    df_reduced = agglo.transform(Class1)

At print(agglo.feature_names_in_(Class1_rank)) I get the following error:

TypeError: 'numpy.ndarray' object is not callable

However, Class1_rank is a DataFrame, so it should not give that error, should it? What am I doing wrong here?

What I have tried:

  1. Commenting out print(agglo.feature_names_in_(Class1_rank)). This runs, but then print(agglo.get_feature_names_out()) gives the following result, not the names of the features I included.

    ['featureagglomeration0' 'featureagglomeration1' 'featureagglomeration2' 'featureagglomeration3' 'featureagglomeration4'....]

  2. Using features as input for both functions. This gives the same error.

  3. Inserting the features as strings for Class1_rank. This gives the same error.


Solution

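  • feature_names_in_ is an array, not a callable, so agglo.feature_names_in_ is correct, but adding parentheses after it (empty or not) is incorrect. That is what raises the TypeError: the parentheses try to call the NumPy array, regardless of what is passed inside them.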

    get_feature_names_out() returns one name per output cluster. Clusters are not in one-to-one correspondence with input features, so it cannot give you something like the original feature names. You can use the labels_ attribute to find which input features go into which output features, see e.g. this answer.