python pandas dataframe scikit-learn sklearn-pandas

Python sklearn-pandas Transform Multiple Columns at the same time error


I am using Python with pandas and sklearn, and I am trying to use the new and very convenient sklearn-pandas.

I have a big data frame and need to transform multiple columns in a similar way.

I have multiple column names in the variable other. The documentation states explicitly that multiple columns can be transformed with the same transformation, but the following code does not behave as expected:

from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([[other[0],other[1]],LabelEncoder()])
mapper.fit_transform(df.copy())

I get the following error:

raise ValueError("bad input shape {0}".format(shape))
ValueError: ['EFW', 'BPD']: bad input shape (154, 2)

When I use the following code, it works great:

cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)]
mapper = DataFrameMapper(cols)
mapper.fit_transform(df.copy())

To my understanding, both should work and yield the same results. What am I doing wrong here?

Thanks!


Solution

  • The problem you encounter here is that the two snippets build completely different data structures.

    cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)] builds a list of tuples. Do note that you can shorten this line of code to:

    cols = [(col, LabelEncoder()) for col in other]
    

    The first snippet, [[other[0],other[1]],LabelEncoder()], builds a list containing two elements: a list and a LabelEncoder instance. The documentation describes how to transform multiple columns with a single transformation:

    Transformations may require multiple input columns. In these cases, the column names can be specified in a list:

    mapper2 = DataFrameMapper([ (['children', 'salary'], sklearn.decomposition.PCA(1)) ])

    This is a list containing tuple(list, object) structured elements, not list[list, object] structured elements.
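To make the structural difference concrete, here is a minimal sketch in plain Python. The column names are borrowed from the error message, and `transformer` is only a stand-in object (an assumption for illustration, not a real sklearn transformer), so nothing here calls sklearn-pandas itself:

```python
other = ["EFW", "BPD"]   # hypothetical column names, taken from the error message
transformer = object()   # stand-in for a transformer such as LabelEncoder()

# Structure from the question: an outer list holding a list and a transformer.
as_written = [[other[0], other[1]], transformer]

# Structure from the documentation: an outer list holding one (columns, transformer) tuple.
as_documented = [([other[0], other[1]], transformer)]


def entry_types(features):
    """Return the type name of every top-level entry in a features list."""
    return [type(entry).__name__ for entry in features]


print(entry_types(as_written))     # ['list', 'object']
print(entry_types(as_documented))  # ['tuple']
```

Only the second form gives the mapper one feature definition per entry, which is what the features argument expects.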

    If we take a look at the source code itself,

    class DataFrameMapper(BaseEstimator, TransformerMixin):
        """
        Map Pandas data frame column subsets to their own
        sklearn transformation.
        """
    
        def __init__(self, features, default=False, sparse=False, df_out=False,
                     input_df=False):
            """
            Params:
            features    a list of tuples with features definitions.
                        The first element is the pandas column selector. This can
                        be a string (for one column) or a list of strings.
                        The second element is an object that supports
                        sklearn's transform interface, or a list of such objects.
                        The third element is optional and, if present, must be
                        a dictionary with the options to apply to the
                        transformation. Example: {'alias': 'day_of_week'}
    

    It is also clearly stated in the class definition that the features argument to DataFrameMapper is required to be a list of tuples, where the elements of the tuple may be lists.

    As a last note, here is why you actually get your error message: the LabelEncoder transformer in sklearn is meant for encoding labels in 1D arrays. As such, it cannot handle two columns at once and raises a ValueError. So, if you want to use LabelEncoder, you have to build N tuples, each with one column name and its own transformer, where N is the number of columns you wish to transform.
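The 1D restriction can be checked directly with LabelEncoder, without going through the mapper. The following is a rough sketch on a made-up three-row frame (the data and the `encoders` dictionary are assumptions for illustration, not part of the question):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; only the column names come from the error message.
df = pd.DataFrame({"EFW": ["low", "high", "low"], "BPD": ["a", "b", "b"]})
other = ["EFW", "BPD"]

# Passing both columns at once hands LabelEncoder a 2D array, which it rejects.
try:
    LabelEncoder().fit_transform(df[other].values)
    two_columns_ok = True
except ValueError:
    two_columns_ok = False

# One encoder per column: each call receives a 1D array.
encoders = {col: LabelEncoder() for col in other}
encoded = pd.DataFrame(
    {col: encoders[col].fit_transform(df[col]) for col in other}
)

print(two_columns_ok)
print(encoded)
```

Keeping one fitted encoder per column also means each column can be decoded later with its own inverse_transform.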