python pandas dataframe scikit-learn sklearn-pandas

Python sklearn-pandas Transform Multiple Columns at the same time error

I am using python with pandas and sklearn and trying to use the new and very convenient sklearn-pandas.

I have a big data frame and need to transform multiple columns in a similar way.

I have multiple column names in the variable other. The documentation states explicitly that multiple columns can be transformed with the same transformation, but the following code does not behave as expected:

from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([[other[0],other[1]],LabelEncoder()])
mapper.fit_transform(df.copy())

I get the following error:

raise ValueError("bad input shape {0}".format(shape))
ValueError: ['EFW', 'BPD']: bad input shape (154, 2)

When I use the following code, it works great:

cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)]
mapper = DataFrameMapper(cols)
mapper.fit_transform(df.copy())

To my understanding, both should work and yield the same results. What am I doing wrong here?

Thanks!

Solution

The problem you encounter here is that the two snippets build completely different data structures.

cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)] builds a list of tuples. Do note that you can shorten this line of code to:

cols = [(col, LabelEncoder()) for col in other]

The first snippet, [[other[0],other[1]],LabelEncoder()], on the other hand, results in a list containing two elements: a list and a LabelEncoder instance. The documentation does say that you can transform multiple columns, and it shows how to specify them:

Transformations may require multiple input columns. In these cases, the column names can be specified in a list:

mapper2 = DataFrameMapper([ (['children', 'salary'], sklearn.decomposition.PCA(1)) ])

This is a list containing tuple(list, object) structured elements, not list[list, object] structured elements.
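To make the structural difference concrete, here is a minimal sketch in plain Python. The column names are taken from the error message in the question, and the string "LabelEncoder()" is only a placeholder standing in for a real transformer instance, so nothing beyond the standard library is needed:

```python
# Column names as in the question's error message; the transformer is a
# placeholder string here, not a real LabelEncoder instance.
other = ["EFW", "BPD"]
transformer = "LabelEncoder()"

# Shape of the first snippet: the outer list holds two separate items,
# a list of column names and the transformer.
flat = [[other[0], other[1]], transformer]

# Shape shown in the documentation: the outer list holds ONE tuple of
# (list of column names, transformer).
nested = [([other[0], other[1]], transformer)]

flat_types = [type(item).__name__ for item in flat]
nested_types = [type(item).__name__ for item in nested]

print(len(flat), flat_types)      # 2 ['list', 'str']
print(len(nested), nested_types)  # 1 ['tuple']
```

The outer list in the first case has two unrelated elements, while in the documented form it has a single feature definition.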

The source code itself says the same thing:

class DataFrameMapper(BaseEstimator, TransformerMixin):
    """
    Map Pandas data frame column subsets to their own
    sklearn transformation.
    """

    def __init__(self, features, default=False, sparse=False, df_out=False,
                 input_df=False):
        """
        Params:
        features    a list of tuples with features definitions.
                    The first element is the pandas column selector. This can
                    be a string (for one column) or a list of strings.
                    The second element is an object that supports
                    sklearn's transform interface, or a list of such objects.
                    The third element is optional and, if present, must be
                    a dictionary with the options to apply to the
                    transformation. Example: {'alias': 'day_of_week'}

It is also clearly stated in the class definition that the features argument to DataFrameMapper is required to be a list of tuples, where the elements of the tuple may be lists.

As a last note, here is why you actually get your error message: the LabelEncoder transformer in sklearn is meant for encoding labels, so it only accepts 1D arrays. As such, it is fundamentally unable to handle 2 columns at once and will raise an exception. So, if you want to use the LabelEncoder, you will have to build N tuples, each with one column name and a transformer, where N is the number of columns you wish to transform.
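The 1D restriction can be checked without sklearn-pandas at all. The following is a rough sketch that assumes numpy and scikit-learn are installed; the column values are made up for illustration and are not from your data frame:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in data: two categorical columns with three rows each.
columns = {
    "EFW": np.array(["low", "high", "low"]),
    "BPD": np.array(["b", "a", "c"]),
}

# One encoder per column works, because every input is a 1D array.
encoded = {name: LabelEncoder().fit_transform(values)
           for name, values in columns.items()}

# Stacking both columns into a (3, 2) array and handing it to a single
# encoder fails with a ValueError, as in the question.
two_d = np.column_stack([columns["EFW"], columns["BPD"]])
try:
    LabelEncoder().fit_transform(two_d)
    raised = False
except ValueError:
    raised = True

# The equivalent feature definition for DataFrameMapper:
# N tuples, each pairing one column name with its own encoder.
features = [(name, LabelEncoder()) for name in columns]
```

A list shaped like features is what the working snippet in the question passes to DataFrameMapper.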