I am using python with pandas
and sklearn
and trying to use the new and very convenient sklearn-pandas
.
I have a big data frame and need to transform multiple columns in a similar way.
I have multiple column names in the variable other
the source code documentation here
states explicitly there is a possibility of transforming multiple columns with the same transformation, but the following code does not behave as expected:
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
mapper = DataFrameMapper([[other[0],other[1]],LabelEncoder()])
mapper.fit_transform(df.copy())
I get the following error:
raise ValueError("bad input shape {0}".format(shape)) ValueError: ['EFW', 'BPD']: bad input shape (154, 2)
When I use the following code, it works great:
cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)]
mapper = DataFrameMapper(cols)
mapper.fit_transform(df.copy())
To my understanding, both should work well and yield same results. What am I doing wrong here?
Thanks!
The problem you encounter here, is that the two snippets of code are completely different in terms of data structure.
cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)]
builds a list of tuples. Do note that you can shorten this line of code to:
cols = [(col, LabelEncoder()) for col in other]
Anyway, the first snippet, [[other[0],other[1]],LabelEncoder()]
results in a list containing two elements: a list and a LabelEncoder
instance. Now, it is documented that you can transform multiple columns through specifying:
Transformations may require multiple input columns. In these cases, the column names can be specified in a list:
mapper2 = DataFrameMapper([ (['children', 'salary'], sklearn.decomposition.PCA(1)) ])
This is a list
containing tuple(list, object)
structured elements, not list[list, object]
structured elements.
If we take a look at the source code itself,
class DataFrameMapper(BaseEstimator, TransformerMixin):
"""
Map Pandas data frame column subsets to their own
sklearn transformation.
"""
def __init__(self, features, default=False, sparse=False, df_out=False,
input_df=False):
"""
Params:
features a list of tuples with features definitions.
The first element is the pandas column selector. This can
be a string (for one column) or a list of strings.
The second element is an object that supports
sklearn's transform interface, or a list of such objects.
The third element is optional and, if present, must be
a dictionary with the options to apply to the
transformation. Example: {'alias': 'day_of_week'}
It is also clearly stated in the class definition that the features argument to DataFrameMapper
is required to be a list of tuples, where the elements of the tuple may be lists.
As a last note, as to why you actually get your error message: The LabelEncoder
transformer in sklearn
is meant for labeling purposes on 1D arrays. As such, it is fundamentally unable to handle 2 columns at once and will raise an Exception. So, if you want to use the LabelEncoder
, you will have to build N tuples with 1 columnname and the transformer where N is the amount of columns you wish to transform.