python pandas scikit-learn sklearn-pandas

pandas Selecting columns on the basis of dtype

I have a pandas DataFrame df with many columns. I only want to process the columns with object dtype. For that I tried:

import numpy as np
from sklearn.preprocessing import FunctionTransformer
get_cat=FunctionTransformer(lambda x:x if x.dtype==np.dtype(object) else None,validate=False)
get_cat.fit_transform(df)

but I get this error:

AttributeError: 'DataFrame' object has no attribute 'dtype'

But if I do the same operation with column names, as in

get_cat=FunctionTransformer(lambda x:x[[col_names]],validate=False)

it works fine. I am using FunctionTransformer to select the data inside an sklearn Pipeline for machine learning.

Solution

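The error occurs because FunctionTransformer passes the whole DataFrame to your function, and a DataFrame has no single dtype attribute; it has dtypes, one entry per column. I think it's easier and clearer to build a custom transformer, which can also be applied easily in a pipeline.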

It could look like this:

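from sklearn.base import BaseEstimator, TransformerMixin

class SelectDtypeColumnsTransfomer(BaseEstimator, TransformerMixin):
    """Keep only the DataFrame columns whose dtype equals `dtype`."""

    def __init__(self, dtype=object):
        self.dtype = dtype

    def transform(self, X, **transform_params):
        """ X : pandas DataFrame """
        # X.dtypes holds one dtype per column; keep the matching columns.
        columns = X.columns[X.dtypes == self.dtype]
        # Copy so that later steps cannot modify the caller's DataFrame.
        trans = X[columns].copy()
        return trans

    def fit(self, X, y=None, **fit_params):
        # Stateless: nothing is learned from the data.
        return self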

An example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A':[1, 2], 'B': ['s', 'd'], 'c':['test', 'r']})
print(SelectDtypeColumnsTransfomer(np.int64).transform(df))
   A
0  1
1  2
print(SelectDtypeColumnsTransfomer(object).transform(df))
   B     c
0  s  test
1  d     r
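
If you would rather keep FunctionTransformer, the check has to be done per column, because the function receives the whole DataFrame. The following is only a minimal sketch of that idea, assuming pandas' DataFrame.select_dtypes does the selection; the helper name select_object_columns is made up for illustration, and the string columns are created with an explicit object dtype so the example does not depend on how pandas infers string columns.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def select_object_columns(X):
    """Hypothetical helper: return only the object-dtype columns of X."""
    return X.select_dtypes(include=[object])


# validate=False keeps the input as a DataFrame instead of converting it.
get_cat = FunctionTransformer(select_object_columns, validate=False)

df = pd.DataFrame({
    'A': [1, 2],
    'B': pd.Series(['s', 'd'], dtype=object),
    'c': pd.Series(['test', 'r'], dtype=object),
})

cat_df = get_cat.fit_transform(df)
print(list(cat_df.columns))
```

A named function is used instead of a lambda so that the transformer can be pickled together with the rest of a pipeline.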

Concerning the use in pipelines:

You should ensure that the columns in the training and test sets have the same dtypes. Depending on how you preprocess the data, a column might be of type float in the training set (because it contains a NaN) and of type int in the test set (no NaN), or vice versa. In that case you need to adapt the fit method so that it fixes the selected columns during fitting, and take further care that the dtypes stay consistent in the following steps of the pipeline.
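
One way to fix the columns during fitting could look like the sketch below. It is an illustration under assumptions, not a definitive implementation: the class name FittedDtypeColumnSelector and the attribute columns_ are invented here, and casting back to the fitted dtype in transform is just one possible way to keep dtypes consistent.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FittedDtypeColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical variant: choose columns by dtype in fit, reuse them in transform."""

    def __init__(self, dtype=object):
        self.dtype = dtype

    def fit(self, X, y=None):
        # Remember which columns had the requested dtype at fit time.
        self.columns_ = list(X.columns[X.dtypes == self.dtype])
        return self

    def transform(self, X):
        missing = [c for c in self.columns_ if c not in X.columns]
        if missing:
            raise ValueError("Columns missing at transform time: %s" % missing)
        # Select the fitted columns and cast them back to the fitted dtype,
        # so later pipeline steps see the same dtypes as during training.
        return X[self.columns_].astype(self.dtype)


# Training set: 'A' is float64 because it contains a NaN.
train = pd.DataFrame({'A': [1.0, np.nan],
                      'B': pd.Series(['s', 'd'], dtype=object)})
# Test set: 'A' has no NaN and is therefore an integer column.
test = pd.DataFrame({'A': [3, 4],
                     'B': pd.Series(['x', 'y'], dtype=object)})

selector = FittedDtypeColumnSelector(np.float64).fit(train)
out = selector.transform(test)
print(out.dtypes)
```

With a selection done purely in transform, column A would be dropped from the test set here because it is an integer column there; remembering the columns in fit keeps the output layout identical for both sets.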