Tags: python, numpy, pandas, scikit-learn, grid-search

Can't set appropriate dtypes reading from a Numpy array


I would like to save some attributes of a dataframe so that, given a slice of the underlying numpy array, I can rebuild the dataframe as if I had sliced the dataframe itself. If an object column contains a value that can be coerced into a float, I can't find any method that works. In the real dataset I have millions of observations and several hundred columns.

The actual use case involves custom code where pandas interacts with scikit-learn. I know the latest build of scikit-learn has pandas compatibility built in, but I am unable to use that version because its RandomizedSearchCV object cannot handle large parameter grids (this will be fixed in a future version).

import pandas as pd

data = [[2, 4, "Focus"],
        [3, 4, "Fiesta"],
        [1, 4, "300"],
        [7, 3, "Pinto"]]

# This dataframe is exactly as intended
df = pd.DataFrame(data=data)

# Slice a subset of the underlying numpy array
raw_slice = df.values[1:, :]

# Try using the dtype option to force dtypes
df_dtype = pd.DataFrame(data=raw_slice, dtype=df.dtypes)
print("\n Dtype arg doesn't use passed dtypes \n", df_dtype.dtypes)

# Try converting objects to numeric after reading into dataframe
# (convert_objects is deprecated in later pandas; pd.to_numeric is
# the modern equivalent)
df_convert = pd.DataFrame(data=raw_slice).convert_objects(convert_numeric=True)
print("\n Convert objects drops object values that are not numeric \n", df_convert)
[Out]
 Dtype arg doesn't use passed dtypes 
0    object
1    object
2    object
dtype: object

 Convert objects drops object values that are not numeric 
   0  1    2
0  3  4  NaN
1  1  4  300
2  7  3  NaN

EDIT: Thank you @unutbu for the answer, which addressed my question precisely. In scikit-learn versions prior to 0.16.0, grid-search objects stripped the pandas dataframe down to its underlying numpy array. This meant a single object column made the entire array object dtype, so pandas methods could not be wrapped in custom transformers.
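To see why one object column poisons the whole array: `df.values` must choose a single dtype that can hold every column, so the numeric columns get upcast. A minimal illustration:

import pandas as pd

df = pd.DataFrame([[2, 4, "Focus"], [3, 4, "Fiesta"]])
print(df.dtypes)        # 0: int64, 1: int64, 2: object
print(df.values.dtype)  # object -- the int columns are upcast to object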

The solution, using @unutbu's answer, is to make the first step of the pipeline a custom "DataFrameTransformer" object.

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, X):
        # Remember the column names and dtypes of the original dataframe
        self.columns = list(X.columns)
        self.dtypes = X.dtypes

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Rebuild the dataframe from the raw array, then restore each
        # column's original dtype
        X = pd.DataFrame(X, columns=self.columns)
        for col, dtype in zip(X, self.dtypes):
            X[col] = X[col].astype(dtype)
        return X

In the pipeline, just pass your original df to the transformer's constructor:

pipeline = Pipeline([("df_converter", DataFrameTransformer(X)),
                     ...,
                     ("rf", RandomForestClassifier())])

Solution

  • If you are trying to save a slice of a DataFrame to disk, then a powerful and convenient way to do it is to use a pd.HDFStore. Note that this requires PyTables to be installed.

    # To save the slice `df.iloc[1:, :]` to disk:
    filename = '/tmp/test.h5'
    with pd.HDFStore(filename) as store:
        store['mydata'] = df.iloc[1:, :]
    
    # To load the DataFrame from disk:
    with pd.HDFStore(filename) as store:
        newdf2 = store['mydata']
        print(newdf2.dtypes)
        print(newdf2)
    

    yields

    0     int64
    1     int64
    2    object
    dtype: object
       0  1       2
    0  3  4  Fiesta
    1  1  4     300
    2  7  3   Pinto
    
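    If the real dataset has millions of rows, it may also be worth writing in HDFStore's table format, which can query row subsets on disk instead of loading the whole frame. A sketch using the same `df` and `filename` as above (`format='table'` is slower to write but queryable):

    with pd.HDFStore(filename) as store:
        store.put('mydata_table', df, format='table')
        # Pull only the rows with index >= 1, i.e. the slice of interest
        subset = store.select('mydata_table', where='index >= 1')
    print(subset.dtypes)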

    To reconstruct the sub-DataFrame from a NumPy array (of object dtype!) and df.dtypes, you could use

    import pandas as pd
    data = [[2, 4, "Focus"],
            [3, 4, "Fiesta",],
            [1, 4, "300"],
            [7, 3, "Pinto"]]
    
    # This dataframe is exactly as intended
    df = pd.DataFrame(data=data)
    
    # Slice a subset of the `values` numpy object array
    raw_slice = df.values[1:, :]
    
    newdf = pd.DataFrame(data=raw_slice)
    for col, dtype in zip(newdf, df.dtypes):
        newdf[col] = newdf[col].astype(dtype)
    print(newdf.dtypes)
    print(newdf)
    

    which yields the same result as above. However, if you are not saving raw_slice to disk, then you could simply keep a reference to df.iloc[1:, :] instead of converting the data to a NumPy array of object dtype -- a relatively inefficient data structure (in terms of both memory and performance).
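    For comparison, the direct slice keeps every column's dtype with no rebuilding step at all (a minimal check, reusing `df` from above):

    sub = df.iloc[1:, :]
    print(sub.dtypes)  # 0: int64, 1: int64, 2: object -- preserved automatically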