Tags: python, pandas, dataframe, scipy, imblearn

Why does the pandas DataFrame version of a sparse matrix not work with RandomOverSampler from imblearn when the documentation says it accepts both?


I spent a painful night debugging this:

import pandas as pd
from imblearn.over_sampling import RandomOverSampler


x_trainvec_rand, y_train_rand = RandomOverSampler(random_state=0).fit_resample(pd.DataFrame.sparse.from_spmatrix(x_trainvec), y_train)   

print(x_trainvec_rand)

Here x_trainvec is a CSR sparse matrix and y_train is a pandas DataFrame; as DataFrames their shapes are (75060 x 52651) and (75060 x 1). The call fails with 'ValueError: Shape of passed values is (290210, 1), indices imply (290210, 52651)'.

Then I decided to try passing the sparse matrix directly:

import pandas as pd
from imblearn.over_sampling import RandomOverSampler


x_trainvec_rand, y_train_rand = RandomOverSampler(random_state=0).fit_resample(x_trainvec, y_train)   

print(x_trainvec_rand)

and somehow it worked.

Any ideas as to why?

Documentation says:

fit_resample(X, y)
Resample the dataset.

Parameters
X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)
Matrix containing the data which have to be sampled.

y : array-like of shape (n_samples,)
Corresponding label for each sample in X.

Solution

  • The documentation says it accepts

    X : {array-like, dataframe, sparse matrix}
    

    That's a sparse matrix, not a sparse dataframe. In the imbalanced-learn source I found checks that the sparse format has to be csr or csc, but I couldn't follow the further processing.

    But let's look at the pandas sparse dataframe.

    A sparse matrix:

    In [105]: M = sparse.csr_matrix(np.eye(3))
    In [106]: M
    Out[106]: 
    <3x3 sparse matrix of type '<class 'numpy.float64'>'
        with 3 stored elements in Compressed Sparse Row format>
    In [107]: print(M)
      (0, 0)    1.0
      (1, 1)    1.0
      (2, 2)    1.0
    

    The derived dataframe:

    In [108]: df = pd.DataFrame.sparse.from_spmatrix(M)
    In [109]: df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3 entries, 0 to 2
    Data columns (total 3 columns):
     #   Column  Non-Null Count  Dtype             
    ---  ------  --------------  -----             
     0   0       3 non-null      Sparse[float64, 0]
     1   1       3 non-null      Sparse[float64, 0]
     2   2       3 non-null      Sparse[float64, 0]
    dtypes: Sparse[float64, 0](3)
    memory usage: 164.0 bytes
    In [110]: df[1]
    Out[110]: 
    0    0.0
    1    1.0
    2    0.0
    Name: 1, dtype: Sparse[float64, 0]
    In [111]: df[1].values
    Out[111]: 
    [0, 1.0, 0]
    Fill: 0
    IntIndex
    Indices: array([1], dtype=int32)
    

    The sparse dataframe storage is entirely different from the sparse matrix: each column is a separate sparse array with its own fill value and indices. It's not a simple merger of the two classes.
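
To put the two layouts side by side, here is a minimal sketch (assuming NumPy, SciPy and pandas are installed; the variable names are only illustrative). It inspects one column of the dataframe through the `SparseArray` attributes `sp_values` and `fill_value`, and compares that with the three arrays a CSR matrix keeps for the whole matrix:

```python
import numpy as np
import pandas as pd
from scipy import sparse

M = sparse.csr_matrix(np.eye(3))
df = pd.DataFrame.sparse.from_spmatrix(M)

# pandas: every column is its own SparseArray with its own stored values
col = df[1].array
print(type(col).__name__)
print(col.sp_values, col.fill_value)

# SciPy CSR: one set of data/indices/indptr arrays for the entire matrix
print(M.data, M.indices, M.indptr)
```

With 52651 columns, the dataframe therefore holds 52651 separate sparse arrays rather than one matrix object.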

    I probably should have insisted on seeing the FULL traceback for the error,

     ValueError: Shape of passed values is (290210, 1), indices imply (290210, 52651)
    

    At least it might give us an idea of what it is trying to do. On the other hand, focusing on what the documentation ACTUALLY says, rather than what you want it to say, is enough.
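
If the data already lives in a sparse dataframe, one possible workaround is to convert it back to a CSR matrix before resampling. This is only a sketch: the helper name `to_csr` is hypothetical, and the `RandomOverSampler` call is left as a comment because it simply repeats the call the question reports as working.

```python
import numpy as np
import pandas as pd
from scipy import sparse

def to_csr(X):
    """Return X as a SciPy CSR matrix (hypothetical helper)."""
    if isinstance(X, pd.DataFrame):
        # .sparse.to_coo() requires every column to have a Sparse dtype
        return X.sparse.to_coo().tocsr()
    return sparse.csr_matrix(X)

original = sparse.csr_matrix(np.eye(3))
df = pd.DataFrame.sparse.from_spmatrix(original)
X = to_csr(df)
print(X.format, X.shape)

# Then resample the matrix rather than the dataframe, as in the question:
# from imblearn.over_sampling import RandomOverSampler
# X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y_train)
```

The round trip keeps the data sparse throughout, so it avoids materializing a dense 75060 x 52651 array.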