I spent a painful night debugging this:
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
x_trainvec_rand, y_train_rand = RandomOverSampler(random_state=0).fit_resample(pd.DataFrame.sparse.from_spmatrix(x_trainvec), y_train)
print(x_trainvec_rand)
where x_trainvec is a CSR sparse matrix and y_train is a pandas DataFrame. As DataFrames, their dimensions are (75060 x 52651) and (75060 x 1). The call fails with the error 'ValueError: Shape of passed values is (290210, 1), indices imply (290210, 52651)'.
Then I suddenly decided to try just
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
x_trainvec_rand, y_train_rand = RandomOverSampler(random_state=0).fit_resample(x_trainvec, y_train)
print(x_trainvec_rand)
and somehow it worked.
Any ideas as to why?
Documentation says:
fit_resample(X, y)[source]
Resample the dataset.
Parameters
X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)
Matrix containing the data which have to be sampled.
y : array-like of shape (n_samples,)
Corresponding label for each sample in X.
The documentation says it accepts
X : {array-like, dataframe, sparse matrix}
That's a sparse matrix, not a sparse dataframe. In the imbalanced-learn source I found tests that the sparse type had to be csr or csc, but I couldn't follow the further processing.
But let's look at the pandas sparse dataframe.
A sparse matrix:
In [105]: M = sparse.csr_matrix(np.eye(3))
In [106]: M
Out[106]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>
In [107]: print(M)
(0, 0) 1.0
(1, 1) 1.0
(2, 2) 1.0
The derived dataframe:
In [108]: df = pd.DataFrame.sparse.from_spmatrix(M)
In [109]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 3 non-null Sparse[float64, 0]
1 1 3 non-null Sparse[float64, 0]
2 2 3 non-null Sparse[float64, 0]
dtypes: Sparse[float64, 0](3)
memory usage: 164.0 bytes
In [110]: df[1]
Out[110]:
0 0.0
1 1.0
2 0.0
Name: 1, dtype: Sparse[float64, 0]
In [111]: df[1].values
Out[111]:
[0, 1.0, 0]
Fill: 0
IntIndex
Indices: array([1], dtype=int32)
The sparse dataframe storage is entirely different from the sparse matrix. It's not a simple merger of the two classes.
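If you do end up with a sparse dataframe, it can be converted back to a scipy matrix before being passed to code that expects one. A small sketch using the same M as above (the names col and M_back are only for illustration):

```python
import numpy as np
import pandas as pd
from scipy import sparse

M = sparse.csr_matrix(np.eye(3))
df = pd.DataFrame.sparse.from_spmatrix(M)

# Each column is a separate SparseArray with its own values and index.
col = df[1].array
print(type(col).__name__, col.sp_values, col.sp_index)

# Rebuild a scipy matrix: the sparse accessor returns COO, then convert to CSR.
M_back = df.sparse.to_coo().tocsr()
print((M_back != M).nnz == 0)  # True when the round trip preserved every entry
```

The per-column layout is why the dataframe is not a drop-in replacement for the matrix: the matrix has one set of index arrays for the whole thing, while the dataframe keeps one sparse array per column.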
I probably should have insisted on seeing the FULL traceback for the error,
ValueError: Shape of passed values is (290210, 1), indices imply (290210, 52651)
At least it might give us/you an idea of what it is trying to do. But on the other hand, focusing on what the documentation ACTUALLY says, rather than what you want it to say, is enough.