I'm attempting to follow this tutorial on my own dataset. After binarizing my data, I tried to run the Binary Relevance classifier from the scikit-multilearn package, but got the error: "The number of classes has to be greater than one; got 1 class".
These are the suggestions I've tried, with links:
Getting rid of categories with only one instance. This took my data from 34 labels to 32, and I made sure to drop the two columns for those genres from the binarized label matrix. I also exploded the genres column (from one delimited string per movie to one genre per row) so that I could drop the rows containing the sparsely seen genres, roughly as sketched below.
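For reference, the pruning looked roughly like this (a minimal sketch; the file name and the "|" delimiter are placeholders for my actual data):

```python
import pandas as pd

# Placeholder file name and delimiter; my real data differs.
movies = pd.read_csv('movies.csv')

# Explode the delimited genre string into one row per genre.
movies['genre'] = movies['genre'].str.split('|')
movies = movies.explode('genre')

# Find genres that appear only once and drop the rows that use them.
counts = movies['genre'].value_counts()
rare = counts[counts < 2].index  # the two singleton genres
movies = movies[~movies['genre'].isin(rare)]
```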
Since I exploded the column, I could use a stratified train/test split like you see here:
from sklearn.model_selection import train_test_split
train, test = train_test_split(movies, random_state=42, train_size=20000, test_size=1000, shuffle=True, stratify=movies['genre'])
I checked the number of unique genres using len(np.unique(train['genre'])), which returned 32. I also checked whether np.unique(y_train) returned 0 and 1, which it did, so the label matrix as a whole does not contain just one class.
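In hindsight, that whole-matrix check can hide the real problem: Binary Relevance fits one binary classifier per label column, so every column of y_train has to contain both classes on its own. A quick per-column check (a sketch, assuming y_train is a dense numpy array; convert it first with .toarray() if it's sparse):

```python
import numpy as np

# Binary Relevance trains one classifier per label column,
# so each column must contain both 0 and 1 in the training split.
for j in range(y_train.shape[1]):
    classes = np.unique(y_train[:, j])
    if len(classes) < 2:
        print(f'label column {j} has only one class: {classes}')
```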
(EDIT) I also checked the shapes of my x_train and y_train and got x_train.shape = (20000, 10000) (10,000 is my maximum number of features) and y_train.shape = (20000, 32).
I'm beginning to think that the sparser categories are the issue, and not the code. I have over 300,000 rows, but my smallest categories have only 6 instances. Is it just not possible to use Binary Relevance to make predictions with such sparse classes, or is there another potential solution I'm missing?
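One thing worth noting here: stratifying on the exploded genre column only balances one genre at a time, not the full label matrix, so a genre with just 6 positives can still land entirely outside the 20,000-row training sample and leave that column single-class. scikit-multilearn ships an iterative stratified split that balances every label across the folds (a sketch; x and y stand for my feature matrix and binarized label matrix, and test_size is an example value):

```python
from skmultilearn.model_selection import iterative_train_test_split

# Iterative stratification spreads each label's positives across the
# splits, so even a genre with 6 instances keeps some in training.
x_train, y_train, x_test, y_test = iterative_train_test_split(
    x, y, test_size=0.05
)
```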
The issue is with scikit-multilearn. It is not compatible with my version of Python (3.11) and does not integrate well with newer versions of numpy and scipy. Using scikit-multilearn-ng solved this issue.
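For anyone hitting the same thing, the swap looked roughly like this (a sketch; GaussianNB is just a placeholder base classifier, and as far as I can tell scikit-multilearn-ng is a drop-in fork that keeps the original skmultilearn namespace):

```python
# pip install scikit-multilearn-ng   (instead of scikit-multilearn)
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import GaussianNB

# require_dense=[True, True] converts sparse inputs for the base
# classifier; GaussianNB is a stand-in for any sklearn classifier.
classifier = BinaryRelevance(classifier=GaussianNB(), require_dense=[True, True])
classifier.fit(x_train, y_train)
predictions = classifier.predict(x_test)
```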