python classification huggingface-transformers

Hugging Face Transformers classification: num_labels 1 vs 2

question 1)

The answer to this question suggested that for a binary classification problem I could set num_labels to 1 (positive or not) or 2 (positive and negative). Is there any guideline on which setting is better? It seems that with 1 the probability would be calculated using a sigmoid function, and with 2 the probabilities would be calculated using a softmax function.

question 2)

In both cases, are my y labels going to be the same? Will each data point have a label of 0 or 1 rather than a one-hot encoding? For example, if I have 2 data points, would y be 0, 1 and not [1,0], [0,1]?

I have a very unbalanced classification problem where class 1 is present only 2% of the time. I am oversampling in my training data.

question 3)

My data is in a pandas DataFrame, and I am converting it to a dataset and creating the y variable using the code below. How should I cast my y column (label) if I am planning to use num_labels=1?

`train_dataset=Dataset.from_pandas(train_df).cast_column("label", ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None))`

Solution

This is probably a bit late, but I want to point out one thing: according to the Hugging Face code, if you set num_labels = 1 and do not set problem_type explicitly, the model is treated as a regression model and the loss function is set to MSELoss(). You can find the code here.
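To make that branch concrete, here is a small plain-Python paraphrase of how I understand the sequence-classification heads pick a loss when `problem_type` is left unset. The names `infer_problem_type`, `LOSS_FOR_PROBLEM_TYPE`, and `loss_name` are made up for illustration and are not part of the library, so check the source of the version you have installed.

```python
def infer_problem_type(num_labels: int, labels_are_integer: bool) -> str:
    """Hypothetical paraphrase of the fallback used when problem_type is unset."""
    if num_labels == 1:
        # A single output unit is treated as a regression target.
        return "regression"
    if num_labels > 1 and labels_are_integer:
        # Integer class indices: ordinary single-label classification.
        return "single_label_classification"
    # Anything else (e.g. float multi-hot vectors) is treated as multi-label.
    return "multi_label_classification"


# Loss class each problem type maps to (names of torch.nn loss classes).
LOSS_FOR_PROBLEM_TYPE = {
    "regression": "MSELoss",
    "single_label_classification": "CrossEntropyLoss",
    "multi_label_classification": "BCEWithLogitsLoss",
}


def loss_name(num_labels: int, labels_are_integer: bool = True) -> str:
    return LOSS_FOR_PROBLEM_TYPE[infer_problem_type(num_labels, labels_are_integer)]


print(loss_name(1))  # MSELoss
print(loss_name(2))  # CrossEntropyLoss
```

So with num_labels = 1 you would be fitting 0/1 targets with a squared-error loss rather than training a sigmoid classifier, which is probably not what the question intended.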

Also, in their own tutorial, for a binary classification problem (IMDB, positive vs. negative), they set num_labels = 2.

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Here is the link.
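As for questions 1 and 2, a quick sanity check in plain Python (no Transformers or PyTorch needed) may help. It is only a sketch with made-up logits: it shows that a softmax over two logits gives the same positive-class probability as a sigmoid applied to the difference of the logits, and that cross-entropy is computed from an integer class index (0 or 1), not from a one-hot vector.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(logits, label):
    # `label` is an integer class index, used to pick one probability.
    return -math.log(softmax(logits)[label])


logits = [0.3, 1.7]  # made-up outputs of a 2-label head: [neg, pos]

p_pos_softmax = softmax(logits)[1]
p_pos_sigmoid = sigmoid(logits[1] - logits[0])

# Both routes give the same probability for the positive class.
assert math.isclose(p_pos_softmax, p_pos_sigmoid)

# The loss is lower when the integer label matches the larger logit.
assert cross_entropy(logits, 1) < cross_entropy(logits, 0)
```

In other words, num_labels = 2 with labels stored as 0 or 1 already covers the sigmoid-style binary case, so the ClassLabel cast from the question can stay as it is.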