python huggingface-transformers few-shot-learning

SetFit training with a pandas DataFrame


I would like to train a few-shot classifier on a small annotated sample dataset.

I am following some tutorials, but since they all use their own data and the same pretrained model, I would like to confirm: is this the best approach?

Data example: 

import pandas as pd
from datasets import Dataset
    
# Sample feedback data, it will have 8 samples per label
feedback_dict = [
    {'text': 'The product is great and works well.', 'label': 'Product Performance'},
    {'text': 'I love the design of the product.', 'label': 'Product Design'},
    {'text': 'The product is difficult to use.', 'label': 'Usability'},
    {'text': 'The customer service was very helpful.', 'label': 'Customer Service'},
    {'text': 'The product was delivered on time.', 'label': 'Delivery Time'}
]

# Create a DataFrame with the feedback data
df = pd.DataFrame(feedback_dict)

# convert to Dataset format
df = Dataset.from_pandas(df)

With the data in this format, this is my approach for fine-tuning the model:

from setfit import SetFitModel, SetFitTrainer

# Select a model
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# training with Setfit
trainer = SetFitTrainer(
    model=model,
    train_dataset=df, # to keep the code simple I do not create the df_train
    eval_dataset=df, # to keep the code simple I do not create the df_eval
    column_mapping={"text": "text", "label": "label"} 
)

trainer.train()

The issue is that training never finishes, even after more than 500 hours on a laptop, although the dataset has only about 88 records across 11 labels.


Solution

  • I ran the example you posted on Google Colab, and the training took 37 seconds.

    Here's your code with some tweaks to make it work on Colab:

    %%capture
    # Install libraries (the %%capture cell magic has to be the first line of the cell)
    !pip install datasets setfit
    

    After installing the libraries, run the following code:

    ### Import dataset
    import pandas as pd
    from datasets import Dataset
    # Sample feedback data, it will have 8 samples per label
    feedback_dict = [
        {'text': 'The product is great and works well.', 'label': 'Product Performance'},
        {'text': 'I love the design of the product.', 'label': 'Product Design'},
        {'text': 'The product is difficult to use.', 'label': 'Usability'},
        {'text': 'The customer service was very helpful.', 'label': 'Customer Service'},
        {'text': 'The product was delivered on time.', 'label': 'Delivery Time'}
    ]
    # Create a DataFrame with the feedback data
    df = pd.DataFrame(feedback_dict)
    # convert to Dataset format
    df = Dataset.from_pandas(df)
    
    ### Run training
    from setfit import SetFitModel, SetFitTrainer
    # Select a model
    model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
    # training with Setfit
    trainer = SetFitTrainer(
        model=model,
        train_dataset=df, # to keep the code simple I do not create the df_train
        eval_dataset=df, # to keep the code simple I do not create the df_eval
        column_mapping={"text": "text", "label": "label"} 
    )
    trainer.train()
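
The code above reuses the same data for training and evaluation to stay short. As a minimal sketch of how a small hold-out set could be carved out first, the helper below splits a list of records per label using only the standard library. The name `split_per_label`, its parameters, and the placeholder records are hypothetical and not part of SetFit or `datasets`.

```python
import random
from collections import defaultdict


def split_per_label(records, n_eval_per_label=2, seed=42):
    """Hypothetical helper: hold out a few examples of every label for evaluation."""
    rng = random.Random(seed)

    # Group the records by their label.
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["label"]].append(rec)

    train, evaluation = [], []
    for items in by_label.values():
        items = items[:]  # copy so the caller's list is not reordered
        rng.shuffle(items)
        # Always keep at least one training example per label.
        n_eval = max(0, min(n_eval_per_label, len(items) - 1))
        evaluation.extend(items[:n_eval])
        train.extend(items[n_eval:])
    return train, evaluation


# Placeholder data shaped like the question's dataset: 11 labels, 8 samples each.
records = [
    {"text": f"placeholder text {i} for label {j}", "label": f"label_{j}"}
    for j in range(11)
    for i in range(8)
]
train_records, eval_records = split_per_label(records, n_eval_per_label=2)
print(len(train_records), len(eval_records))
```

Each resulting list can then go through `pd.DataFrame(...)` and `Dataset.from_pandas(...)` exactly as above, and be passed as `train_dataset` and `eval_dataset` respectively.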
    

    Finally, you can save the trained model to Google Drive and then download it to your PC manually.

    ### Save model to Drive
    from google.colab import drive
    drive.mount('/content/drive')
    # Use the public save_pretrained method rather than the private _save_pretrained
    trainer.model.save_pretrained('/content/drive/path/to/target/folder')
    

    If your main issue is the training time, this should fix it.
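
To get a feel for why a run like this should take seconds or minutes rather than hours, here is a rough back-of-the-envelope sketch of the training workload. It assumes the `SetFitTrainer` defaults are `num_iterations=20` and `batch_size=16`, and that each iteration generates one positive and one negative pair per sample; these are assumptions about the library's behaviour, and the function `estimate_setfit_steps` is a hypothetical helper, not a SetFit API.

```python
import math


def estimate_setfit_steps(n_samples, num_iterations=20, batch_size=16, num_epochs=1):
    """Hypothetical estimate of contrastive pairs and optimizer steps.

    Assumes one positive and one negative pair per sample per iteration.
    """
    n_pairs = 2 * n_samples * num_iterations
    steps = math.ceil(n_pairs / batch_size) * num_epochs
    return n_pairs, steps


# The 5-sample example from the question, and the full 88-record dataset.
for n in (5, 88):
    pairs, steps = estimate_setfit_steps(n)
    print(f"{n} samples -> {pairs} pairs, {steps} optimizer steps")
```

Under these assumptions, the full 88-record dataset comes to a few thousand sentence pairs and a few hundred optimizer steps, which is a small workload on a GPU. If training still runs for hours locally, the hardware (for example, CPU-only execution) is a more likely cause than the dataset size.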