I'm fine-tuning a SetFit model on a French dataset, following the Hugging Face guide. The guide makes this point, which I didn't quite understand:
"🌎 Multilingual support: SetFit can be used with any Sentence Transformer on the Hub, which means you can classify text in multiple languages by simply fine-tuning a multilingual checkpoint."
Does that mean I must find an already fine-tuned French SetFit model when loading the model? As in, replace "paraphrase-mpnet-base-v2" below with a French one?
from setfit import SetFitModel

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
What the guide is saying is that multilingual models fine-tuned with the SetFit method generalize well even to languages they did not see during SetFit fine-tuning. This is generally true of multilingual language models, but it is worth stating explicitly, particularly for SetFit, a method that typically works with very small datasets (datasets that may well not be multilingual).
The finding is supported by the paper mentioned in the guide, where the researchers show that a model fine-tuned on English data using SetFit performs well on a variety of languages (see Table 4).
What I would take from it is this: if you take a multilingual checkpoint (e.g. sentence-transformers/paraphrase-multilingual-mpnet-base-v2) and fine-tune it on French, it will perform well on French and probably also on other languages. If you plan to use the fine-tuned model only on French texts, you can certainly try fine-tuning a specifically French model, but it's not true that you must do this.
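To make that concrete, here is a minimal sketch of fine-tuning the multilingual checkpoint on French data (assuming setfit >= 1.0, which provides Trainer and TrainingArguments; the tiny French dataset below is made up purely for illustration):

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# Made-up few-shot French dataset, just for illustration.
train_dataset = Dataset.from_dict({
    "text": [
        "Ce film était excellent !",
        "Quel navet, je me suis ennuyé du début à la fin.",
        "Une lecture passionnante, je le recommande.",
        "Service décevant, je ne reviendrai pas.",
    ],
    "label": [1, 0, 1, 0],
})

# Start from the multilingual checkpoint instead of the English-only one.
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
)

args = TrainingArguments(batch_size=16, num_epochs=1)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()

# The fine-tuned model works on French and, per the paper's Table 4,
# often transfers to languages the fine-tuning data never covered.
print(model.predict(["J'ai adoré ce restaurant.", "Ich fand den Film langweilig."]))
```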
That said, if a specifically French sentence transformer exists on the Hub and you want to use your model only on French texts, I would recommend using it: not because you must, but because it might perform better than the multilingual model.
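Swapping one in is the same one-line change as above; the checkpoint name here is just a placeholder, not a recommendation of a specific model:

```python
# Placeholder name: substitute whichever French Sentence Transformer you find on the Hub.
model = SetFitModel.from_pretrained("some-org/french-sentence-transformer")
```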