Tags: python-3.x, multilingual, huggingface-transformers

Using Hugging Face Transformers with a non-English language


I have installed the latest version of transformers and was able to use its simple syntax to make sentiment predictions for English phrases:

from transformers import pipeline
sentimentAnalysis = pipeline("sentiment-analysis")
print(sentimentAnalysis("Transformers piplines are easy to use"))

[{'label': 'POSITIVE', 'score': 0.9305251240730286}]

print(sentimentAnalysis("Transformers piplines are extremely easy to use"))

[{'label': 'POSITIVE', 'score': 0.9820092916488647}]

However, when I tried it on a non-English language (Greek, in this case), I did not get the results I expected.

The following phrase translates into English as 'This food is disgusting', so I would expect a very low sentiment score, which is not what I got:

print(sentimentAnalysis("Αυτό το φαγητό είναι αηδιαστικό"))
[{'label': 'POSITIVE', 'score': 0.7899578213691711}]

Here is an attempt to use the best multilingual model:

[Screenshot: output of a multilingual sentiment model on the same phrase]

Somewhat better, but still far off target.

Is there something I can do about it?


Solution

  • The problem is that pipelines load an English model by default. In the case of sentiment analysis, this is distilbert-base-uncased-finetuned-sst-2-english.

    Fortunately, you can just specify the exact model that you want to load, as described in the docs for pipeline:

    from transformers import pipeline
    pipe = pipeline("sentiment-analysis", model="<your_model_here>", tokenizer="<your_tokenizer_here>")
    

    Keep in mind that these need to be models compatible with the architecture of your respective task. The only Greek model I could find was nlpaueb/bert-base-greek-uncased-v1, which seems to be a base model. In that case, you'd first need to fine-tune your own model for sentiment analysis and could then load from that checkpoint, as sketched below. Otherwise, you might get questionable results as well.
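
    If you go the fine-tuning route, here is a minimal sketch (assuming you prepare your own labeled Greek sentiment dataset; train_dataset and eval_dataset below are placeholders for data you'd have to build yourself):

    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    model_name = "nlpaueb/bert-base-greek-uncased-v1"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Start from the Greek base model and add a 2-class classification head.
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    training_args = TrainingArguments(
        output_dir="./greek-sentiment",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,  # placeholder: your tokenized training examples
        eval_dataset=eval_dataset,    # placeholder: your tokenized validation examples
    )
    trainer.train()
    trainer.save_model("./greek-sentiment")

    Once trained, you could load that checkpoint back into the pipeline, e.g. pipeline("sentiment-analysis", model="./greek-sentiment", tokenizer=model_name).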