Tags: python, deep-learning, pytorch, multilingual, bert-language-model

Multilingual BERT sentence vector captures language used more than meaning - working as intended?


Playing around with BERT, I downloaded the Hugging Face multilingual BERT model and fed it three sentences, saving their sentence vectors (the embedding of [CLS]). I then translated the sentences via Google Translate, passed the translations through the model, and saved their sentence vectors as well.

I then compared the results using cosine similarity.

I was surprised to see that each sentence vector was pretty far from the vector of its translation (0.15-0.27 cosine distance), while different sentences in the same language were quite close (0.02-0.04 cosine distance).

So instead of sentences with similar meaning (but in different languages) being grouped together (in 768-dimensional space ;) ), dissimilar sentences of the same language end up closer to each other.

To my understanding, the whole point of multilingual BERT is inter-language transfer learning - for example, training a model (say, an FC net) on representations in one language and having that model be readily usable in other languages.

How can that work if sentences (in different languages) with the exact same meaning are mapped further apart than dissimilar sentences of the same language?

My code:

import torch

import transformers
from transformers import AutoModel,AutoTokenizer

bert_name="bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(bert_name)
MBERT = AutoModel.from_pretrained(bert_name)

#Some silly sentences
eng1='A cat jumped from the trees and startled the tourists'
e=tokenizer.encode(eng1, add_special_tokens=True)
ans_eng1=MBERT(torch.tensor([e]))

eng2='A small snake whispered secrets to large cats'
e=tokenizer.encode(eng2, add_special_tokens=True)
ans_eng2=MBERT(torch.tensor([e]))

eng3='A tiger sprinted from the bushes and frightened the guests'
e=tokenizer.encode(eng3, add_special_tokens=True)
ans_eng3=MBERT(torch.tensor([e]))

# Translated to Hebrew with Google Translate
heb1='חתול קפץ מהעץ והבהיל את התיירים'
e=tokenizer.encode(heb1, add_special_tokens=True)
ans_heb1=MBERT(torch.tensor([e]))

heb2='נחש קטן לחש סודות לחתולים גדולים'
e=tokenizer.encode(heb2, add_special_tokens=True)
ans_heb2=MBERT(torch.tensor([e]))

heb3='נמר רץ מהשיחים והפחיד את האורחים'
e=tokenizer.encode(heb3, add_special_tokens=True)
ans_heb3=MBERT(torch.tensor([e]))


from scipy import spatial
import numpy as np

# Compare sentence embeddings.
# Index [1] of the model output is the pooler output: the final-layer [CLS]
# representation passed through BERT's tanh pooling layer.

result = spatial.distance.cosine(ans_eng1[1].data.numpy(), ans_heb1[1].data.numpy())

print ('Eng1-Heb1 - Translated sentences',result)


result = spatial.distance.cosine(ans_eng2[1].data.numpy(), ans_heb2[1].data.numpy())

print ('Eng2-Heb2 - Translated sentences',result)

result = spatial.distance.cosine(ans_eng3[1].data.numpy(), ans_heb3[1].data.numpy())

print ('Eng3-Heb3 - Translated sentences',result)

print ("\n---\n")

result = spatial.distance.cosine(ans_heb1[1].data.numpy(), ans_heb2[1].data.numpy())

print ('Heb1-Heb2 - Different sentences',result)

result = spatial.distance.cosine(ans_heb1[1].data.numpy(), ans_heb3[1].data.numpy())

print ('Heb1-Heb3 - Similar sentences',result)

print ("\n---\n")

result = spatial.distance.cosine(ans_eng1[1].data.numpy(), ans_eng2[1].data.numpy())

print ('Eng1-Eng2 - Different sentences',result)

result = spatial.distance.cosine(ans_eng1[1].data.numpy(), ans_eng3[1].data.numpy())

print ('Eng1-Eng3 - Similar sentences',result)

#Output:
"""
Eng1-Heb1 - Translated sentences 0.2074061632156372
Eng2-Heb2 - Translated sentences 0.15557605028152466
Eng3-Heb3 - Translated sentences 0.275478720664978

---

Heb1-Heb2 - Different sentences 0.044616520404815674
Heb1-Heb3 - Similar sentences 0.027982771396636963

---

Eng1-Eng2 - Different sentences 0.027982771396636963
Eng1-Eng3 - Similar sentences 0.024596810340881348
"""

P.S.

At least Heb1 was closer to Heb3 than to Heb2. The same was observed for the English equivalents, though less strongly.


Solution

  • The [CLS] token somehow represents the input sequence, but exactly how is difficult to say. The language is of course an important characteristic of a sentence, probably more so than its meaning. BERT is a pretrained model that tries to capture characteristics such as meaning, structure, and also language. If you want a model that helps you identify whether two sentences in different languages mean the same thing, I can think of two different approaches:

    1. First approach: You can train a classifier (an SVM, logistic regression, or even a small neural net such as a CNN) on that task. Input: two [CLS] tokens; output: same meaning or not. As training data, you could use [CLS]-token pairs of sentences in different languages that either do or do not have the same meaning. To get meaningful results, you would need a lot of such sentence pairs. Luckily, you can either generate them via Google Translate or use parallel texts such as the Bible, which exists in many languages, and extract sentence pairs from there. (A minimal sketch of this approach follows after this list.)

    2. Second approach: Fine-tune the BERT model on exactly that task. As in the previous approach, you need a lot of training data. A sample input to the BERT model would look like this: A cat jumped from the trees and startled the tourists [SEP] חתול קפץ מהעץ והבהיל את התיירים

      To classify whether those sentences have the same meaning, you would add a classification layer on top of the [CLS] token and fine-tune the whole model on that task. (A sketch of this approach also follows after this list.)

    Note: I have never worked with a multilingual BERT model; these approaches are simply what comes to mind for the task described. If you try them, I would be interested to hear how they perform 😊.
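
A minimal sketch of the first approach, assuming you already have labeled sentence pairs. The toy pairs list, the cls_embedding helper, and the choice of scikit-learn's LogisticRegression are illustrative assumptions, not part of the original answer:

import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

bert_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(bert_name)
mbert = AutoModel.from_pretrained(bert_name)
mbert.eval()

def cls_embedding(sentence):
    # Return the pooler output (the [CLS] representation) as a numpy vector.
    ids = tokenizer.encode(sentence, add_special_tokens=True, return_tensors="pt")
    with torch.no_grad():
        out = mbert(ids)
    return out[1][0].numpy()  # out[1] is the pooler output, shape (1, 768)

# Toy training pairs: (sentence_a, sentence_b, same_meaning) - in practice you
# would mine many such pairs from Google Translate or parallel texts.
pairs = [
    ('A cat jumped from the trees and startled the tourists',
     'חתול קפץ מהעץ והבהיל את התיירים', 1),
    ('A small snake whispered secrets to large cats',
     'חתול קפץ מהעץ והבהיל את התיירים', 0),
]

# Feature per pair: both embeddings plus their absolute difference.
X, y = [], []
for a, b, label in pairs:
    ea, eb = cls_embedding(a), cls_embedding(b)
    X.append(np.concatenate([ea, eb, np.abs(ea - eb)]))
    y.append(label)

clf = LogisticRegression(max_iter=1000).fit(np.array(X), np.array(y))
print(clf.predict(np.array(X[:1])))  # sanity check on a training pair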
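
And a minimal sketch of the second approach, assuming a recent transformers version. BertForSequenceClassification adds a classification layer on top of the pooled [CLS] output; the single optimization step and the label convention (1 = same meaning) are illustrative:

import torch
from transformers import AutoTokenizer, BertForSequenceClassification

bert_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(bert_name)
model = BertForSequenceClassification.from_pretrained(bert_name, num_labels=2)

eng = 'A cat jumped from the trees and startled the tourists'
heb = 'חתול קפץ מהעץ והבהיל את התיירים'

# Passing a sentence pair encodes it as: [CLS] eng [SEP] heb [SEP]
batch = tokenizer(eng, heb, return_tensors="pt", padding=True, truncation=True)
labels = torch.tensor([1])  # 1 = same meaning, 0 = different meaning

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()

outputs = model(**batch, labels=labels)  # loss from the head on top of [CLS]
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

# After training on many such pairs, classify a new pair:
model.eval()
with torch.no_grad():
    logits = model(**tokenizer(eng, heb, return_tensors="pt")).logits
print(logits.softmax(dim=-1))  # probabilities for [different, same]

In practice you would loop the optimization step over batches of many labeled pairs rather than a single example.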