python, pytorch, multilingual, bert-language-model

Bert-multilingual in pytorch


I am using BERT embeddings for French text data, and I have a problem with loading the model and vocabulary.

I used the following code for tokenization, which works well, but when I inspect the vocabulary it gives me Chinese words!

from transformers import BertTokenizer  # or: from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
text = "La Banque Nationale du Canada fête cette année le 110e anniversaire de son bureau de Paris."
marked_text = "[CLS] " + text + " [SEP]"
tokenized_text = tokenizer.tokenize(marked_text)
list(tokenizer.vocab.keys())[5000:5020]

I expected French words in the vocabulary, but I get Chinese words. Should I specify the language somewhere in the code?


Solution

  • You are getting Chinese text because you are looking at a specific slice of the vocabulary, [5000:5020], and that range happens to contain Chinese tokens. bert-base-multilingual-cased is trained on 104 languages, so its single shared vocabulary holds subwords from all of them; you do not need to specify a language anywhere.

    If you want to verify your code further, you can use this:

    tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
    text = "La Banque Nationale du Canada fête cette année le 110e anniversaire de son bureau de Paris."
    marked_text = "[CLS] " + text + " [SEP]"
    tokenized_text = tokenizer.tokenize(marked_text)
    

    which is the same as your code, followed by:

    token_no = []
    for token in tokenized_text:
        # print(tokenizer.vocab[token])  # check the index that corresponds to each token
        token_no.append(tokenizer.vocab[token])

    # The code below recovers the tokens from the indices, which is similar to
    # what you were trying, but on the correct range.
    new_token_list = []
    for i in token_no:
        new_token_list.append(list(tokenizer.vocab.keys())[i])

    # print(new_token_list)  # check that the recovered tokens match
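
    As a side note, the tokenizer also exposes convert_tokens_to_ids and convert_ids_to_tokens, which do this index round trip in one call each. The sketch below is only an illustration of the same check using those helpers (it assumes the Hugging Face transformers package; older packages may differ slightly):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

    text = "La Banque Nationale du Canada fête cette année le 110e anniversaire de son bureau de Paris."
    tokens = tokenizer.tokenize("[CLS] " + text + " [SEP]")

    # Map the French subwords to their positions in the shared 104-language vocabulary ...
    ids = tokenizer.convert_tokens_to_ids(tokens)

    # ... and back again; a lossless round trip confirms the French pieces are in the vocabulary.
    recovered = tokenizer.convert_ids_to_tokens(ids)

    print(list(zip(tokens, ids)))
    print(recovered == tokens)  # True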