Tags: amazon-web-services, facebook, amazon-sagemaker, huggingface-transformers, machine-translation

How to specify a forced_bos_token_id when using Facebook's M2M-100 HuggingFace model through AWS SageMaker?


The model page provides this snippet for how the model should be used:

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
chinese_text = "生活就像一盒巧克力。"

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_1.2B")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_1.2B")

# translate Hindi to French
tokenizer.src_lang = "hi"
encoded_hi = tokenizer(hi_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "La vie est comme une boîte de chocolat."

# translate Chinese to English
tokenizer.src_lang = "zh"
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Life is like a box of chocolate."

It also provides a snippet for how to deploy and use it with AWS SageMaker:

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()
# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID':'facebook/m2m100_1.2B',
    'HF_TASK':'text2text-generation'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    env=hub,
    role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1, # number of instances
    instance_type='ml.m5.xlarge' # ec2 instance type
)

predictor.predict({
    'inputs': "The answer to the universe is"
})

It is not clear, however, how to specify the source language or the target language with the AWS setup. I tried:

predictor.predict({
    'inputs': "The answer to the universe is",
    'forced_bos_token_id': "fr"
})

but my parameter was ignored.

I haven't managed to find any documentation that explains what the expected format is for this API.


Solution

  • Even with the SageMaker deployment, the tokenizer still needs to be installed and imported locally, since it is what maps a language code to the token ID that generate expects:

    pip install transformers
    pip install sentencepiece
    

    Then the language ID needs to be looked up with the tokenizer and passed inside a 'parameters' dictionary (generation keyword arguments are not read from the top level of the payload):

    from transformers import M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_1.2B")
    predictor.predict({
        'inputs': "The answer to the universe is",
        'parameters': {
            'forced_bos_token_id': tokenizer.get_lang_id("it")
        }
    })
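For clarity, the shape of the request body can be sketched in plain Python. The helper name `build_payload` and the numeric token ID below are illustrative placeholders; the real ID comes from `tokenizer.get_lang_id`:

```python
import json

def build_payload(text: str, bos_token_id: int) -> str:
    """Build the JSON body for the endpoint: generation kwargs such as
    forced_bos_token_id go under 'parameters', not at the top level."""
    return json.dumps({
        "inputs": text,
        "parameters": {"forced_bos_token_id": bos_token_id},
    })

# Placeholder ID; in practice use tokenizer.get_lang_id("it")
it_lang_id = 128042
body = build_payload("The answer to the universe is", it_lang_id)

# The same body could also be sent to the deployed endpoint with boto3,
# e.g. (endpoint name is hypothetical):
# runtime = boto3.client("sagemaker-runtime")
# runtime.invoke_endpoint(EndpointName="my-m2m100-endpoint",
#                         ContentType="application/json", Body=body)
```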