Use Quantization on HuggingFace Transformers models

I'm learning Quantization, and am experimenting with Section 1 of this notebook.

I want to use this code on my own models.

In principle, I should only need to reassign the model variable in Section 1.2:

# load model
model = BertForSequenceClassification.from_pretrained(configs.output_dir)

My models are loaded a different way, via from transformers import pipeline, so calling .to() on the result throws an AttributeError.

My Model:

pip install transformers
from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-uncased')
model = unmasker("Hello I'm a [MASK] model.")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

How might I run the linked Quantization code on my example model?

Please let me know if there's anything else I should clarify in this post.


  • The pipeline approach won't work for quantisation, because the quantisation code needs the model object itself and calling a pipeline returns predictions instead. You can, however, still use pipeline to test the original models, for example for timing.

    Quantisation Code:

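    model is the object to pass to the quantisation code. token_logits holds the logits the model outputs for the example sentence; as written, the model is loaded but not yet quantised.

    To process several models, you could place a for-loop around the code below and replace model_name with each string from a list.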

    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    model_name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Returns the model object itself, which is what the quantisation code needs
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    sequence = (
        "Distilled models are smaller than the models they mimic. Using them instead of the large "
        f"versions would help {tokenizer.mask_token} our carbon footprint."
    )
    inputs = tokenizer(sequence, return_tensors="pt")
    # Position of the [MASK] token in the input sequence
    mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
    token_logits = model(**inputs).logits
    # <- can stop here
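
Once model is an actual model object, as in the code above, a quantisation step can be applied to it. The question does not say which technique the linked notebook uses, so the following is only a minimal sketch that assumes PyTorch dynamic quantisation of the nn.Linear layers. The helper name quantize_linear_layers and the toy stand-in model are hypothetical and exist only so the sketch runs without downloading a checkpoint.

```python
import torch
from torch import nn


def quantize_linear_layers(model: nn.Module) -> nn.Module:
    """Return a copy of `model` whose nn.Linear layers use int8 weights.

    Hypothetical helper; assumes dynamic quantisation is the technique wanted.
    """
    model.eval()  # dynamic quantisation targets inference, not training
    return torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )


# Toy stand-in (an assumption, not the BERT model from the post).
# In practice, pass the object returned by AutoModelForMaskedLM.from_pretrained.
toy_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
quantized = quantize_linear_layers(toy_model)

# The quantised copy is called exactly like the original model.
with torch.no_grad():
    out = quantized(torch.randn(2, 16))
```

With the BERT model from the answer, the call would presumably be quantize_linear_layers(model), followed by quantized(**inputs).logits in place of model(**inputs).logits. Dynamic quantisation of this kind runs on CPU, so any timing comparison should keep the original model on CPU as well.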
