
How to compute sentence level perplexity from hugging face language models?


I have a large collection of documents, each consisting of ~10 sentences. For each document, I wish to find the sentence that maximises perplexity, or equivalently the loss from a fine-tuned causal LM. I have decided to use Hugging Face and the distilgpt2 model for this purpose. I have 2 problems when trying to do this in an efficient (vectorized) fashion:

  1. The tokenizer requires padding to work in batch mode, but when computing the loss on padded input_ids those pad tokens contribute to the loss. So the loss of a given sentence depends on the length of the longest sentence in the batch, which is clearly wrong.

  2. When I pass a batch of input IDs to the model and compute the loss, I get a scalar as it (mean?) pools across the batch. I instead need the loss per item, not the pooled one.

I made a version that operates on a sentence by sentence basis and while correct, it is extremely slow (I want to process ~ 25m sentences total). Any advice?

Minimal example below:

# Init
import spacy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("clm-gpu/checkpoint-138000")
segmenter = spacy.load('en_core_web_sm')

# That's the part I need to vectorise, surely within a document (bsize ~ 10)
# and ideally across documents (bsize as big as my GPU can handle)
def select_sentence(sentences):
    """We pick the sentence that maximizes perplexity"""
    max_loss, best_index = 0, 0
    for i, sentence in enumerate(sentences):
        # One sentence at a time: no padding needed, but slow
        encodings = tokenizer(sentence, return_tensors="pt")
        input_ids = encodings.input_ids
        with torch.no_grad():  # inference only, no gradients needed
            loss = model(input_ids, labels=input_ids).loss.item()
        if loss > max_loss:
            max_loss = loss
            best_index = i

    return sentences[best_index]

# `documents` (an iterable of strings) and `write` are defined elsewhere
for document in documents:
    sentences = [sentence.text.strip() for sentence in segmenter(document).sents]
    best_sentence = select_sentence(sentences)
    write(best_sentence)


Solution

  • If the goal is to compute perplexity and then select the sentences, there's a better way to do the perplexity computation without messing around with tokens/models.

    Install https://huggingface.co/spaces/evaluate-metric/perplexity:

    pip install -U evaluate
    

    Then:

    import evaluate

    perplexity = evaluate.load("perplexity", module_type="metric")
    input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
    
    results = perplexity.compute(model_id='gpt2',
                                 add_start_token=False,
                                 predictions=input_texts)
    print(list(results.keys()))
    
    

    [out]:

    ['perplexities', 'mean_perplexity']

    >>> print(round(results["mean_perplexity"], 2))
    646.75
    >>> print(round(results["perplexities"][0], 2))
    32.25
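
Since the "perplexities" list comes back in the same order as the input texts, one way to use it for the original task is to flatten all sentences into a single list, score them in one call, and then map the scores back to their documents. Below is a minimal sketch of that bookkeeping step only; `pick_max_perplexity_sentences` is a hypothetical helper name and the scores are made up for illustration.

```python
def pick_max_perplexity_sentences(documents, perplexities):
    """Return, for each document, the sentence with the highest perplexity.

    documents: list of lists of sentences.
    perplexities: flat list of scores, ordered like the flattened sentences.
    """
    best = []
    offset = 0
    for sentences in documents:
        # Slice out the scores that belong to this document
        scores = perplexities[offset:offset + len(sentences)]
        offset += len(sentences)
        if not sentences:
            best.append(None)  # empty document: nothing to pick
            continue
        best_index = max(range(len(sentences)), key=scores.__getitem__)
        best.append(sentences[best_index])
    return best


documents = [["a b c", "d e"], ["f", "g h", "i j k"]]
flat_sentences = [s for doc in documents for s in doc]
# In practice the scores would come from something like:
#   perplexity.compute(model_id=..., predictions=flat_sentences)["perplexities"]
perplexities = [10.0, 42.5, 7.0, 3.0, 99.0]  # made-up scores
print(pick_max_perplexity_sentences(documents, perplexities))
# ['d e', 'i j k']
```

Flattening first lets the metric's own batch_size argument control how many sentences go through the model at once, independent of document boundaries.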
    

    Q: That's great but how do I use it for a custom model that can't be fetched with model_id=...?

    A: For that, let's look under the hood: https://huggingface.co/spaces/evaluate-metric/perplexity/blob/main/perplexity.py

    This is how the code initializes the model:

    class Perplexity(evaluate.Metric):
        def _info(self):
            return evaluate.MetricInfo(
                module_type="metric",
                description=_DESCRIPTION,
                citation=_CITATION,
                inputs_description=_KWARGS_DESCRIPTION,
                features=datasets.Features(
                    {
                        "predictions": datasets.Value("string"),
                    }
                ),
                reference_urls=["https://huggingface.co/docs/transformers/perplexity"],
            )
    
        def _compute(
            self, predictions, model_id, batch_size: int = 16, add_start_token: bool = True, device=None, max_length=None
        ):
            ...
            model = AutoModelForCausalLM.from_pretrained(model_id)
            model = model.to(device)
    
            tokenizer = AutoTokenizer.from_pretrained(model_id)
            ...
    

    Argh, there's no support for local models!

    What if we make some simple changes to the code? =)

    See Load a pre-trained model from disk with Huggingface Transformers

    
    class Perplexity(evaluate.Metric):
        def _info(self):
            return evaluate.MetricInfo(
                module_type="metric",
                description=_DESCRIPTION,
                citation=_CITATION,
                inputs_description=_KWARGS_DESCRIPTION,
                features=datasets.Features(
                    {
                        "predictions": datasets.Value("string"),
                    }
                ),
                reference_urls=["https://huggingface.co/docs/transformers/perplexity"],
            )
    
        def _compute(
            self, predictions, model_id, batch_size: int = 16, add_start_token: bool = True, device=None, max_length=None, local_file_only: bool = False
        ):
            ...
            model = AutoModelForCausalLM.from_pretrained(model_id, local_files_only=local_file_only)
            model = model.to(device)
    
            tokenizer = AutoTokenizer.from_pretrained(model_id, local_files_only=local_file_only)
    

    Technically, if you can load your local model with:

    AutoModelForCausalLM.from_pretrained("clm-gpu/checkpoint-138000", local_files_only=True)
    

    you should be able to pass the same path as model_id after the code change:

    perplexity.compute(model_id="clm-gpu/checkpoint-138000",
                       add_start_token=False,
                       predictions=input_texts,
                       local_file_only=True)
    

    Opened a pull-request: https://huggingface.co/spaces/evaluate-metric/perplexity/discussions/4
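
For reference, the two problems in the question (pad tokens leaking into the loss, and the loss being averaged over the whole batch) can also be handled directly in PyTorch. The following is a minimal sketch rather than the metric's exact code: `per_sentence_loss` is a hypothetical helper name, and it assumes right-padded batches plus an attention mask like the one the tokenizer returns with padding=True.

```python
import torch
import torch.nn.functional as F


def per_sentence_loss(logits, input_ids, attention_mask):
    """Mean next-token cross-entropy per row, ignoring padded positions.

    logits: (batch, seq, vocab) float tensor from a causal LM.
    input_ids, attention_mask: (batch, seq) integer tensors.
    """
    # Position t predicts token t+1: drop the last logit and the first label.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = attention_mask[:, 1:].to(logits.dtype)
    # cross_entropy expects (batch, vocab, seq); reduction="none" keeps
    # one loss value per token instead of a single pooled scalar.
    token_loss = F.cross_entropy(
        shift_logits.transpose(1, 2), shift_labels, reduction="none"
    )
    # Zero out padded positions, then average over real tokens only.
    # clamp avoids dividing by zero for one-token sentences (their loss is 0).
    n_tokens = shift_mask.sum(dim=1).clamp(min=1)
    return (token_loss * shift_mask).sum(dim=1) / n_tokens


# Sketch of how it might be called (assumes `tokenizer`, `model`, and a list
# of strings `sentences` already exist, as in the question):
#   enc = tokenizer(sentences, return_tensors="pt", padding=True)
#   with torch.no_grad():
#       logits = model(**enc).logits
#   losses = per_sentence_loss(logits, enc.input_ids, enc.attention_mask)
#   best_sentence = sentences[int(losses.argmax())]
```

Because exp is monotonic, the sentence with the highest mean loss is also the one with the highest perplexity, so torch.exp(losses) is only needed if the perplexity values themselves are wanted.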