Tags: image-processing, huggingface-transformers, bert-language-model, transformer-model, attention-model

How can we get the attention scores of multimodal models via the Hugging Face library?


I was wondering whether we can get the attention scores of any multimodal model through the API provided by the Hugging Face library. It is relatively easy to get such scores from a plain BERT language model, but what about LXMERT? If anyone can answer this, it would help with understanding such models.


Solution

  • I am not sure whether this is true for every model, but most of them, including LXMERT, support the parameter output_attentions (see the LXMERT sketch after the CLIP example below).

    Example with CLIP:

    from PIL import Image
    import requests
    from transformers import CLIPProcessor, CLIPModel
    
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
    inputs = processor(
        text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
    )
    
    # request attention weights from both the text and vision towers
    outputs = model(**inputs, output_attentions=True)
    print(outputs.keys())
    print(outputs.text_model_output.keys())
    print(outputs.vision_model_output.keys())
    

    Output:

    odict_keys(['logits_per_image', 'logits_per_text', 'text_embeds', 'image_embeds', 'text_model_output', 'vision_model_output'])
    odict_keys(['last_hidden_state', 'pooler_output', 'attentions'])
    odict_keys(['last_hidden_state', 'pooler_output', 'attentions'])
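
    The attentions entry holds one tensor per encoder layer, each of shape (batch, num_heads, seq_len, seq_len); for example, outputs.vision_model_output.attentions[0] gives the attention weights of the first vision layer.

    Since the question asks about LXMERT specifically, here is a minimal sketch of the same idea with LxmertModel. LXMERT expects precomputed region features from an object detector (e.g. a Faster R-CNN backbone), so the visual_feats and visual_pos tensors below are random placeholders with the expected shapes, not real detector output.

    import torch
    from transformers import LxmertTokenizer, LxmertModel

    tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
    model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

    inputs = tokenizer("a photo of a cat", return_tensors="pt")

    # Placeholder visual inputs: 36 regions with 2048-dim features and
    # normalized box coordinates; in practice these come from a detector.
    num_boxes = 36
    visual_feats = torch.randn(1, num_boxes, 2048)
    visual_pos = torch.rand(1, num_boxes, 4)

    outputs = model(
        **inputs,
        visual_feats=visual_feats,
        visual_pos=visual_pos,
        output_attentions=True,
    )

    # One tensor per layer, each shaped (batch, heads, query_len, key_len)
    print(len(outputs.language_attentions), outputs.language_attentions[0].shape)
    print(len(outputs.vision_attentions), outputs.vision_attentions[0].shape)
    print(len(outputs.cross_encoder_attentions), outputs.cross_encoder_attentions[0].shape)

    The language_attentions, vision_attentions, and cross_encoder_attentions fields correspond to LXMERT's language encoder, vision encoder, and cross-modality layers respectively, so you can inspect intra-modal and cross-modal attention separately.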