I was wondering if we can get the attention scores of any multimodal model using the API provided by the Hugging Face library. It is relatively easy to get such scores from a standard BERT language model, but what about LXMERT? If anyone can answer this, it would help with understanding these models.
I am not sure if this is true for all of the models, but most of them, including LXMERT, support the output_attentions parameter.
Example with CLIP:
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

# Load the pretrained CLIP model and its processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Download an example image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare text and image inputs as PyTorch tensors
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
)

# Request attention weights with output_attentions=True
outputs = model(**inputs, output_attentions=True)
print(outputs.keys())
print(outputs.text_model_output.keys())
print(outputs.vision_model_output.keys())
Output:
odict_keys(['logits_per_image', 'logits_per_text', 'text_embeds', 'image_embeds', 'text_model_output', 'vision_model_output'])
odict_keys(['last_hidden_state', 'pooler_output', 'attentions'])
odict_keys(['last_hidden_state', 'pooler_output', 'attentions'])
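The attentions entry in each sub-output is a tuple with one tensor per encoder layer, each of shape (batch_size, num_heads, sequence_length, sequence_length), so you can inspect them directly, continuing from the snippet above:

# Attention weights are returned per layer for each encoder
text_attentions = outputs.text_model_output.attentions
vision_attentions = outputs.vision_model_output.attentions

print(len(text_attentions))        # number of text-encoder layers
print(text_attentions[0].shape)    # (batch_size, num_heads, seq_len, seq_len)
print(len(vision_attentions))      # number of vision-encoder layers
print(vision_attentions[0].shape)  # (batch_size, num_heads, num_patches + 1, num_patches + 1)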
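For LXMERT specifically, the same output_attentions flag works, but the model expects precomputed region features (e.g. from a Faster R-CNN detector) alongside the text. Below is a minimal sketch using the unc-nlp/lxmert-base-uncased checkpoint, with random tensors standing in for the 36 region features (dim 2048) and box positions (dim 4) that a real detector would provide:

import torch
from transformers import LxmertModel, LxmertTokenizer

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

inputs = tokenizer("a photo of a cat", return_tensors="pt")

# Placeholder visual inputs: in practice these come from an object detector
visual_feats = torch.randn(1, 36, 2048)
visual_pos = torch.rand(1, 36, 4)

outputs = model(
    **inputs,
    visual_feats=visual_feats,
    visual_pos=visual_pos,
    output_attentions=True,
)

# LXMERT returns separate attention tuples for each stream
print(outputs.keys())
print(outputs.language_attentions[0].shape)       # text self-attention
print(outputs.vision_attentions[0].shape)         # visual self-attention
print(outputs.cross_encoder_attentions[0].shape)  # cross-modal attention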