Tags: machine-learning, conv-neural-network, openai-api, large-language-model, word-embedding

How to get multimodal embeddings from CLIP model?


I'm hoping to use CLIP to get a single embedding for rows of multimodal (image and text) data.

Say I have the following model:

from PIL import Image
import torch
from transformers import CLIPProcessor, CLIPModel
import torchvision.transforms as transforms

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def convert_image_data_to_tensor(image_data):
    return torch.tensor(image_data)

dataset = df[['image_data', 'text_data']].to_dict('records')

embeddings = []
for data in dataset:
    image_tensor = convert_image_data_to_tensor(data['image_data'])
    text = data['text_data']

    inputs = processor(text=text, images=image_tensor, return_tensors="pt")  # "pt" returns PyTorch tensors
    with torch.no_grad():
        output = model(**inputs)

I want to get the embeddings calculated in output. I know that output has the attributes text_embeds and image_embeds, but I'm not sure how they interact later on. If I want to get a single embedding for each record, should I just be concatenating these attributes together? Is there another attribute that combines the two in some other way?

These are the attributes stored in output:

print(dir(output))

['__annotations__', '__class__', '__contains__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__post_init__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'clear', 'copy', 'fromkeys', 'get', 'image_embeds', 'items', 'keys', 'logits_per_image', 'logits_per_text', 'loss', 'move_to_end', 'pop', 'popitem', 'setdefault', 'text_embeds', 'text_model_output', 'to_tuple', 'update', 'values', 'vision_model_output']

Also, is there a way to specify the size of the embedding that CLIP outputs, similar to how you can specify the embedding size in BERT configs?

Thanks in advance for any help here. Feel free to correct me if I'm misunderstanding anything critical here.


Solution

  • CLIP is trained such that the text and image embeddings are projected onto a shared latent space. In fact, image-text similarity is what the model is trained to optimise.

    So a very typical use case of CLIP is to compare and match images and text based on similarity. In your case, you don't seem to be interested in any measure of similarity: you already have an image and the text and want some joint embedding representation. So concatenating the two embeddings as you described is fine. An alternative would be to take their mean (since they are in the same embedding space, it's fine to do this).
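
    As a minimal sketch of both options, reusing model, processor, inputs and the embeddings list from your question (and assuming inputs was built with return_tensors="pt"):

    with torch.no_grad():
        output = model(**inputs)

    image_emb = output.image_embeds   # shape (1, 512) for clip-vit-base-patch32
    text_emb = output.text_embeds     # shape (1, 512)

    # Recent transformers versions already return L2-normalised embeds,
    # so this is usually a no-op, but it keeps the two modalities on the same scale.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    joint_concat = torch.cat([image_emb, text_emb], dim=-1)   # shape (1, 1024)
    joint_mean = (image_emb + text_emb) / 2                   # shape (1, 512)

    embeddings.append(joint_concat.squeeze(0))   # or joint_mean, whichever suits your downstream task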

    As for the size of the embedding, I don't think there is a way to change it, as it's hardwired into the architecture of the model when it's trained. You can perhaps employ a dimensionality reduction technique, or fine-tune the model after stacking another fully connected layer with the dimensionality of your choice.
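
    If you do go the fine-tuning route, a rough sketch could look like the following. Note that JointProjection and its sizes are hypothetical, not part of CLIP: it's just an untrained linear head on top of the concatenated embedding (joint_concat from the sketch above), so it only becomes useful once you train it on your downstream task.

    import torch.nn as nn

    class JointProjection(nn.Module):
        # Hypothetical head: maps the 1024-d concatenated CLIP embedding
        # down to a dimensionality of your choice.
        def __init__(self, in_dim=1024, out_dim=256):
            super().__init__()
            self.proj = nn.Linear(in_dim, out_dim)

        def forward(self, joint_embedding):
            return self.proj(joint_embedding)

    head = JointProjection(in_dim=1024, out_dim=256)
    small_embedding = head(joint_concat)   # shape (1, 256)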