I'm trying to build an image-captioning model with the Hugging Face BLIP-2 model on Colab. My code worked fine until last week (Nov 8), but now it raises an exception.
To install the packages, I use the following command:
!pip install -q git+https://github.com/huggingface/peft.git transformers bitsandbytes datasets
To load the BLIP-2 processor and model, I use the following code:
model_name = "Salesforce/blip2-opt-2.7b"
processor = AutoProcessor.from_pretrained(model_name)
model = Blip2ForConditionalGeneration.from_pretrained(model_name,device_map="auto",load_in_8bit=False)
I use the following code to generate captions:
def generate_caption(processor, model, image_path):
    image = PILImage.open(image_path).convert("RGB")
    print("image shape:", image.size)
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Preprocess the image
    inputs = processor(images=image, return_tensors="pt").to(device)
    print("Input shape:", inputs["pixel_values"].shape)
    print("Device:", device)  # additional debugging
    for key, value in inputs.items():
        print(f"Key: {key}, Shape: {value.shape}")

    # Generate caption
    with torch.no_grad():
        generated_ids = model.generate(**inputs)
    caption = processor.decode(generated_ids[0], skip_special_tokens=True)
    return caption
Here is the code that calls this function to generate a caption:
image_path = "my_image_path.jpg"
caption = generate_caption(processor, model, image_path)
print(f"{image_path}: {caption}"
Finally, this is the output and the error from running the code above:
image shape: (320, 240)
Input shape: torch.Size([1, 3, 224, 224])
Device: cuda
Key: pixel_values, Shape: torch.Size([1, 3, 224, 224])
---------------------------------------------------------------------------
...
/usr/local/lib/python3.10/dist-packages/transformers/models/blip_2/modeling_blip_2.py in generate(self, pixel_values, input_ids, attention_mask, interpolate_pos_encoding, **generate_kwargs)
2314 if getattr(self.config, "image_token_index", None) is not None:
2315 special_image_mask = (input_ids == self.config.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
-> 2316 inputs_embeds[special_image_mask] = language_model_inputs.flatten()
2317 else:
2318 logger.warning_once(
RuntimeError: shape mismatch: value tensor of shape [81920] cannot be broadcast to indexing result of shape [0]
I have searched the internet and asked various AI models for help, but to no avail. My guess is that a package update broke something, since my code had no problem last week. (I tried restoring my code to the Nov 8 version, but it still throws the same exception.) Moreover, I don't understand how the 81920 in the error message is calculated.
I had the same issue. You need to pass a text prompt to the processor:
prompt = " "
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device="cuda", dtype=torch.float16)
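From what I can tell, newer transformers versions make model.generate() scatter the vision embeddings into special image placeholder tokens inside input_ids, and the processor only inserts those placeholder tokens when you give it text. This also explains the 81920: blip2-opt-2.7b uses 32 Q-Former query tokens and the OPT-2.7b language model has a hidden size of 2560, so the vision embeddings hold 32 × 2560 = 81920 values, while with no text there are 0 image-token positions to write them into, hence "[81920] cannot be broadcast to [0]".
For reference, here is a minimal end-to-end sketch with the fix applied. Loading the model in float16 (torch_dtype=torch.float16) is my addition so the weights match the float16 inputs, and it assumes a GPU runtime; adjust both if you run on CPU:

import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration

model_name = "Salesforce/blip2-opt-2.7b"
processor = AutoProcessor.from_pretrained(model_name)
# float16 weights so the model matches the float16 inputs below (assumes a GPU)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.float16
)

def generate_caption(processor, model, image_path):
    image = Image.open(image_path).convert("RGB")
    # Passing text (even a single space) makes the processor insert the
    # image placeholder tokens that model.generate() looks for in input_ids.
    inputs = processor(images=image, text=" ", return_tensors="pt").to(
        device=model.device, dtype=torch.float16
    )
    with torch.no_grad():
        generated_ids = model.generate(**inputs)
    return processor.decode(generated_ids[0], skip_special_tokens=True).strip()

print(generate_caption(processor, model, "my_image_path.jpg"))

If you would rather keep your original code unchanged, pinning transformers to whatever version you had installed on Nov 8 should also restore the old behavior.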
Hope it helps.