I'm working from the Pokemon fine-tuning notebook in the Lambda Labs examples repo on GitHub, which trains on the Pokemon BLIP captions dataset; the training code lives in the justinpinkney/stable-diffusion code base. I want to fine-tune Stable Diffusion on the MuMu dataset of album covers instead.
I have an (N, 512, 512, 3) numpy array of images and a length-N list of caption strings. The original code base works with a <class 'datasets.arrow_dataset.Dataset'> object, so I attempt to convert my data to that format using datasets.Dataset.from_dict() inside hf_dataset() in ldm/data/simple.py:
from datasets import Dataset

img_dict = {}
for i in range(len(img_tensor)):
    img_dict[i] = {'image': img_tensor[i], 'text': img_captions[i]}
ds = Dataset.from_dict(img_dict)
This produces a huge error traceback ending in:
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: arrays to be concatenated must be identically typed, but list<item: list<item: uint8>> and string were encountered.
I think the problem is that img_tensor[i] is a 2D array (a list of lists of uint8 entries) and img_captions[i] is a string. How can I convert my data to a datasets.arrow_dataset.Dataset object?
Solved it; I wasn't reading the documentation carefully enough. I had a dictionary with integer keys and dictionary values, but from_dict needs "a mapping of strings to Arrays or Python lists", per the documentation. Here's a toy example:
import numpy as np
from datasets import Dataset

img_tensor = np.zeros((100, 512, 512, 3))
captions = []
for i in range(100):
    captions.append('hello')
captioned_imgs = {
    'images': img_tensor,
    'text': captions
}
out = Dataset.from_dict(captioned_imgs)
print(type(out))
print(type(out))
Output: <class 'datasets.arrow_dataset.Dataset'>