Tags: python, dataset, data-preprocessing, huggingface-datasets, stable-diffusion

Convert dictionary to datasets.arrow_dataset.Dataset


I'm trying to use the Pokemon fine-tuning notebook from the Lambda Labs examples repo on GitHub, which fine-tunes on the Pokemon BLIP captions dataset; the training code itself lives in the justinpinkney/stable-diffusion code base. I want to fine-tune Stable Diffusion on the MuMu dataset of album covers instead.

I have an (N, 512, 512, 3) numpy array of images and a length-N list of caption strings. The original code base works with a <class 'datasets.arrow_dataset.Dataset'> object, so I attempt to convert my data to that format using datasets.Dataset.from_dict() inside hf_dataset() in ldm/data/simple.py:

from datasets import Dataset

# One record per integer key: {'image': <(512, 512, 3) array>, 'text': <caption>}
img_dict = {}
for i in range(len(img_tensor)):
    img_dict[i] = {'image': img_tensor[i], 'text': img_captions[i]}

ds = Dataset.from_dict(img_dict)

This produces a huge error traceback ending in:

File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: arrays to be concatenated must be identically typed, but list<item: list<item: uint8>> and string were encountered.

I think the problem is that img_tensor[i] is a nested array of uint8 entries (lists of lists) while img_captions[i] is a string, so Arrow cannot reconcile the two types. How can I convert my data to a datasets.arrow_dataset.Dataset object?


Solution

  • Solved it; I wasn't reading the documentation carefully enough. I had a dictionary of integer keys with dictionary values, but Dataset.from_dict() needs "a mapping of strings to Arrays or Python lists" as per the documentation, i.e. one key per column rather than one key per row. Here's a toy example:

    import numpy as np
    from datasets import Dataset

    img_tensor = np.zeros((100, 512, 512, 3))
    captions = ['hello'] * 100

    # One key per column; each value holds all N entries for that column
    captioned_imgs = {
        'images': img_tensor,
        'text': captions
    }

    out = Dataset.from_dict(captioned_imgs)
    print(type(out))
    

    Output: <class 'datasets.arrow_dataset.Dataset'>
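
    As a quick follow-up sketch (not shown in the original answer): once the Dataset exists, you can check the schema Arrow inferred and pull rows back out, and set_format('numpy') makes indexing return numpy arrays again. The 'images'/'text' column names simply mirror the toy example above.

    # Inspect the inferred column schema
    print(out.features)

    # Indexing returns a plain dict of column -> value (nested lists by default)
    row = out[0]
    print(row['text'])  # 'hello'

    # Switch the output format so columns come back as numpy arrays
    out.set_format('numpy')
    print(out[0]['images'].shape)  # (512, 512, 3)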