Let me clarify my problem with bullet points.
For PyTorch (or whatever the reason), I'm making a Dataset class:

    # parentheses are needed: without them, the conditional expression swallows the whole path
    p = Path(self.root_dir) / ("Training" if self.is_train else "Validation")
    image_p = p / "01.원천데이터" / f"{'T' if self.is_train else 'V'}S_images"
    label_p = p / "02.라벨링데이터" / f"{'T' if self.is_train else 'V'}L_labels"

    # set the return lists
    image_path_list = []
    label_list = []

    # get image paths
    for sentence_dir in image_p.glob("*"):  # only a few subfolders
        for true_false_dir in sentence_dir.glob("*"):  # only a few subfolders too
            for posture_dir in true_false_dir.glob("*"):  # again only a few subfolders
                # posture_dir contains images, but I need only the last one
                image_path = sorted(posture_dir.glob("*"))[-1]
                image_path_list.append(str(image_path))

Is there any way I could make this execution faster?
Most of the resources about multiprocessing or multithreading seem to be built around applying a function to a list of arguments, but I'm not sure whether that fits my situation.
image_path_list = list(image_p.glob("**/*"))
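The glob pattern ** matches directories at any depth. This is pretty fast, since the directory scanning it relies on (os.scandir) is implemented in C, and it should not need parallelisation (especially compared to anything else you are likely to do with the result). Note that "**/*" returns every entry below image_p, directories included, so filter the result if you only want particular files.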
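
If you only need the last image of each posture folder, as in the question, a fixed-depth pattern may be a closer fit than "**/*". The following is a minimal sketch under the assumption that the layout is always exactly three folder levels (sentence / true_false / posture) below the image root; the function name last_image_per_posture_dir is made up for illustration.

```python
from pathlib import Path


def last_image_per_posture_dir(image_root: Path) -> list[str]:
    """Return the lexicographically last entry of every third-level folder."""
    image_paths = []
    # "*/*/*" matches entries exactly three levels below image_root
    # (assumed layout: sentence / true_false / posture).
    for posture_dir in sorted(image_root.glob("*/*/*")):
        if not posture_dir.is_dir():
            continue
        # max() selects the same item as sorted(...)[-1] without a full sort.
        last = max(posture_dir.iterdir(), default=None)
        if last is not None:  # skip empty folders
            image_paths.append(str(last))
    return image_paths
```

This replaces the three nested loops with a single glob call, and it also skips empty posture folders instead of raising an IndexError.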
In general, you might want to iterate over the generator instead of converting it into a list if the list would consume an inordinate amount of memory. But since you say you are creating a PyTorch Dataset, which needs to implement random access via __getitem__, that might not be the best approach here; a plain list works better.
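
To illustrate why a list suits random access, here is a rough sketch of a map-style dataset that builds its path list once in the constructor. The class name ImagePathDataset is hypothetical, and the class is kept free of torch imports so it runs on its own; in real use you would presumably subclass torch.utils.data.Dataset and load the image inside __getitem__.

```python
from pathlib import Path


class ImagePathDataset:
    """Map-style dataset sketch (hypothetical name, no torch dependency)."""

    def __init__(self, image_root):
        # Consume the glob generator once so every index is addressable later.
        self.image_paths = sorted(
            str(p) for p in Path(image_root).glob("**/*") if p.is_file()
        )

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, index):
        # Random access is a plain list lookup; image loading would go here.
        return self.image_paths[index]
```

A generator could not support ds[i] without being re-run or cached, which is why the list is built up front.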