Tags: python, pytorch, parallel-processing, pathlib

Python: is it possible to run an I/O task in parallel when it inevitably requires a for loop?


Let me clarify my problem with bullet points.

  • In PyTorch (though the framework is not the point here), I'm making a Dataset class
  • I need to get all paths of the images in folder
  • but the images are structured under several subfolders
  • So the actual code I'm using is
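from pathlib import Path

# The parentheses matter: without them, p becomes the plain string "Validation" when is_train is False
p = Path(self.root_dir) / ("Training" if self.is_train else "Validation")

# Hoisting the prefix avoids nested double quotes inside the f-strings (a syntax error before Python 3.12)
prefix = "T" if self.is_train else "V"
image_p = p / "01.원천데이터" / f"{prefix}S_images"
label_p = p / "02.라벨링데이터" / f"{prefix}L_labels"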

# set the return lists
image_path_list = []
label_list = []

# get image paths
for sentence_dir in image_p.glob("*"):  # only have several subfolders
    for true_false_dir in sentence_dir.glob("*"):  # only have several subfolders too
        for posture_dir in true_false_dir.glob("*"):  # only have several subfolders again
            image_path = sorted(posture_dir.glob("*"))[-1]  # in 'posture_dir', there are images, but I need only the last one
            image_path_list.append(str(image_path))
  • What is important here is that I need to go all the way down to the bottom of the subfolders to get the actual path of an image

Is there any way I could make this execution faster?

Most of the resources about multiprocessing or multithreading seem to be built around mapping a function over a list of arguments, and I'm not sure whether that fits my situation...


Solution

  • image_path_list = list(image_p.glob("**/*"))
    

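    The glob pattern ** matches directories at any depth, so a single call replaces the nested loops. pathlib itself is written in Python, but the directory listing it relies on is done by the operating system, so this is usually fast enough and should not need parallelisation (especially compared to anything else you are likely to do with the result). Note that "**/*" yields every file and every directory under image_p, so if you need only the last image in each folder you still have to filter the result.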

    In general, you might prefer to iterate over the generator rather than convert it into a list if the list would consume an inordinate amount of memory. But since you say you are creating a PyTorch Dataset, which needs to implement random access via __getitem__, a generator is not the best fit here; a plain list works better.
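
To tie the two points together, here is a minimal sketch that keeps only the last file of each bottom-level folder and stores the result in a plain list for random access. It assumes the three-level layout from the question (sentence / true-false / posture). The names `last_image_per_dir` and `LastImagePaths` are hypothetical, made up for illustration, and the class only mimics the `__len__` / `__getitem__` protocol rather than subclassing the real PyTorch Dataset.

```python
from pathlib import Path


def last_image_per_dir(image_root: Path, depth: int = 3) -> list[str]:
    """Return the last file (by sorted path) of every directory `depth` levels below image_root.

    Hypothetical helper: depth=3 assumes the sentence/true_false/posture layout.
    """
    # "*/*/*" for depth=3: one glob call replaces the three nested loops
    pattern = "/".join(["*"] * depth)
    paths = []
    for posture_dir in sorted(image_root.glob(pattern)):
        if not posture_dir.is_dir():
            continue  # skip stray files sitting at the directory level
        files = [f for f in posture_dir.iterdir() if f.is_file()]
        if files:
            # max() gives the same element as sorted(...)[-1] without sorting everything
            paths.append(str(max(files)))
    return paths


class LastImagePaths:
    """List-backed container with random access (not a real torch Dataset)."""

    def __init__(self, image_root):
        # Build the list once, up front, so __getitem__ is a cheap index lookup
        self.image_path_list = last_image_per_dir(Path(image_root))

    def __len__(self):
        return len(self.image_path_list)

    def __getitem__(self, idx):
        return self.image_path_list[idx]
```

With this, `LastImagePaths(image_p)[0]` returns the path string of the last image in the first posture folder. If the folder depth varies, the fixed-depth pattern no longer applies and you would have to group the results of `glob("**/*")` by parent directory instead.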