Tags: python, optimization, python-os

Find subfolders which contain images


What is the most efficient way to get the paths of the subfolders that contain files? For example, suppose this is my input structure:

inputFolder
│
├───subFolder1
│   │
│   └───subFolder11
│       │   file1.jpg
│       │   file2.jpg
│       │   ...
│
└───folder2
    │   file021.jpg
    │   file022.jpg

If I call getFolders(inputPath), it should return a list of the folders containing images: ['inputFolder/subFolder1/subFolder11', 'inputFolder/folder2']

Currently I'm using my own library, TreeHandler, which is just a wrapper around os.walk, to get all the files.

import os
from treeHandler import treeHandler
th=treeHandler()
tempImageList=th.getFiles(path,['jpg'])
### tempImageList is now a list of the paths of all files with the '.jpg' extension

### Now the filtering part: this is the line that needs optimisation.
subFolderList=list(set(list(map(lambda x:os.path.join(*x.split('/')[:-1]),tempImageList))))

I think it can be done more efficiently.

Thanks in advance


Solution

    • Splitting a path into all of its parts and re-joining them does unnecessary work.
    • Finding the index of the last '/' and slicing the string there is much faster.

      def remove_tail(path):
          index = path.rfind('/')  # index of the last '/', or -1 if there is none
          return path[:index] if index != -1 else '.'  # '.' when the path has no directory part
      .
      .
      .
      subFolderList = list(set([remove_tail(path) for path in tempImageList]))
      
    • Verified on AWA2 dataset folders (50 folders and 37,322 images).

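    • The new method was about 3 times faster.
    • Readability is improved by using a list comprehension.
    • It handles the case where an image path has no directory part, i.e. the image sits directly in the starting directory. The existing implementation raises an error there, because os.path.join is then called with no arguments.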
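
    As a quick sanity check of the points above, here is a minimal sketch run on a few made-up paths. The sample_paths list and the split_join helper name are illustrative assumptions, not part of the original code; split_join simply wraps the expression from the question.

```python
import os

def remove_tail(path):
    # Slice the path at its last '/'; '.' when there is no directory part.
    index = path.rfind('/')
    return path[:index] if index != -1 else '.'

def split_join(path):
    # The expression from the question, wrapped in a (hypothetical) helper.
    return os.path.join(*path.split('/')[:-1])

# Hypothetical sample paths, including one file with no directory part.
sample_paths = [
    'inputFolder/subFolder1/subFolder11/file1.jpg',
    'inputFolder/subFolder1/subFolder11/file2.jpg',
    'inputFolder/folder2/file021.jpg',
    'file0.jpg',
]

# A set comprehension removes duplicates; sorted() gives a stable order.
folders = sorted({remove_tail(p) for p in sample_paths})
print(folders)
# ['.', 'inputFolder/folder2', 'inputFolder/subFolder1/subFolder11']

# The split/join version has nothing to join for a bare file name.
try:
    split_join('file0.jpg')
except TypeError as exc:
    print('split/join fails on a bare file name:', exc)
```

    Sorting is only there to make the printed result deterministic; the original list(set(...)) returns the same folders in arbitrary order.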

    The code used for verification:

    import os
    import time
    from treeHandler import treeHandler

    def remove_tail(path):
        index = path.rfind('/')
        return path[:index] if index != -1 else '.'

    th = treeHandler()
    ### tempImageList is a list of the paths of all files with the '.jpg' extension
    tempImageList = th.getFiles('JPEGImages', ['jpg'])
    print(len(tempImageList))

    ### current method: split every path and re-join all but the last part
    start = time.perf_counter()
    originalSubFolderList = list(set(list(map(lambda x: os.path.join(*x.split('/')[:-1]), tempImageList))))
    print("Current method takes", time.perf_counter() - start)

    ### new method: slice every path at its last '/'
    start = time.perf_counter()
    newSubFolderList = list(set([remove_tail(path) for path in tempImageList]))
    print("New method takes", time.perf_counter() - start)

    ### compare as sets, because the order of list(set(...)) is not guaranteed
    print("Do the outputs match:", set(originalSubFolderList) == set(newSubFolderList))
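
    If the full list of image paths is not needed for anything else, another option is to skip the per-file string handling and collect the folders directly while walking the tree. The following is only a sketch of that idea using os.walk from the standard library; it was not part of the timing above, and the get_folders name, its extensions parameter, and the temporary demo tree are assumptions made for illustration.

```python
import os
import tempfile

def get_folders(input_path, extensions=('.jpg',)):
    """Return the folders under input_path that directly contain a matching file."""
    extensions = tuple(extensions)  # str.endswith accepts a tuple of suffixes
    folders = []
    for dirpath, _dirnames, filenames in os.walk(input_path):
        # One match is enough, so any() stops at the first image in a folder.
        if any(name.lower().endswith(extensions) for name in filenames):
            folders.append(dirpath)
    return folders

# Demo on a throwaway tree that mirrors the layout in the question.
with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, 'subFolder1', 'subFolder11')
    flat = os.path.join(root, 'folder2')
    os.makedirs(nested)
    os.makedirs(flat)
    for folder, name in [(nested, 'file1.jpg'), (nested, 'file2.jpg'), (flat, 'file021.jpg')]:
        open(os.path.join(folder, name), 'w').close()
    # Report paths relative to the root so the output is easy to read.
    found = sorted(os.path.relpath(p, root) for p in get_folders(root))

print(found)
```

    Each folder is visited once and appended at most once, so no set is needed to remove duplicates. Whether this is faster than slicing an existing path list depends on the dataset and would need to be measured.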