Tags: python, random, dataset, subdirectory

Randomly select x files in each subdirectory


I need to randomly pick exactly 10 files (images) from a dataset, but the dataset is hierarchically structured.

So for each subdirectory that contains images, I need to keep just 10 of them, chosen at random. Is there an easy way to do that, or should I do it manually?

import os
import random

def getListOfFiles(dirName):
    ### create a list of file and sub directories 
    ### names in the given directory 
    listOfFile = os.listdir(dirName)
    allFiles = list()
    ### Iterate over all the entries
    for entry in listOfFile:

        ### Create full path
        fullPath = os.path.join(dirName, entry)
        ### If entry is a directory then get the list of files in this directory 
        if os.path.isdir(fullPath):
            allFiles = allFiles + getListOfFiles(fullPath)
        else:
            allFiles.append(random.sample(fullPath, 10))
    return allFiles

dirName = 'C:/Users/bla/bla'

### Get the list of all files in directory tree at given path
listOfFiles = getListOfFiles(dirName)

with open("elements.txt", mode='x') as f:
    for elem in listOfFiles:
        f.write(elem + '\n')

Solution

  • A good way to sample from a directory listing of unknown size is reservoir sampling. With this approach you don't have to list all the files in the directory up front: you read the entries one by one and sample as you go. It also works when you have to sample a fixed number of files across multiple directories.

    It is best combined with generator-based directory scanning such as os.scandir, which yields one entry at a time, so you don't use gobs of memory up front to hold all the file names.

    Something along these lines (NB: untested code!)

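    import os

    import numpy as np

    def ResSampleFiles(dirname, N):
        """Pick up to N files from a directory using reservoir sampling."""
        sampled_files = []
        k = 0  # number of files seen so far
        for item in os.scandir(dirname):
            # Only files are sampled; subdirectories are skipped.
            if item.is_dir():
                continue
            full_path = os.path.join(dirname, item.name)
            if k < N:
                # Fill the reservoir with the first N files.
                sampled_files.append(full_path)
            else:
                # idx is uniform on 0..k (the upper bound k + 1 is exclusive),
                # so this file is kept with probability N / (k + 1).
                idx = np.random.randint(0, k + 1)
                if idx < N:
                    sampled_files[idx] = full_path
            k += 1

        return sampled_files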
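
To cover the "10 files per subdirectory" part of the question, the same reservoir idea could be applied while walking the tree. The following is only a sketch under some assumptions: the names `sample_per_directory`, `n` and `rng` are hypothetical, it uses the standard `random` module instead of NumPy, and it samples every regular file without checking that it is an image.

```python
import os
import random


def sample_per_directory(root, n=10, rng=random):
    """Return {directory: [up to n randomly chosen file paths]} for the tree under root.

    Hypothetical helper: each directory is scanned once with os.scandir, and
    its files are reservoir-sampled, so no full file list is held in memory.
    """
    result = {}
    pending = [root]  # directories still to visit
    while pending:
        dirname = pending.pop()
        sample = []
        seen = 0  # number of files seen so far in this directory
        with os.scandir(dirname) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    # Visit real subdirectories later; symlinked ones are ignored.
                    pending.append(entry.path)
                    continue
                if not entry.is_file():
                    continue
                if seen < n:
                    # Fill the reservoir with the first n files.
                    sample.append(entry.path)
                else:
                    # Keep this file with probability n / (seen + 1).
                    idx = rng.randrange(seen + 1)
                    if idx < n:
                        sample[idx] = entry.path
                seen += 1
        if sample:
            # Directories without files are left out of the result.
            result[dirname] = sample
    return result
```

Directories holding fewer than `n` files simply contribute all of their files. The values of the returned dictionary can then be written one path per line, as the question does with elements.txt.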