
How can I improve the performance of finding all files in a folder created on a certain date?


There are 10,000 files in a folder. Some files were created on 2018-06-01, some on 2018-06-09, and so on.

I need to find all files that were created on 2018-06-09, but it is taking too much time (almost 2 hours) to read each file, get its creation date, and then keep the files created on 2018-06-09.

import os
from datetime import datetime

for file in os.scandir(Path):
    if file.is_file():
        file_ctime = datetime.fromtimestamp(os.path.getctime(file)).strftime('%Y-%m-%d %H:%M:%S')
        if file_ctime[0:10] == '2018-06-09':
            pass  # ...

Solution

  • Let's start with the most basic thing - why are you building a datetime only to re-format it as a string and then do a string comparison?

    Then there is the whole point of using os.scandir() over os.listdir() - os.scandir() yields os.DirEntry objects, which cache file stats through the os.DirEntry.stat() call.

    Depending on the checks you need to perform, os.listdir() might even perform better if you expect to do a lot of filtering on the filename, since then you won't need to build up a whole os.DirEntry just to discard it.

    So, to optimize your loop, if you don't expect a lot of filtering on the name:

    # 1528495200 and 1528581600 are the POSIX timestamps for 2018-06-09 00:00
    # and 2018-06-10 00:00 in the local timezone (UTC+2 here) - adjust for yours
    for entry in os.scandir(Path):
        if entry.is_file() and 1528495200 <= entry.stat().st_ctime < 1528581600:
            pass  # do whatever you need with it
    

    If you do, then it's better to stick with os.listdir():

    import os
    import stat

    for entry in os.listdir(Path):
        # do your filtering on the entry name first (skip non-matches here)...
        path = os.path.join(Path, entry)  # build the path to the listed entry...
        stats = os.stat(path)  # cache the file entry statistics
        if stat.S_ISREG(stats.st_mode) and 1528495200 <= stats.st_ctime < 1528581600:
            pass  # do whatever you need with it
    

    If you want to be flexible with the date, use datetime.datetime.timestamp() beforehand to get the POSIX timestamps for your bounds; you can then compare them against what stat_result.st_ctime returns directly, without any per-file conversion, as in the sketch below.
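
    For example, a minimal sketch of that idea (assuming a local-time date and reusing the Path variable from the question):

    from datetime import datetime, timedelta
    import os

    # POSIX timestamps for local midnight on 2018-06-09 and on the following day
    day_start = datetime(2018, 6, 9).timestamp()
    day_end = (datetime(2018, 6, 9) + timedelta(days=1)).timestamp()

    matches = [
        entry.path
        for entry in os.scandir(Path)
        if entry.is_file() and day_start <= entry.stat().st_ctime < day_end
    ]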

    However, even your original, non-optimized approach should be significantly faster than 2 hours for a mere 10k entries. I'd check the underlying filesystem, too; something else seems wrong there. A quick way to rule the scan itself in or out is sketched below.
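
    As a rough check (again assuming the same Path), timing a bare scan-and-stat pass over the folder will tell you whether the directory access itself is the bottleneck:

    import os
    import time

    start = time.perf_counter()
    count = 0
    for entry in os.scandir(Path):
        if entry.is_file():
            entry.stat()  # fetch the (cached) stat info for each regular file
            count += 1
    print(f"stat()'ed {count} files in {time.perf_counter() - start:.2f} seconds")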