Tags: python, loops, sorting, unique

Sort, group and process files based on an embedded timestamp in the filename


How do I sort these files by a date string embedded in each filename? And then I would like to loop over all the files created on the same day.

I can do this in the shell, but it is very slow. I'd like to do the same in Python.

Sample file list (there are 2200 files total)

  1. Tyler Cowen On Reading 202109200657.md
  2. On Poems 202109210659.md
  3. Slava Akhmechet On Reading In Clusters 202109200659.md
  4. Ideation In A 4X4 Matrix 202109200717.md
  5. Drawing Grid Ideation 202109220830.md
  6. Dictation 201208251425.md

The output would look like this (for eventual graphing with Plotly):

20120825,1  
20210920,3  
20210921,1  
20210922,1  

I want to sort by document count per day (highest first), then by date within each count. So rows 1, 3 and 4 of the output above, the days with one document each, would be listed in date order:

20210920,3
20120825,1  
20210921,1  
20210922,1  

Then I would like to do other things with each day's documents, such as getting the total word count for the day.


Solution

  • If you're trying to replace a shell script, your Python script will probably need to do the following.

    1. List the contents of a directory to get the filenames.
    2. Extract the date from the filenames (assuming a regular expression pattern match of \d{8} is good enough to extract the date).
    3. Sort or otherwise group the files by the extracted date.
    4. Iterate over those groups to do something.

    import pathlib
    import re
    from collections import defaultdict
    
    date_pattern = re.compile(r"\d{8}")
    target_dir = pathlib.Path("myfolder")
    
    # Files is a dictionary mapping a date to the list of files with that date
    files = defaultdict(list)
    for child in target_dir.iterdir():
        # Skip directories
        if child.is_dir():
            continue
        match = date_pattern.search(child.name)
        # Skip files that do not match the date pattern
        if match is None:
            continue
        file_date = match.group()
        files[file_date].append(child)
    
    for date, names in files.items():
        for filename in names:
            # Do something
            print(date, filename)
    
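One caveat about the pattern: \d{8} matches the first run of eight digits anywhere in the name, so a title that itself contains a long number would be picked up instead of the timestamp. Below is a minimal sketch of a stricter extraction. It assumes, based only on the sample list, that the timestamp is always 12 digits in YYYYMMDDhhmm form directly before the .md extension. The names STAMP_PATTERN and extract_date are hypothetical helpers, not part of the code above.

```python
import re
from datetime import datetime
from typing import Optional

# Assumption: a 12-digit YYYYMMDDhhmm stamp sits right before ".md".
STAMP_PATTERN = re.compile(r"(\d{12})\.md$")


def extract_date(filename: str) -> Optional[str]:
    """Return the YYYYMMDD part of the trailing timestamp, or None."""
    match = STAMP_PATTERN.search(filename)
    if match is None:
        return None
    stamp = match.group(1)
    try:
        # Reject digit runs that are not real dates/times, e.g. month 13.
        datetime.strptime(stamp, "%Y%m%d%H%M")
    except ValueError:
        return None
    return stamp[:8]


print(extract_date("Tyler Cowen On Reading 202109200657.md"))  # 20210920
print(extract_date("Notes without a timestamp.md"))            # None
```

In the loop above, extract_date(child.name) could stand in for the date_pattern.search call, with a None result treated the same way as a failed match.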

    Edit: sort by the date

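    To process the groups in date order, replace the last loop with the one below. Because the dates are YYYYMMDD strings, sorting them as text also sorts them chronologically.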

    for date in sorted(files):
        for filename in files[date]:
            # Do something
            print(date, filename)
    

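    You could also iterate over the sorted (date, files) pairs directly:

    for date, names in sorted(files.items(), key=lambda d: d[0]):
        for filename in names:
            # Do something
            print(date, filename)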
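The question also asks for a date,count listing ordered by document count and then by date, plus a per-day word total. Here is a minimal sketch of both, working on a mapping shaped like files above. It assumes the count ordering is descending (as in the question's expected output), that the files are UTF-8 text, and that a "word" is any whitespace-separated token. The names count_rows and words_per_day are hypothetical helpers.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def count_rows(files: Dict[str, List[Path]]) -> List[Tuple[str, int]]:
    """Return (date, count) pairs: highest count first, then by date."""
    rows = [(date, len(paths)) for date, paths in files.items()]
    # Negating the count sorts it descending while the date stays ascending.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


def words_per_day(files: Dict[str, List[Path]]) -> Dict[str, int]:
    """Return the total whitespace-separated word count for each date."""
    totals = {}
    for date, paths in files.items():
        totals[date] = sum(
            len(path.read_text(encoding="utf-8").split()) for path in paths
        )
    return totals


# Stand-in for the mapping built from the directory listing,
# using the six sample filenames from the question.
sample = {
    "20210920": [
        Path("Tyler Cowen On Reading 202109200657.md"),
        Path("Slava Akhmechet On Reading In Clusters 202109200659.md"),
        Path("Ideation In A 4X4 Matrix 202109200717.md"),
    ],
    "20210921": [Path("On Poems 202109210659.md")],
    "20210922": [Path("Drawing Grid Ideation 202109220830.md")],
    "20120825": [Path("Dictation 201208251425.md")],
}

for date, count in count_rows(sample):
    print(f"{date},{count}")
```

With the six sample filenames this prints 20210920,3 first, followed by the three single-document days in date order. words_per_day needs the files to exist on disk, so it is defined here but not called on the sample.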