How do I sort these files by a date string embedded in each filename? I would then like to loop over all the files created on the same day.
I can do this in the shell, but it is very slow. I'd like to do the same in Python.
Sample file list (there are 2200 files total)
Output would look like this (for eventual graphing with Plotly):
20120825,1
20210920,3
20210921,1
20210922,1
I want to sort by doc count on a given day, then within doc count by date. So results 1, 3 and 4 above would be listed in date order:
20210920,3
20120825,1
20210921,1
20210922,1
Then I would like to do other things with each day's documents, such as getting the total word count for the day.
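If you're trying to replace a shell script, your Python script will probably need to do the following: list the files in the folder, extract the date from each filename (a regular expression such as \d{8} is good enough to extract the date), and group the files by date.
import pathlib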
import re
from collections import defaultdict
date_pattern = re.compile(r"\d{8}")
target_dir = pathlib.Path("myfolder")
# Files is a dictionary mapping a date to the list of files with that date
files = defaultdict(list)
for child in target_dir.iterdir():
    # Skip directories
    if child.is_dir():
        continue
    match = date_pattern.search(child.name)
    # Skip files that do not match the date pattern
    if match is None:
        continue
    file_date = match.group()
    files[file_date].append(child)
for date, names in files.items():
    for filename in names:
        # Do something
        print(date, filename)
To sort by the date, the last code block can be modified.
for date in sorted(files):
    for filename in files[date]:
        # Do something
        print(date, filename)
Alternatively, you could iterate over the sorted items directly: for date, names in sorted(files.items(), key=lambda d: d[0]):
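The question also asks for output ordered by document count and then by date, plus a per-day word count. Here is a minimal sketch of one way that might look. The helper names rows_by_count_then_date and words_for_day are hypothetical, the sample dictionary only stands in for the real files mapping built above, and the word count assumes the files are plain text split on whitespace.

```python
from collections import defaultdict
from pathlib import Path


def rows_by_count_then_date(files):
    """Return (date, count) pairs: busiest day first, ties in date order."""
    return sorted(
        ((date, len(paths)) for date, paths in files.items()),
        # Negative count sorts larger counts first; YYYYMMDD strings
        # sort chronologically when compared as text.
        key=lambda row: (-row[1], row[0]),
    )


def words_for_day(paths):
    """Total whitespace-separated words in one day's files (assumes text files)."""
    return sum(len(Path(p).read_text(errors="ignore").split()) for p in paths)


# Hypothetical sample data standing in for the real `files` dictionary.
sample = defaultdict(list)
for date, name in [
    ("20120825", "a_20120825.txt"),
    ("20210920", "b_20210920.txt"),
    ("20210920", "c_20210920.txt"),
    ("20210920", "d_20210920.txt"),
    ("20210921", "e_20210921.txt"),
    ("20210922", "f_20210922.txt"),
]:
    sample[date].append(Path(name))

for date, count in rows_by_count_then_date(sample):
    print(f"{date},{count}")
```

With the real dictionary, calling words_for_day(files[date]) inside the loop would give the total for that day. Reading 2200 files this way should be fine for small documents, but that is a guess about your data rather than something measured.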