I have a directory of files, some of them image files. Some of those image files are a sequence of images. They could be named image-000001.png, image-000002.png and so on, or perhaps 001_sequence.png, 002_sequence.png, etc.
How can we identify images that would, to a human, appear by their names to be fairly obviously in a sequence? This would mean identifying only those image filenames that have increasing numbers and all have a similar form of filename.
The similar part of the filename would not be pre-defined.
You can use a regular expression to get files adhering to a certain pattern, e.g. .*\d+.*\.(jpg|png) for anything, then a number, then more anything, and an image extension.
files = ["image-000001.png", "image-000002.png", "001_sequence.png",
"002_sequence.png", "not an image 1.doc", "not an image 2.doc",
"other stuff.txt", "singular image.jpg"]
import re
image_files = [f for f in files if re.match(r".*\d+.*\.(jpg|png)", f)]
Now, group those image files by replacing the number with some generic string, e.g. XXX:
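import collections
patterns = collections.defaultdict(list)
for f in image_files:
    # Replace every run of digits with a placeholder to get the filename's "shape".
    p = re.sub(r"\d+", "XXX", f)
    patterns[p].append(f)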
As a result, patterns contains the following groups:
{'image-XXX.png': ['image-000001.png', 'image-000002.png'],
'XXX_sequence.png': ['001_sequence.png', '002_sequence.png']}
Similarly, it should not be too hard to check whether the numbers in each group are consecutive, but maybe that's not really necessary after all. Note, however, that because every run of digits is replaced, this cannot tell apart separately numbered series such as "series1_001.jpg" and "series2_001.jpg": both map to the same pattern, seriesXXX_XXX.jpg.
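One possible way to handle both the consecutiveness check and filenames with several numbers is to replace only one run of digits at a time, so that each file yields one key per numeric field, and then keep only the groups whose numbers increase in steps of one. The following is a minimal sketch under those assumptions, not a definitive solution; the function name find_sequences, the {N} placeholder and the extensions parameter are made up for illustration.

```python
import re
from collections import defaultdict


def find_sequences(filenames, extensions=("jpg", "png")):
    """Group image filenames that differ in exactly one numeric field
    and whose values in that field are consecutive integers.

    Hypothetical helper: the name, the "{N}" placeholder and the default
    extensions are assumptions made for this sketch.
    """
    ext_pattern = re.compile(
        r".*\.(?:%s)" % "|".join(re.escape(e) for e in extensions),
        re.IGNORECASE,
    )
    groups = defaultdict(list)
    for name in filenames:
        if not ext_pattern.fullmatch(name):
            continue
        # Build one key per run of digits, replacing only that run.
        # "series1_001.jpg" and "series2_001.jpg" therefore do not
        # collapse into a single "seriesXXX_XXX.jpg" group.
        for m in re.finditer(r"\d+", name):
            key = name[:m.start()] + "{N}" + name[m.end():]
            groups[key].append((int(m.group()), name))

    sequences = {}
    for key, members in groups.items():
        members.sort()  # sort by the numeric value of the varying field
        numbers = [n for n, _ in members]
        consecutive = all(b - a == 1 for a, b in zip(numbers, numbers[1:]))
        # A single file is not a sequence; gaps or repeats disqualify a group.
        if len(members) >= 2 and consecutive:
            sequences[key] = [name for _, name in members]
    return sequences


if __name__ == "__main__":
    files = ["image-000001.png", "image-000002.png", "001_sequence.png",
             "002_sequence.png", "series1_001.jpg", "series1_002.jpg",
             "series2_007.jpg", "series2_008.jpg", "not an image 1.doc",
             "other stuff.txt", "singular image.jpg"]
    for pattern, names in find_sequences(files).items():
        print(pattern, names)
```

With this approach a file can end up in more than one group when several of its numeric fields vary across the directory (for example a "series" number and a "frame" number), so some extra rule for choosing between overlapping groups may still be needed. Requiring a step of exactly one is also a choice; relaxing it to "strictly increasing" would accept sequences with gaps.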