Search code examples
pythonlistparsingsequencefilenames

How to identify files that have increasing numbers and a similar form of filename?


I have a directory of files, some of them image files. Some of those image files are a sequence of images. They could be named image-000001.png, image-000002.png and so on, or perhaps 001_sequence.png, 002_sequence.png etc.

How can we identify images that would, to a human, appear by their names to be fairly obviously in a sequence? This would mean identifying only those image filenames that have increasing numbers and all have a similar form of filename.

The similar part of the filename would not be pre-defined.


Solution

  • You can use a regular expression to get files adhering to a certain pattern, e.g. .*\d+.*\.(jpg|png) for anything, then a number, then more anything, and an image extension.

    files = ["image-000001.png", "image-000002.png", "001_sequence.png", 
             "002_sequence.png", "not an image 1.doc", "not an image 2.doc", 
             "other stuff.txt", "singular image.jpg"]
    
    import re
    image_files = [f for f in files if re.match(r".*\d+.*\.(jpg|png)", f)]
    

    Now, group those image files by replacing the number with some generic string, e.g. XXX:

    patterns = collections.defaultdict(list)
    for f in image_files:
        p = re.sub("\d+", "XXX", f)
        patterns[p].append(f)
    

    As a result, patterns is

    {'image-XXX.png': ['image-000001.png', 'image-000002.png'], 
     'XXX_sequence.png': ['001_sequence.png', '002_sequence.png']}
    

    Similarly, it should not be too hard to check whether all those numbers are consecutive, but maybe that's not really necessary after all. Note, however, that this will have problems discriminating numbered series such as "series1_001.jpg", and "series2_001.jpg".