python regex python-2.7 formatting string-formatting

How to combine files with similar file names together in Python?


Imagine I have a folder with the following items:

default.xml
df_ak01.1001.jpg
df_ak01.1002.jpg
df_ak01.1003.jpg
df_ak01.1005.jpg
df_ak01.1006.jpg

(Here we can see that df_ak01.1004.jpg is missing, which is very difficult to spot if there are thousands of files in the directory.) The program should be able to run on any directory, and the file name part (here df_ak01) can vary every time. Can someone help me with this?

I was able to get the current working directory where the program is being run, but I couldn't think of a logic for finding the file name part when it is generic and mostly unknown.

I just created a regex to search for files with df_ak01 in their names and list them (but that's not a good way to do it). I still haven't worked out how to find the missing image.

import os
import re

current = os.getcwd()

# I've just implemented the listing of files that match 'df_ak01'
# (dots are escaped so that they only match a literal '.')
a = [x for x in os.listdir(current) if re.match(r'df_ak01\..*\.jpg$', x)]
print(a)

So I'd like to get an output something like:

1 default.xml
3 df_ak01.%04d.jpg   1001-1003
2 df_ak01.%04d.jpg   1005-1006

Solution

  • You can do it as follows. First extract the numbers that have 4 or more digits from the file names (the regex "\d{4,}" matches 4 or more digits). Then group consecutive numbers together using more_itertools.consecutive_groups (a third-party package), build the result list from the groups, and print it.

    import re
    from more_itertools import consecutive_groups

    files = ["default.xml", "df_ak01.1001.jpg", "df_ak01.1002.jpg", "df_ak01.1003.jpg", "df_ak01.1005.jpg", "df_ak01.1006.jpg"]

    # Pattern to match numbers with 4 or more digits
    pattern = re.compile(r"\d{4,}")

    # Extract all numbers; sort them because consecutive_groups
    # expects ordered input and directory listings are unordered
    matches = (pattern.search(x) for x in files)
    a = sorted(int(m.group(0)) for m in matches if m)
    # [1001, 1002, 1003, 1005, 1006]

    # Group consecutive numbers together
    cons_groups = [list(group) for group in consecutive_groups(a)]
    # [[1001, 1002, 1003], [1005, 1006]]

    # Create result list: [count, 'first-last'] for each group
    result = [[len(x), '{}-{}'.format(x[0], x[-1])] for x in cons_groups]
    # [[3, '1001-1003'], [2, '1005-1006']]

    # Print the result list (the df_ak01 prefix is hard-coded here)
    for item in result:
        print('{} df_ak01.%04d.jpg {}'.format(item[0], item[1]))
    

    The output will be

    3 df_ak01.%04d.jpg 1001-1003
    2 df_ak01.%04d.jpg 1005-1006
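
The snippet above hard-codes the df_ak01 prefix and skips default.xml, while the question asks for something that runs on any directory. A minimal sketch of one way to generalize it follows. It assumes every sequence file is named `<prefix>.<digits>.<extension>`; the names `FRAME_RE`, `summarize` and `missing_frames` are made up for this illustration, and `itertools.groupby` from the standard library is swapped in for `consecutive_groups`, so no extra package is needed.

```python
import re
from collections import defaultdict
from itertools import groupby

# Assumed naming scheme: <prefix>.<frame digits>.<extension>
FRAME_RE = re.compile(r'^(?P<prefix>.+)\.(?P<frame>\d+)\.(?P<ext>[^.]+)$')


def summarize(filenames):
    """Return (count, label, frame_range) rows.

    frame_range is '' for files that are not part of a numbered sequence.
    """
    sequences = defaultdict(list)  # (prefix, padding, ext) -> frame numbers
    singles = []
    for name in filenames:
        match = FRAME_RE.match(name)
        if match:
            key = (match.group('prefix'),
                   len(match.group('frame')),
                   match.group('ext'))
            sequences[key].append(int(match.group('frame')))
        else:
            singles.append(name)

    rows = [(1, name, '') for name in sorted(singles)]
    for (prefix, padding, ext), frames in sorted(sequences.items()):
        label = '{}.%0{}d.{}'.format(prefix, padding, ext)
        frames = sorted(set(frames))
        # In a sorted list, consecutive numbers share the same
        # (value - position) difference, so groupby splits at each gap.
        for _, pairs in groupby(enumerate(frames),
                                key=lambda pair: pair[1] - pair[0]):
            run = [frame for _, frame in pairs]
            rows.append((len(run), label, '{}-{}'.format(run[0], run[-1])))
    return rows


def missing_frames(frames):
    """Return the numbers absent between the smallest and largest frame."""
    present = set(frames)
    return [n for n in range(min(present), max(present) + 1)
            if n not in present]


files = ["default.xml", "df_ak01.1001.jpg", "df_ak01.1002.jpg",
         "df_ak01.1003.jpg", "df_ak01.1005.jpg", "df_ak01.1006.jpg"]

for count, label, frame_range in summarize(files):
    print('{} {} {}'.format(count, label, frame_range).rstrip())
```

To run it on a real folder, pass `os.listdir(os.getcwd())` instead of the sample list. For the sample list, `missing_frames([1001, 1002, 1003, 1005, 1006])` returns `[1004]`, which is the missing image the question asks about. Files whose names do not fit the assumed scheme are simply listed once each, the way default.xml is in the desired output.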