Imagine I have a folder with the following items: default.xml, df_ak01.1001.jpg, df_ak01.1002.jpg, df_ak01.1003.jpg, df_ak01.1005.jpg, df_ak01.1006.jpg
(Here df_ak01.1004.jpg is missing, which is very difficult to spot when there are thousands of files in the directory.) The program should be able to run on any directory, and the file name part (here df_ak01) can vary every time. Can someone help me with this?
I was able to get the current working directory where the program is being run, but I couldn't think of the logic for finding the file name part when it is generic and mostly unknown.
I just created a regex to search for files with df_ak01 in their names and list them (but that's not a good way to do it). I still haven't worked out how to find the missing image.
import os
import re
current = os.getcwd()
# So far I've only implemented listing the files that match 'df_ak01'
# (dots are escaped so they match a literal '.', not any character)
a = [x for x in os.listdir(current) if re.match(r'df_ak01\..*\.jpg$', x)]
print(a)
So I'd like to get output something like:
1 default.xml
3 df_ak01.%04d.jpg 1001-1003
2 df_ak01.%04d.jpg 1005-1006
You can do it as follows. Start by matching numbers with 4 or more digits (the regex "\d{4,}" matches 4 or more digits) and extract them all. Then group consecutive numbers together using more_itertools.consecutive_groups (more_itertools is a third-party package, so it has to be installed first), build the result list, and print it.
import re
from more_itertools import consecutive_groups
files = ["default.xml", "df_ak01.1001.jpg", "df_ak01.1002.jpg", "df_ak01.1003.jpg", "df_ak01.1005.jpg", "df_ak01.1006.jpg"]
#Pattern to match numbers with 4 or more digits (raw string so \d is not treated as a string escape)
pattern = re.compile(r"\d{4,}")
#Extract all numbers
a = [int(pattern.search(x).group(0)) for x in files if pattern.search(x)]
#[1001, 1002, 1003, 1005, 1006]
#Group consecutive numbers together
cons_groups = [list(group) for group in consecutive_groups(a)]
#[[1001, 1002, 1003], [1005, 1006]]
#Create result list
result = [ [len(x), '{}-{}'.format(x[0], x[-1])] for x in cons_groups]
#[[3, '1001-1003'], [2, '1005-1006']]
#Print the result list
for item in result:
    print('{} df_ak01.%04d.jpg {}'.format(item[0], item[1]))
The output will be:
3 df_ak01.%04d.jpg 1001-1003
2 df_ak01.%04d.jpg 1005-1006
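The snippet above hardcodes df_ak01 and skips default.xml, so here is a rough sketch of how the same idea could be extended to an unknown prefix, using only the standard library. It is not the only way to do it, and it rests on some assumptions: file names are taken to look like prefix.digits.extension, anything else is reported as a single file, and the names `summarize` and `SEQ_PATTERN` are ones I made up for illustration. It also sorts the frame numbers before grouping, since os.listdir does not guarantee any order.

```python
import os
import re
from collections import defaultdict

# Assumed naming scheme: <prefix>.<frame digits>.<extension>, e.g. df_ak01.1001.jpg
SEQ_PATTERN = re.compile(r"^(?P<prefix>.+)\.(?P<frame>\d+)\.(?P<ext>[^.]+)$")


def summarize(names):
    """Return report lines of the form '<count> <name or template> [<first>-<last>]'."""
    singles = []                    # names that do not look like sequence frames
    sequences = defaultdict(list)   # (prefix, padding, ext) -> frame numbers

    for name in names:
        match = SEQ_PATTERN.match(name)
        if match is None:
            singles.append(name)
            continue
        frame = match.group("frame")
        key = (match.group("prefix"), len(frame), match.group("ext"))
        sequences[key].append(int(frame))

    lines = ["1 {}".format(name) for name in sorted(singles)]
    for (prefix, padding, ext), frames in sorted(sequences.items()):
        frames.sort()
        template = "{}.%0{}d.{}".format(prefix, padding, ext)
        start = previous = frames[0]
        # The trailing None forces the last run to be written out.
        for frame in frames[1:] + [None]:
            if frame is not None and frame == previous + 1:
                previous = frame    # still inside the same consecutive run
                continue
            lines.append("{} {} {}-{}".format(previous - start + 1, template, start, previous))
            start = previous = frame
    return lines


if __name__ == "__main__":
    # Report on the directory the script is run from.
    for line in summarize(os.listdir(os.getcwd())):
        print(line)
```

Every gap in the numbering starts a new line in the report, so a missing frame such as 1004 shows up as a break between two ranges. Note that os.listdir also returns subdirectory names, which this sketch would list as single entries, and that frames whose digit count differs (for example 9999 and 10000) end up under different templates.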