I want to group a list of strings by similarity.
In my case (simplified here, because the real list can be huge) it's a list of paths to zip files like this:
["path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip",
"path/002_AC_NOLAS_20230326_234440_145T2_20160_06473_VMS_UMS.zip",
"path/002_AC_HELICOPTEROS-MARINOS_20230329_211105_145T3_21049_00748_FDCR.zip",
"path/002_AC_HELICOPTEROS-MARINOS_20230329_205916_145T3_21049_00747_VMS_UMS.zip",
"path/002_AC_NOLAS_20230326_235504_145T2_20160_06473_FDCR.zip"]
I would like to group the strings in that list by a key, but I don't know how to define that key yet (I guess with a lambda, but I can't figure it out). The result should be a list of lists like this:
[["path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip",
"path/002_AC_HELICOPTEROS-MARINOS_20230329_211105_145T3_21049_00748_FDCR.zip"],
["path/002_AC_HELICOPTEROS-MARINOS_20230329_205916_145T3_21049_00747_VMS_UMS.zip"],
["path/002_AC_NOLAS_20230326_234440_145T2_20160_06473_VMS_UMS.zip",
"path/002_AC_NOLAS_20230326_235504_145T2_20160_06473_FDCR.zip"]]
To give you an example, the first grouping key would be:
*_HELICOPTEROS-MARINOS_20230329_*_21049_00748_*.zip
second would be:
*_HELICOPTEROS-MARINOS_20230329_*_21049_00747_*.zip
and third:
*_NOLAS_20230326_*_20160_06473_*.zip
It's all about extracting the key that is used to group the file names together.
Here's a simplified function, extract_features, that assumes the first eight underscore-separated fields of the file name always follow the standard format (extra underscores later in the name, as in VMS_UMS, do no harm). Adapt it to your file naming convention to extract the required key, then group the paths with itertools.groupby(). Note that groupby() only merges adjacent items with equal keys, so the list has to be sorted by the same key first:
from itertools import groupby

def extract_features(f):
    # Drop the directory part, then split the file name on underscores.
    filename = f.split('/')[-1]
    parts = filename.split('_')
    # Fields 2, 3, 6 and 7 make up the grouping key.
    return (parts[2], parts[3], parts[6], parts[7])

data = ["path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip",
        "path/002_AC_NOLAS_20230326_234440_145T2_20160_06473_VMS_UMS.zip",
        "path/002_AC_HELICOPTEROS-MARINOS_20230329_211105_145T3_21049_00748_FDCR.zip",
        "path/002_AC_HELICOPTEROS-MARINOS_20230329_205916_145T3_21049_00747_VMS_UMS.zip",
        "path/002_AC_NOLAS_20230326_235504_145T2_20160_06473_FDCR.zip"]

# groupby() only groups consecutive items, so sort by the same key first.
data.sort(key=extract_features)

output = []
for k, g in groupby(data, extract_features):
    output.append(list(g))
print(output)
Output:
[['path/002_AC_HELICOPTEROS-MARINOS_20230329_205916_145T3_21049_00747_VMS_UMS.zip'],
['path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip', 'path/002_AC_HELICOPTEROS-MARINOS_20230329_211105_145T3_21049_00748_FDCR.zip'],
['path/002_AC_NOLAS_20230326_234440_145T2_20160_06473_VMS_UMS.zip', 'path/002_AC_NOLAS_20230326_235504_145T2_20160_06473_FDCR.zip']]
For example, for path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip
the key used for both sorting and grouping would be ('HELICOPTEROS-MARINOS', '20230329', '21049', '00748')
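
Since the question mentions that the real list can be huge, a dictionary-based variant may be worth considering: it collects the groups in a single pass and skips the sort. This is only a sketch under the same file name assumptions as above; group_paths and sample are hypothetical names introduced here for illustration, and the groups come out in first-seen order rather than sorted key order.

```python
from collections import defaultdict

def extract_features(f):
    # Same key as in the answer above: fields 2, 3, 6 and 7 of the file name.
    parts = f.split('/')[-1].split('_')
    return (parts[2], parts[3], parts[6], parts[7])

def group_paths(paths):
    # Hypothetical helper: one pass over the input, no sorting needed.
    groups = defaultdict(list)
    for p in paths:
        groups[extract_features(p)].append(p)
    # Dicts keep insertion order (Python 3.7+), so groups appear
    # in the order their key was first seen.
    return list(groups.values())

sample = ["path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip",
          "path/002_AC_NOLAS_20230326_234440_145T2_20160_06473_VMS_UMS.zip",
          "path/002_AC_HELICOPTEROS-MARINOS_20230329_211105_145T3_21049_00748_FDCR.zip"]
print(group_paths(sample))
```

Unlike the groupby() version, this leaves the input list unmodified and keeps each group's members in their original relative order.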