
Group a list of strings by similar values


I want to group a list of strings by similarity.

In my case (simplified here, because the real list can be huge) it's a list of paths to zip files like this:

["path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip",
 "path/002_AC_NOLAS_20230326_234440_145T2_20160_06473_VMS_UMS.zip",
 "path/002_AC_HELICOPTEROS-MARINOS_20230329_211105_145T3_21049_00748_FDCR.zip",
 "path/002_AC_HELICOPTEROS-MARINOS_20230329_205916_145T3_21049_00747_VMS_UMS.zip",
 "path/002_AC_NOLAS_20230326_235504_145T2_20160_06473_FDCR.zip"]

I would like to group the strings in that list by a key, but I don't know how to define that key yet (I guess with a lambda, but I can't figure it out). The result should be a list like this:

[["path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip",
  "path/002_AC_HELICOPTEROS-MARINOS_20230329_211105_145T3_21049_00748_FDCR.zip"],
 ["path/002_AC_HELICOPTEROS-MARINOS_20230329_205916_145T3_21049_00747_VMS_UMS.zip"],
 ["path/002_AC_NOLAS_20230326_234440_145T2_20160_06473_VMS_UMS.zip",
  "path/002_AC_NOLAS_20230326_235504_145T2_20160_06473_FDCR.zip"]]

For example, the first grouping key would be:

*_HELICOPTEROS-MARINOS_20230329_*_21049_00748_*.zip

second would be:

*_HELICOPTEROS-MARINOS_20230329_*_21049_00747_*.zip

and third:

*_NOLAS_20230326_*_20160_06473_*.zip

Solution

  • It's all about extracting the key that decides which file names belong together.

    Here's a simplified function, extract_features, which assumes that the file name contains no additional _ beyond its standard format. Adapt it to your file name convention so that it returns the required key, then group the paths with itertools.groupby(). Because groupby() only groups consecutive items that share a key, the list has to be sorted by that same key first.

    from itertools import groupby
    
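    def extract_features(f):
        # Drop the directory, then split the file name into its '_'-separated fields.
        filename = f.split('/')[-1]
        parts = filename.split('_')
        # Fields 2, 3, 6 and 7: name, date and the two numeric codes before the suffix.
        return (parts[2], parts[3], parts[6], parts[7])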
    
    data = ["path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip",
     "path/002_AC_NOLAS_20230326_234440_145T2_20160_06473_VMS_UMS.zip",
     "path/002_AC_HELICOPTEROS-MARINOS_20230329_211105_145T3_21049_00748_FDCR.zip",
     "path/002_AC_HELICOPTEROS-MARINOS_20230329_205916_145T3_21049_00747_VMS_UMS.zip",
     "path/002_AC_NOLAS_20230326_235504_145T2_20160_06473_FDCR.zip"]
    
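    # groupby() only merges consecutive items, so sort by the same key first.
    data.sort(key=extract_features)
    
    # Collect each group of paths into its own list.
    output = [list(g) for _, g in groupby(data, key=extract_features)]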
    
    print(output)
    

    Output:

    [['path/002_AC_HELICOPTEROS-MARINOS_20230329_205916_145T3_21049_00747_VMS_UMS.zip'],
    ['path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip', 'path/002_AC_HELICOPTEROS-MARINOS_20230329_211105_145T3_21049_00748_FDCR.zip'],
    ['path/002_AC_NOLAS_20230326_234440_145T2_20160_06473_VMS_UMS.zip', 'path/002_AC_NOLAS_20230326_235504_145T2_20160_06473_FDCR.zip']]
    

    For example, for path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip the sorting (and grouping) key would be ('HELICOPTEROS-MARINOS', '20230329', '21049', '00748').
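
If sorting the input is undesirable (for instance, because the original order matters or the list should not be modified in place), the same key can be used with a dictionary instead of itertools.groupby(). The following is only a minimal sketch under the same file name assumptions as above; group_by_key is a hypothetical helper name, not part of the original answer. Groups come out in the order in which their key first appears in the input.

```python
from collections import defaultdict

def extract_features(path):
    # Same key as above: fields 2, 3, 6 and 7 of the '_'-separated file name.
    parts = path.split('/')[-1].split('_')
    return (parts[2], parts[3], parts[6], parts[7])

def group_by_key(paths, key=extract_features):
    # Hypothetical helper: bucket each path under its key.
    # Dicts keep insertion order (Python 3.7+), so groups appear
    # in the order their key is first seen; the input is not modified.
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return list(groups.values())

data = ["path/002_AC_HELICOPTEROS-MARINOS_20230329_210358_145T3_21049_00748_DMAU.zip",
        "path/002_AC_NOLAS_20230326_234440_145T2_20160_06473_VMS_UMS.zip",
        "path/002_AC_HELICOPTEROS-MARINOS_20230329_211105_145T3_21049_00748_FDCR.zip",
        "path/002_AC_HELICOPTEROS-MARINOS_20230329_205916_145T3_21049_00747_VMS_UMS.zip",
        "path/002_AC_NOLAS_20230326_235504_145T2_20160_06473_FDCR.zip"]

grouped = group_by_key(data)
print(grouped)
```

This makes a single pass over the list and skips the sort, which may matter when the list is huge. If the groups must come out in sorted key order, the groupby() version above is the simpler fit.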