
Sorting and Matching a Python list


I recently asked a similar question but need to go a little deeper.

Essentially, I am reading a directory of files and appending each path to a list called filelistname.

I am trying to sort this list by the disk count (-#disk-) and then run a function against the sorted list.

Thanks for your help.


Here is an example -

 In []: filelistname
Out []: ['C:\Test3\ARRAY05-2NODE-RAID1-12disk-128k-0-segmented.xlsx',
         'C:\Test1\ARRAY05-2NODE-RAID1-17disk-128k-0-segmented.xlsx',
         'C:\Test4\ARRAY05-2NODE-RAID1-25disk-128k-0-segmented.xlsx',
         'C:\Test2\ARRAY05-2NODE-RAID1-18disk-128k-0-segmented.xlsx',
         'C:\Test1\ARRAY05-2NODE-RAID1-12disk-32k-0-segmented.xlsx',
         'C:\Test6\ARRAY05-2NODE-RAID1-25disk-32k-0-segmented.xlsx',
         'C:\Test2\ARRAY05-2NODE-RAID1-12disk-64k-0-segmented.xlsx',
         'C:\Test5\ARRAY05-2NODE-RAID1-12disk-64k-100-segmented.xlsx']

An output for this would look something like this.

A group

  C:\Test3\ARRAY05-2NODE-RAID1-12disk-128k-0-segmented.xlsx
  C:\Test1\ARRAY05-2NODE-RAID1-17disk-128k-0-segmented.xlsx
  C:\Test2\ARRAY05-2NODE-RAID1-18disk-128k-0-segmented.xlsx

Another group

  C:\Test4\ARRAY05-4NODE-RAID1-25disk-128k-0-segmented.xlsx

Another group

  C:\Test1\ARRAY05-2NODE-RAID1-12disk-32k-0-segmented.xlsx
  C:\Test6\ARRAY05-2NODE-RAID1-25disk-32k-0-segmented.xlsx

Another group

  C:\Test2\ARRAY05-2NODE-RAID1-12disk-64k-0-segmented.xlsx

Another group

  C:\Test5\ARRAY05-2NODE-RAID1-12disk-64k-100-segmented.xlsx

I'm currently experimenting with this, but I'm having trouble identifying the correct key.

import os
from itertools import groupby
from collections import defaultdict

key_fn = lambda s: s.rsplit('-', 4)[0]

# groupby only merges adjacent items, so sort by the same key first
filelistname = sorted(filelistname, key=key_fn)

for key, grouped_file_names in groupby(filelistname, key=key_fn):
    print(key)  # key only exists inside the loop
    print('\n'.join(grouped_file_names))
    print("")

Solution

  • You seem to be grouping by the \d+k-\d+ part of the name (for example 128k-0), so split each name from the right and use those fields as the key:

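    from collections import defaultdict
    d = defaultdict(list)
    
    # filelistname is the list of paths from the question
    for sub in filelistname:
        # split from the right: [..., '128k', '0', 'segmented.xlsx']
        spl = sub.rsplit("-", 3)
        k = spl[-3], spl[-2]
        d[k].append(sub)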
    

    Output:

    from pprint import pprint as pp
    
    pp(d)
    
    { ('128k', '0'): [ 'C:\\Test3\\ARRAY05-2NODE-RAID1-12disk-128k-0-segmented.xlsx',
                       'C:\\Test1\\ARRAY05-2NODE-RAID1-17disk-128k-0-segmented.xlsx',
                       'C:\\Test4\\ARRAY05-2NODE-RAID1-25disk-128k-0-segmented.xlsx',
                       'C:\\Test2\\ARRAY05-2NODE-RAID1-18disk-128k-0-segmented.xlsx'],
      ('32k', '0'): [ 'C:\\Test1\\ARRAY05-2NODE-RAID1-12disk-32k-0-segmented.xlsx',
                      'C:\\Test6\\ARRAY05-2NODE-RAID1-25disk-32k-0-segmented.xlsx'],
      ('64k', '0'): ['C:\\Test2\\ARRAY05-2NODE-RAID1-12disk-64k-0-segmented.xlsx'],
      ('64k', '100'): [ 'C:\\Test5\\ARRAY05-2NODE-RAID1-12disk-64k-100-segmented.xlsx']}
    
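The key in the question, s.rsplit('-', 4)[0], keeps the directory (for example C:\Test3\ARRAY05-2NODE-RAID1), so files in different folders never share a key. As a rough sketch, the same grouping can also be done with itertools.groupby, as long as the list is sorted by the same key first. The helper name size_key and the shortened sample list are assumptions made for illustration only.

```python
from itertools import groupby
from ntpath import basename  # splits Windows-style paths on any OS

# A few of the sample paths from the question (raw strings keep backslashes literal).
filelistname = [
    r'C:\Test3\ARRAY05-2NODE-RAID1-12disk-128k-0-segmented.xlsx',
    r'C:\Test1\ARRAY05-2NODE-RAID1-17disk-128k-0-segmented.xlsx',
    r'C:\Test4\ARRAY05-2NODE-RAID1-25disk-128k-0-segmented.xlsx',
    r'C:\Test1\ARRAY05-2NODE-RAID1-12disk-32k-0-segmented.xlsx',
    r'C:\Test5\ARRAY05-2NODE-RAID1-12disk-64k-100-segmented.xlsx',
]


def size_key(path):
    """Hypothetical helper: return the two fields before 'segmented', e.g. ('128k', '0')."""
    fields = basename(path).rsplit('-', 3)
    return fields[-3], fields[-2]


# groupby only merges adjacent items, so sort by the same key first.
ordered = sorted(filelistname, key=size_key)
groups = {key: list(names) for key, names in groupby(ordered, key=size_key)}

for key, names in groups.items():
    print(key)
    print('\n'.join(names))
    print()
```

The defaultdict version above avoids the extra sort and keeps the original file order within each group, which is probably why it is the simpler choice here.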

    If you want the array and node prefix in the key as well, leaving out the RAID level and disk count:

    from collections import defaultdict
    from ntpath import basename  # handles Windows-style paths on any OS
    d = defaultdict(list)
    
    for sub in filelistname:
        # ['ARRAY05-2NODE', 'RAID1', '12disk', '128k', '0', 'segmented.xlsx']
        spl = basename(sub).rsplit("-", 5)
        k = spl[0] + "-" + "-".join(spl[3:5])
        d[k].append(sub)
    

    Output:

    {'ARRAY05-2NODE-128k-0': ['C:\\Test3\\ARRAY05-2NODE-RAID1-12disk-128k-0-segmented.xlsx',
                              'C:\\Test1\\ARRAY05-2NODE-RAID1-17disk-128k-0-segmented.xlsx',
                              'C:\\Test4\\ARRAY05-2NODE-RAID1-25disk-128k-0-segmented.xlsx',
                              'C:\\Test2\\ARRAY05-2NODE-RAID1-18disk-128k-0-segmented.xlsx'],
     'ARRAY05-2NODE-32k-0': ['C:\\Test1\\ARRAY05-2NODE-RAID1-12disk-32k-0-segmented.xlsx',
                             'C:\\Test6\\ARRAY05-2NODE-RAID1-25disk-32k-0-segmented.xlsx'],
     'ARRAY05-2NODE-64k-0': ['C:\\Test2\\ARRAY05-2NODE-RAID1-12disk-64k-0-segmented.xlsx'],
     'ARRAY05-2NODE-64k-100': ['C:\\Test5\\ARRAY05-2NODE-RAID1-12disk-64k-100-segmented.xlsx']}
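
The question also asks for the files to be ordered by disk count before a function is run on them. A minimal sketch of one way to do that within each group is shown here. It assumes the count always appears as -<digits>disk- in the file name; disk_count and process are hypothetical names, and process is only a placeholder for whatever function you actually want to call.

```python
import re
from collections import defaultdict
from ntpath import basename  # splits Windows-style paths on any OS

filelistname = [
    r'C:\Test3\ARRAY05-2NODE-RAID1-12disk-128k-0-segmented.xlsx',
    r'C:\Test1\ARRAY05-2NODE-RAID1-17disk-128k-0-segmented.xlsx',
    r'C:\Test4\ARRAY05-2NODE-RAID1-25disk-128k-0-segmented.xlsx',
    r'C:\Test2\ARRAY05-2NODE-RAID1-18disk-128k-0-segmented.xlsx',
    r'C:\Test1\ARRAY05-2NODE-RAID1-12disk-32k-0-segmented.xlsx',
    r'C:\Test6\ARRAY05-2NODE-RAID1-25disk-32k-0-segmented.xlsx',
    r'C:\Test2\ARRAY05-2NODE-RAID1-12disk-64k-0-segmented.xlsx',
    r'C:\Test5\ARRAY05-2NODE-RAID1-12disk-64k-100-segmented.xlsx',
]

# Assumption: the disk count is the number in "-<digits>disk-".
DISK_RE = re.compile(r'-(\d+)disk-')


def disk_count(path):
    """Hypothetical helper: numeric disk count, or 0 if the pattern is missing."""
    match = DISK_RE.search(basename(path))
    return int(match.group(1)) if match else 0


def process(path):
    """Placeholder for the real per-file function; here it just returns the file name."""
    return basename(path)


# Group by the two fields before 'segmented', as in the first snippet of the answer.
groups = defaultdict(list)
for path in filelistname:
    fields = basename(path).rsplit('-', 3)
    groups[fields[-3], fields[-2]].append(path)

# Sort each group numerically by disk count, then run the function on each file.
results = {}
for key, paths in groups.items():
    paths.sort(key=disk_count)
    results[key] = [process(p) for p in paths]
```

Converting the count with int matters: sorting the raw strings would put 100disk before 12disk.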