Tags: python, filenames, glob

Inverse glob - reverse engineer a wildcard string from file names


I want to generate a wildcard string from a pair of file names. Kind of an inverse-glob. Example:

file1 = 'some foo file.txt'
file2 = 'some bar file.txt'
assert 'some * file.txt' == inverse_glob(file1, file2)

Use difflib perhaps? Has this been solved already?

The application is a large set of data files with similar names. I want to compare each pair of file names and then present a comparison of the pairs of files with "similar" names. I figure that if I can do a reverse glob on each pair, then the pairs with "good" wildcards (e.g. neither lots*of*stars*.txt nor a bare *) are good candidates for comparison. So I might take the output of this putative inverse_glob() and reject wildcards that have more than one * or for which glob() doesn't match exactly two files.
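The rejection step described above could be sketched roughly as follows. The helper name is_good_wildcard and its max_stars parameter are made up for illustration, and the sketch assumes the candidate file names are already available as a list of strings, so it uses fnmatch.fnmatchcase on those strings instead of calling glob() against the disk.

```python
from fnmatch import fnmatchcase


def is_good_wildcard(pattern, all_names, max_stars=1):
    """Hypothetical filter: keep only patterns specific enough to be useful."""
    # Reject patterns such as lots*of*stars*.txt
    if pattern.count('*') > max_stars:
        return False
    # fnmatchcase applies glob-style matching to plain strings (case-sensitive)
    matches = [name for name in all_names if fnmatchcase(name, pattern)]
    # A useful wildcard selects exactly the pair it was built from
    return len(matches) == 2
```

A bare * passes the star count but matches every name, so it is rejected by the second check whenever the list holds more than two files.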


Solution

  • For instance:

    Filenames:

    names = [('some foo file.txt','some bar file.txt', 'some * file.txt'),
             ("filename.txt", "filename2.txt", "filenam*.txt"),
             ("1filename.txt", "filename2.txt", "*.txt"),
             ("inverse_glob", "inverse_glob2", "inverse_glo*"),
             ("the 24MHz run new.sr", "the 16MHz run old.sr", "the *MHz run *.sr")]
    

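    Function:

    import re

    def inverse_glob(f1, f2, force_single_asterisk=None):
        def adjust_name(pp, diff):
            # Pad the shorter name to the longer one's length: swap the last
            # `diff` characters of the stem for '?' and add `diff` more.
            keep = max(len(pp[0]) - diff, 0)
            padded = pp[0][:keep] + '?' * (len(pp[0]) - keep + diff)
            if len(pp) == 2:
                return padded + '.' + pp[1]
            return padded

        # rsplit keeps everything before the last dot together as the stem
        l1 = len(f1); l2 = len(f2)
        if l1 > l2:
            f2 = adjust_name(f2.rsplit('.', 1), l1 - l2)
        elif l2 > l1:
            f1 = adjust_name(f1.rsplit('.', 1), l2 - l1)

        # Keep characters that agree position by position; '?' elsewhere
        result = ['?' for n in range(len(f1))]
        for i, c in enumerate(f1):
            if c == f2[i]:
                result[i] = c

        result = ''.join(result)
        # Collapse each run of two or more '?' into one '*'
        result = re.sub(r'\?{2,}', '*', result)
        if force_single_asterisk:
            # Merge everything between the first and last '*' into one '*'
            result = re.sub(r'\*.+\*', '*', result)
        return result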
    

    Usage:

    for name in names:
        result = inverse_glob(name[0], name[1])
        print('{:20} <=> {:20} = {}'.format(name[0], name[1], result))
        assert name[2] == result
    

    Output:

    some foo file.txt    <=> some bar file.txt    = some * file.txt  
    filename.txt         <=> filename2.txt        = filenam*.txt  
    1filename.txt        <=> filename2.txt        = *.txt  
    inverse_glob         <=> inverse_glob2        = inverse_glo*
    the 24MHz run new.sr <=> the 16MHz run old.sr = the *MHz run *.sr
    

    Tested with Python 3.4.2.
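Since the question asks about difflib, here is a minimal sketch of that route, offered as an alternative and not as the method used in the answer above. The function name inverse_glob_difflib is made up for illustration. It keeps the blocks that difflib.SequenceMatcher reports as common to both names and puts a single * wherever either name has extra text.

```python
from difflib import SequenceMatcher


def inverse_glob_difflib(f1, f2):
    """Hypothetical variant: build a wildcard from the blocks f1 and f2 share."""
    matcher = SequenceMatcher(None, f1, f2, autojunk=False)
    parts = []
    pos1 = pos2 = 0
    # The last block is always a zero-length sentinel at the end of both names
    for a, b, size in matcher.get_matching_blocks():
        # Text skipped in either name before this block becomes one '*'
        if a > pos1 or b > pos2:
            parts.append('*')
        parts.append(f1[a:a + size])
        pos1, pos2 = a + size, b + size
    return ''.join(parts)
```

Because it aligns on content and not on character position, it handles names of very different lengths, but results can differ slightly from the function above: for filename.txt and filename2.txt it gives filename*.txt, which still matches both names since * can match nothing. Unrelated names that share only scattered characters will produce patterns with many stars, and names containing glob metacharacters such as * or [ are not escaped in this sketch.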