
Regular Expression for finding filenames with certain parameters but not others


First time poster. Not a programmer, but I dabble where it intersects with my work as a composer and sound designer. I apologize if this is unnecessarily detailed, but it's the best way I can explain it.

I've been working on reorganizing my massive collection of music loops and sound effects (nearly 100k files), which includes renaming files more consistently. I've done a lot of the work already using a file-search app and a renaming app, but neither accepts regular expressions. I've since found apps that do, and I've been able to improve my turnaround time, but I know it can be improved further.

My current task is to find files that contain a range of strings preceded by, and followed by, a combination of spaces, as well as ones that are preceded by only a single space and are at the end of the filename, before the file extension.

Filenames I am looking for contain:

A string containing an upper or lowercase letter A-G, possibly followed by a lowercase b or # or m or an uppercase M
String is also preceded by one space and followed by another space, but *not* by two
Also find results where string is preceded by two spaces, but followed by only one
Also find the reverse: where string is preceded by one space, but followed by two
Also find results where string is at the end of the filename (before the file extension) and preceded by one space, but not two.

Examples:

Strings can be things like: Am, F#, EM, Db, G
Closer Kit Am 140bpm.wav  (1spaceAm1space)
EXCLUDING Closer Kit  F#  140bpm.wav (2spacesF#2spaces)
Closer Kit  EM 140bpm.wav  (2spacesEM1space)
Closer Kit Db  140bpm.wav  (1spacesDb2spaces)
Closer Kit  140bpm G.wav  (spaceGfileextension)

I started with "search for files containing a space, an uppercase or lowercase letter A-G possibly followed by a lowercase b or m or # or an uppercase M, and a space":

\s[A-Ga-g][b|#|m|M]\s

This resulted, as I expected, in retrieving files like:

01 Brass Swell  Cm  60bpm.wav  (2spacesCm2spaces)
01 Evolving Pads  Below the City C 60bpm-24b.wav  (1spaceC1space)

But also returned:

125bpm  F# Vocal Chops.wav  (2spacesF#1space)
Brass  140bpm C.wav  (spaceCfileextension)

Where it started to fall apart was when I updated it to indicate one space on either side of the desired string, but not 2:

\s(?!\s{2})[A-G][b|#|m](?!\s{2})\s

This resulted in returning files with any number of spaces on either side of the string, including one and two, as well as some with the desired string at the end, before the file extension:

07 Woodwind—Clarinet  C  60bpm.wav  (2spacesC2spaces)
128bpm  Bass Am 1.wav  (1spaceAm1space)
85 Pad Flutter 1 Am.wav  (1spaceAmfileextension)
125bpm  F# Pad.wav  (2spacesF#1space)

I'm assuming my problem has to do with how I'm expressing the EXCLUDE part of the formula? Or am I missing the use of some /'s or \'s? The search app will let me stack search parameters, which might be the best answer, but it only offers "match regex" and not "exclude regex", so I'm unsure how to deal with that.

Suggestions?

(I've looked at the discussions suggested by the algorithm, but they don't seem applicable.)


Solution

  • It's not clear what the expected outputs are. It seems you want to output the entire line and also capture the key it contains (F#, Am, etc.):

    ^[^\r\n]*\s+\b((?:[A-Ga-g])[Mmb#])*\s+[\w]+\.(?:wav|mp[3-4])[^\r\n]*$
    

    Code:

    import re
    
    s = """
    Closer Kit Am 140bpm.wav  (1spaceAm1space)
    EXCLUDING Closer Kit  F#  140bpm.wav (2spacesF#2spaces)
    EXCLUDING Closer Kit  F#  140bpm.mp4 (2spacesF#2spaces)
    EXCLUDING Closer Kit  F#  140bpm.mp3 (2spacesF#2spaces)
    Closer Kit  EM 140bpm.wav  (2spacesEM1space)
    Closer Kit Db  140bpm.wav  (1spacesDb2spaces)
    Closer Kit  140bpm G.wav  (spaceGfileextension)
    01 Brass Swell  Cm  60bpm.wav  (2spacesCm2spaces)
    01 Evolving Pads  Below the City C 60bpm-24b.wav  (1spaceC1space)
    125bpm  F# Vocal Chops.wav  (2spacesF#1space)
    Brass  140bpm C.wav  (spaceCfileextension)
    """
    
    p = r"(?m)^([^\r\n]*\s+\b((?:[A-Ga-g])[Mmb#])*\s+[\w]+\.(?:wav|mp[3-4])[^\r\n]*)$"
    
    find_patterns = re.findall(p, s)
    print(find_patterns)
    
    

    Prints

    [('Closer Kit Am 140bpm.wav  (1spaceAm1space)', 'Am'), ('EXCLUDING Closer Kit  F#  140bpm.wav (2spacesF#2spaces)', 'F#'), ('EXCLUDING Closer Kit  F#  140bpm.mp4 (2spacesF#2spaces)', 'F#'), ('EXCLUDING Closer Kit  F#  140bpm.mp3 (2spacesF#2spaces)', 'F#'), ('Closer Kit  EM 140bpm.wav  (2spacesEM1space)', 'EM'), ('Closer Kit Db  140bpm.wav  (1spacesDb2spaces)', 'Db'), ('01 Brass Swell  Cm  60bpm.wav  (2spacesCm2spaces)', 'Cm')]

    Note

    You can adjust this pattern to suit your requirements:

    (?m)^([^\r\n]*\s+\b((?:[A-Ga-g])[Mmb#])*\s+[\w]+\.(?:wav|mp[3-4])[^\r\n]*)$
    
    • (?m) is the multiline flag; it can be removed if you're dealing with single lines.

    • [\w]+\.(?:wav|mp[3-4]) narrows the search to a file name followed by a known extension; it can be removed.

    • \s+\b((?:[A-Ga-g])[Mmb#]) is the part that captures the key (Am, F#, etc.).
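
    As a rough illustration of the first bullet: when each filename is checked on its own, the same pattern can be used without (?m), and the key read from the second capture group. The helper name extract_key is made up for this sketch and is not part of the original code.

```python
import re

# Same pattern as above, minus the (?m) flag, because every input is one line.
PATTERN = re.compile(
    r"^([^\r\n]*\s+\b((?:[A-Ga-g])[Mmb#])*\s+[\w]+\.(?:wav|mp[3-4])[^\r\n]*)$"
)

def extract_key(filename):
    """Return the captured key (e.g. 'Am'), or None when nothing matches.

    extract_key is a hypothetical helper used only for illustration.
    """
    m = PATTERN.search(filename)
    return m.group(2) if m else None

for name in ["Closer Kit Am 140bpm.wav",
             "Closer Kit  F#  140bpm.wav",
             "Brass  140bpm C.wav"]:
    print(repr(name), "->", extract_key(name))
```

    Note that this pattern does not distinguish between one and two spaces around the key, so the second filename still matches.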


    You can loosen or tighten the pattern if you want. A looser pattern is:

    ^([^\r\n]*\s+\b((?:[A-Ga-g])[Mmb#]?)*\s+[^\r\n\s]+\.(?:wav|mp[3-4])[^\r\n]*)$
    

    Code

    import re
    
    s = """
    Closer Kit Am 140bpm.wav  (1spaceAm1space)
    EXCLUDING Closer Kit  F#  140bpm.wav (2spacesF#2spaces)
    EXCLUDING Closer Kit  F#  140bpm.mp4 (2spacesF#2spaces)
    EXCLUDING Closer Kit  F#  140bpm.mp3 (2spacesF#2spaces)
    Closer Kit  EM 140bpm.wav  (2spacesEM1space)
    Closer Kit Db  140bpm.wav  (1spacesDb2spaces)
    Closer Kit  140bpm G.wav  (spaceGfileextension)
    01 Brass Swell  Cm  60bpm.wav  (2spacesCm2spaces)
    01 Evolving Pads  Below the City C 60bpm-24b.wav  (1spaceC1space)
    125bpm  F# Vocal Chops.wav  (2spacesF#1space)
    Brass  140bpm C.wav  (spaceCfileextension)
    """
    
    p = r"(?m)^([^\r\n]*\s+\b((?:[A-Ga-g])[Mmb#]?)*\s+[^\r\n\s]+\.(?:wav|mp[3-4])[^\r\n]*)$"
    
    find_patterns = re.findall(p, s)
    print(find_patterns)
    
    

    Prints

    [('Closer Kit Am 140bpm.wav  (1spaceAm1space)', 'Am'), ('EXCLUDING Closer Kit  F#  140bpm.wav (2spacesF#2spaces)', 'F#'), ('EXCLUDING Closer Kit  F#  140bpm.mp4 (2spacesF#2spaces)', 'F#'), ('EXCLUDING Closer Kit  F#  140bpm.mp3 (2spacesF#2spaces)', 'F#'), ('Closer Kit  EM 140bpm.wav  (2spacesEM1space)', 'EM'), ('Closer Kit Db  140bpm.wav  (1spacesDb2spaces)', 'Db'), ('01 Brass Swell  Cm  60bpm.wav  (2spacesCm2spaces)', 'Cm'), ('01 Evolving Pads  Below the City C 60bpm-24b.wav  (1spaceC1space)', 'C')]

    Note

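    • [Mmb#]? makes the second character of the key optional, so a bare C or G also matches.
    • [^\r\n\s]+ is a looser pattern for the audio file name; it accepts any non-whitespace characters, including hyphens.
    • (?:wav|mp[3-4]) is where you can add any other file extensions that you may have.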
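
    If the goal is the one described in the question (keep keys with one space on at least one side, skip keys with two spaces on both sides, and accept a key with one space before it right ahead of the extension), a lookaround-based sketch might look like the following. Two observations behind it: inside a character class such as [b|#|m|M] the | is a literal pipe, not "or"; and a lookahead like (?!\s{2}) only inspects what comes after its position, so checking the spaces before the key needs a lookbehind. The names KEY, KEY_RE and find_key are made up for this sketch, three or more spaces are assumed to be unwanted, and the search app would need to support lookbehind for the pattern to carry over.

```python
import re

# A note letter plus an optional b, #, m or M (assumption: no other suffixes).
KEY = r"[A-Ga-g][b#mM]?"

ALTERNATIVES = [
    rf"(?<! ) {KEY} {{1,2}}(?! )",    # exactly one space before, one or two after
    rf"(?<! )  {KEY} (?! )",          # exactly two spaces before, exactly one after
    rf"(?<! ) {KEY}(?=\.[^.\s]+$)",   # exactly one space before, then the extension
]
KEY_RE = re.compile("|".join(ALTERNATIVES))

def find_key(filename):
    """Return the key if the filename meets the spacing rules, else None.

    find_key is a hypothetical helper used only for illustration.
    """
    m = KEY_RE.search(filename)
    return m.group(0).strip() if m else None

for name in ["Closer Kit Am 140bpm.wav",    # one space on each side
             "Closer Kit  F#  140bpm.wav",  # two spaces on each side
             "Closer Kit  EM 140bpm.wav",   # two before, one after
             "Closer Kit Db  140bpm.wav",   # one before, two after
             "Closer Kit  140bpm G.wav"]:   # one before, then the extension
    print(repr(name), "->", find_key(name))
```

    Only the second filename (two spaces on both sides) should come back as None. Since the app offers "match regex" but no "exclude regex", folding the exclusion into the lookarounds avoids the need for a second, negative search.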