
Regex With Lookahead For Fixed Length String


import re

strings = [
    r"C:\Photos\Selfies\1|",
    r"C:\HDPhotos\Landscapes\2|",
    r"C:\Filters\Pics\12345678|",
    r"C:\Filters\Pics2\00000000|",
    r"C:\Filters\Pics2\00000000|XAV7"
    ]
    
for string in strings:
    matchptrn = re.match(r"(?P<file_path>.*)(?!\d{8})", string)
    if matchptrn:
        print("FILE PATH = "+matchptrn.group('file_path'))

I am trying to get this regular expression with a lookahead to work the way I thought it would. Examples of lookaheads on most websites seem to be pretty basic string matches, e.g. not matching 'bar' if it is preceded by 'foo' as an example of a negative lookbehind.

My goal is to capture the actual file path in the group file_path only if the string does NOT have an 8-digit number just before the pipe symbol |, and to match anything after the pipe symbol in another group (something I haven't implemented here).

So in the above example it should match only the first two strings

C:\Photos\Selfies\1
C:\HDPhotos\Landscapes\2

In case of the last string

C:\Filters\Pics2\00000000|XAV7

I'd like to match C:\Filters\Pics2\00000000 in <file_path> and match XAV7 in another named group.
(This is something I can figure out on my own if I get some help with the negative lookahead.)

Currently <file_path> matches everything, which makes sense since (.*) is greedy. I want it to capture only if the last part of the string before the pipe symbol is NOT an 8-digit number.

OUTPUT OF CODE SNIPPET PASTED BELOW

FILE PATH = C:\Photos\Selfies\1|
FILE PATH = C:\HDPhotos\Landscapes\2|
FILE PATH = C:\Filters\Pics\12345678|
FILE PATH = C:\Filters\Pics2\00000000|
FILE PATH = C:\Filters\Pics2\00000000|XAV7

Adding \\ in front of the lookahead

matchptrn = re.match(r"(?P<file_path>.*)\\(?!\d{8})", string)
if matchptrn:
    print("FILE PATH = "+matchptrn.group('file_path'))

makes things worse as the output is

FILE PATH = C:\Photos\Selfies
FILE PATH = C:\HDPhotos\Landscapes
FILE PATH = C:\Filters
FILE PATH = C:\Filters
FILE PATH = C:\Filters

Can someone please explain this as well?


Solution

  • You can use

    ^(?!.*\\\d{8}\|$)(?P<file_path>.*)\|(?P<suffix>.*)
    

    See the regex demo.

    Details

    • ^ - start of a string
    • (?!.*\\\d{8}\|$) - fail the match if the string contains \ followed by eight digits and then | at the end of the string
    • (?P<file_path>.*) - Group "file_path": any zero or more chars other than line break chars as many as possible
    • \| - a pipe
    • (?P<suffix>.*) - Group "suffix": the rest of the string, any zero or more chars other than line break chars, as many as possible.
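
The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace inside the pattern and allows # comments. This is only a sketch of one way to lay it out; the name PATTERN is an illustrative choice, not something from the question.

```python
import re

# PATTERN is a hypothetical name used only for this illustration.
PATTERN = re.compile(r"""
    ^                     # start of string
    (?!.*\\\d{8}\|$)      # reject: backslash, eight digits, pipe at end of string
    (?P<file_path>.*)     # group file_path: as many chars as possible
    \|                    # a literal pipe
    (?P<suffix>.*)        # group suffix: the rest of the string
""", re.VERBOSE)

m = PATTERN.match(r"C:\Filters\Pics2\00000000|XAV7")
if m:
    print(m.group("file_path"), m.group("suffix"))

# Eight digits followed by a pipe at the very end: no match at all.
print(PATTERN.match(r"C:\Filters\Pics\12345678|"))
```

The verbose form behaves the same as the one-line pattern; it just trades compactness for readability.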

    See the Python demo:

    import re
    strings = [
        r"C:\Photos\Selfies\1|",
        r"C:\HDPhotos\Landscapes\2|",
        r"C:\Filters\Pics\12345678|",
        r"C:\Filters\Pics2\00000000|",
        r"C:\Filters\Pics2\00000000|XAV7"
        ]
        
    for string in strings:
        matchptrn = re.match(r"(?!.*\\\d{8}\|$)(?P<file_path>.*)\|(?P<suffix>.*)", string)
        if matchptrn:
            print("FILE PATH = {}, SUFFIX = {}".format(*matchptrn.groups()))
    

    Output:

    FILE PATH = C:\Photos\Selfies\1, SUFFIX = 
    FILE PATH = C:\HDPhotos\Landscapes\2, SUFFIX = 
    FILE PATH = C:\Filters\Pics2\00000000, SUFFIX = XAV7
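
As for why the two attempts in the question behave the way they do, here is a rough sketch of what the regex engine appears to be doing. In the first pattern, the greedy .* consumes the whole string before the lookahead is ever tested, so (?!\d{8}) is checked at the end of the string, where nothing follows, and it always succeeds. In the second pattern, the engine backtracks from the right to find a \ that is not followed by eight digits, so it skips the last backslash and settles on an earlier one. The variable names below are illustrative only.

```python
import re

path = r"C:\Filters\Pics\12345678|"

# Attempt 1: .* grabs everything, then the negative lookahead is tested
# at the end of the string, where no digits follow, so it passes.
first = re.match(r"(?P<file_path>.*)(?!\d{8})", path)
print(first.group("file_path"))   # C:\Filters\Pics\12345678|

# Attempt 2: .* backtracks to the last backslash, but that one is followed
# by eight digits, so the lookahead fails there. The engine keeps
# backtracking and accepts the previous backslash (the one before "Pics").
second = re.match(r"(?P<file_path>.*)\\(?!\d{8})", path)
print(second.group("file_path"))  # C:\Filters
```

In both cases the lookahead only constrains the single position where it is evaluated, which is why the working pattern puts it at the start of the string and lets it scan ahead with .* on its own.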