python regex text-mining regular-language

re.search() in python goes into an infinite loop


I'm trying to extract file paths (Windows/Ubuntu, relative/absolute) from a text document.

The regular expression code below is used to check whether a word is a file path.

It works in most cases but fails in one, where it appears to go into an infinite loop. Is there an explanation for this?

import re
path_regex = re.compile(r'^([\.]*)([/]+)(((?![<>:"/\\|?*]).)+((?<![ .])(\\|/))?)*$' , re.I)
text = '/var/lib/jenkins/jobs/abcd-deploy-test-environment-oneccc/workspace/../workspace/abcd-deploy-test-environment.sh'
path_regex.search(text)
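
The hang is easier to study on a shorter input. The sketch below is a rough reproduction, not part of the original post: it assumes hypothetical probe strings made of a slash, n letters, and a forbidden `<` so that the match must fail, and it times the same pattern on each. The helper name `time_search` is made up for illustration.

```python
import re
import time

# Pattern copied unchanged from the code above.
path_regex = re.compile(
    r'^([\.]*)([/]+)(((?![<>:"/\\|?*]).)+((?<![ .])(\\|/))?)*$', re.I)

def time_search(n):
    """Time one search on a hypothetical probe: '/' + n letters + '<'."""
    probe = '/' + 'a' * n + '<'   # '<' is a forbidden character, so no match is possible
    start = time.perf_counter()
    result = path_regex.search(probe)
    return result, time.perf_counter() - start

# Keep n small: the running time grows very quickly with the probe length.
for n in range(10, 19, 2):
    result, elapsed = time_search(n)
    print(n, result, f'{elapsed:.4f}s')
```

Every probe should report no match, and the elapsed time should climb steeply as n grows. If so, the search on the long path above is most likely still running rather than looping forever.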

Solution

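  • Indeed there is a problem.
    You have overlapping subexpressions mixed with spurious quantifiers: the inner (...)+ sits inside the outer (...)*, so a failing match backtracks exponentially rather than looping forever.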

    Modified to require a character between slashes, it is easily fixed using this:
    ^([\.]*)([/]+)((?:[^<>:"/\\|?*.\r\n]|\.(?![\\/]))[\\/]?)*$

    The idea is to see just what you're guarding against.
    The guard is that you allow a forward or back slash only if it is not preceded by a dot.

    So you have to include the dot in the exclusion class along with \ and /,
    then qualify it in a separate alternation.

    If you do it this way, the search always completes quickly, because each character can be matched in only one way.

     ^ 
     ( [\.]* )                     # (1)
     ( [/]+ )                      # (2)
     (                             # (3 start)
          (?:                           # Group start (required between slashes)
               [^<>:"/\\|?*.\r\n]            # Any character, but exclude these
            |                              # or,
               \.                            # The dot, if not followed by forward or back slash
               (?! [\\/] )
          )                             # Group end
          [\\/]?                        # Optional forward or back slash
     )*                            # (3 end)
     $
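
The expanded pattern above can be used almost verbatim in Python by compiling it with re.VERBOSE. The following is a minimal sketch under that assumption; the names `fixed_regex`, `question_text`, and `plain_path` are illustrative, and `plain_path` is a made-up path with no `..` segment.

```python
import re

# The pattern from the answer in verbose form. re.VERBOSE ignores whitespace
# and '#' comments that appear outside character classes.
fixed_regex = re.compile(r'''
    ^
    ([\.]*)                        # (1) optional leading dots
    ([/]+)                         # (2) one or more forward slashes
    (                              # (3) repeated: one character, then an optional slash
        (?:
            [^<>:"/\\|?*.\r\n]     # any character except these
          |
            \.(?![\\/])            # a dot, if not followed by a forward or back slash
        )
        [\\/]?                     # optional forward or back slash
    )*
    $
''', re.VERBOSE)

# The path from the question, split across two literals for readability.
question_text = ('/var/lib/jenkins/jobs/abcd-deploy-test-environment-oneccc/'
                 'workspace/../workspace/abcd-deploy-test-environment.sh')
plain_path = '/var/lib/jenkins/workspace/deploy.sh'   # hypothetical, no '..' segment

print(fixed_regex.search(question_text))      # completes immediately
print(bool(fixed_regex.search(plain_path)))
```

The search on the question's text returns None without delay. The `../` segment puts a slash directly after a dot, which the guard rejects. The hypothetical `plain_path` matches.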