I'm trying to extract file paths (Windows/Ubuntu, relative/absolute) from a text document.
The regular expression below is used to check whether a word is a file path.
It works for most cases but fails for one, where it seems to go into an infinite loop. Any explanation for this?
import re
path_regex = re.compile(r'^([\.]*)([/]+)(((?![<>:"/\\|?*]).)+((?<![ .])(\\|/))?)*$' , re.I)
text = '/var/lib/jenkins/jobs/abcd-deploy-test-environment-oneccc/workspace/../workspace/abcd-deploy-test-environment.sh'
path_regex.search(text)
Indeed there is a problem.
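You have overlapping subexpressions mixed with spurious nested quantifiers. When the overall match fails, the engine backtracks through every way of dividing the text between them, so it is not a true infinite loop, just exponentially slow.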
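To see the blow-up on a small scale, here is a rough sketch that times the question's pattern on short inputs. The inputs are hypothetical, not from the question: a single component of n letters ending in a dot, followed by a slash, so the match has to fail. The engine has to try on the order of 2^n ways of splitting the component before giving up, so the time should roughly double with each extra character. Exact timings depend on the machine.

```python
import re
import time

# The pattern from the question, unchanged.
path_regex = re.compile(r'^([\.]*)([/]+)(((?![<>:"/\\|?*]).)+((?<![ .])(\\|/))?)*$', re.I)

results = []
for n in (8, 12, 16):
    # Hypothetical input: the slash after the trailing dot is rejected by
    # the (?<![ .]) lookbehind, so no split of the component can succeed.
    text = '/' + 'a' * n + './x'
    start = time.perf_counter()
    match = path_regex.search(text)
    elapsed = time.perf_counter() - start
    results.append(match)
    print(f'n={n:2d}  match={match}  {elapsed:.4f}s')
```

The path in the question has a 34-character component in front of the `../`, which is why it never seems to return.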
It is easily fixed using this version, modified to require parts between slashes:
^([\.]*)([/]+)((?:[^<>:"/\\|?*.\r\n]|\.(?![\\/]))[\\/]?)*$
The idea is to see just what you're guarding against.
The guard is that you allow a forward or back slash only if it is not preceded by a dot.
So you have to include the dot in the exclusion class, along with \ and /,
then qualify the dot in a separate alternation.
Done this way, each character can be matched in only one way, so the regex always finishes, whether or not the text matches.
^
( [\.]* )                      # (1)
( [/]+ )                       # (2)
(                              # (3 start)
     (?:                            # Group start (required between slashes)
          [^<>:"/\\|?*.\r\n]             # Any character, but exclude these
       |                              # or,
          \.                             # The dot, if not followed by forward or back slash
          (?! [\\/] )
     )                              # Group end
     [\\/]?                         # Optional forward or back slash
)*                             # (3 end)
$
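As a quick check, the expanded form can be compiled directly in Python with re.VERBOSE. This is a minimal sketch: `fixed_path_regex` and `simple_path` are names and data made up for the example, and the comments are reworded slightly. Note that the path from the question still does not match, because the `..` is followed by a slash, which is exactly what the guard rejects. The difference is that the answer now comes back immediately.

```python
import re

fixed_path_regex = re.compile(r'''
    ^
    ( [\.]* )                    # (1) optional leading dots
    ( [/]+ )                     # (2) one or more forward slashes
    (                            # (3) one character per iteration
        (?:
            [^<>:"/\\|?*.\r\n]   # any character except these
          |
            \. (?! [\\/] )       # a dot, unless a slash follows it
        )
        [\\/]?                   # optional forward or back slash
    )*
    $
''', re.VERBOSE)

# The path from the question: rejected quickly because of the "../".
text = '/var/lib/jenkins/jobs/abcd-deploy-test-environment-oneccc/workspace/../workspace/abcd-deploy-test-environment.sh'
question_result = fixed_path_regex.search(text)
print(question_result)  # None

# A hypothetical path with no component ending in a dot: accepted.
simple_path = '/var/lib/jenkins/workspace/deploy.sh'
simple_result = fixed_path_regex.search(simple_path)
print(simple_result is not None)  # True
```

If paths containing `..` should be accepted, the guard on the dot would need to be relaxed. That is a change in what the pattern means, not a backtracking fix.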