
Any way to split on whitespace but avoid splitting inside file paths?


I am trying to separate this string into a list using regex:

-y -hwaccel cuda -threads 8 -loglevel error -hide_banner -stats -i - -c:v hevc_nvenc -rc constqp -preset p7 -qp 18 C:\Users\User\Documents\Python\Smoothie\test 124\Resampled_vid.mp4

I am using the following method to separate it:

split(r'(?!\\)'+'\s+',f"{Settings[1]}".format(Input=InFile,Output=OutFile))

Output:
['-y', '-hwaccel', 'cuda', '-threads', '8', '-loglevel', 'error', '-hide_banner', '-stats', '-i', '-', '-c:v', 'hevc_nvenc', '-rc', 'constqp', '-preset', 'p7', '-qp', '18', 'C:\\Users\\User\\Documents\\Python\\Smoothie\\test', '124\\Resampled_vid.mp4']

Desired Output:

['-y', '-hwaccel', 'cuda', '-threads', '8', '-loglevel', 'error', '-hide_banner', '-stats', '-i', '-', '-c:v', 'hevc_nvenc', '-rc', 'constqp', '-preset', 'p7', '-qp', '18', 'C:\\Users\\User\\Documents\\Python\\Smoothie\\test 124\\Resampled_vid.mp4']

Is there any way I can avoid splitting inside a file path?


Solution

  • I would use an re.findall approach here:

    import re

    # Raw string, so backslashes in the path (e.g. \U, \t) are not treated as escapes
    inp = r"-y -hwaccel cuda -threads 8 -loglevel error -hide_banner -stats -i - -c:v hevc_nvenc -rc constqp -preset p7 -qp 18 C:\Users\User\Documents\Python\Smoothie\test 124\Resampled_vid.mp4"
    parts = re.findall(r'[A-Z]+:(?:\\[^\\]+)+\.\w+|\S+', inp)
    print(parts)
    
    ['-y', '-hwaccel', 'cuda', '-threads', '8', '-loglevel', 'error', '-hide_banner',
     '-stats', '-i', '-', '-c:v', 'hevc_nvenc', '-rc', 'constqp', '-preset', 'p7',
     '-qp', '18',
     'C:\\Users\\User\\Documents\\Python\\Smoothie\\test 124\\Resampled_vid.mp4']
    

    The regex pattern used here matches either of two alternatives:

    [A-Z]+:(?:\\[^\\]+)+\.\w+  a file path
    |                          OR
    \S+                        any group of non whitespace characters
    

    The trick here is to try to match a file path first. Only if that fails do we fall back to matching one whitespace-delimited word/term at a time.
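
To make the ordering point concrete, here is a minimal sketch that writes the same pattern with re.VERBOSE so each piece carries a comment, and compares it with a variant where the two alternatives are swapped. The names TOKEN, SWAPPED and args are only illustrative, and the input is a shortened version of the command line from the question.

```python
import re

# Same pattern as in the answer, spread over several lines with re.VERBOSE.
TOKEN = re.compile(r"""
    [A-Z]+:            # drive letter(s) followed by a colon, e.g. C:
    (?:\\[^\\]+)+      # one or more backslash-led segments; a segment may contain spaces
    \.\w+              # file extension, e.g. .mp4
  |                    # OR
    \S+                # any run of non-whitespace characters
""", re.VERBOSE)

# Hypothetical variant with the alternatives swapped, for comparison only.
SWAPPED = re.compile(r"\S+|[A-Z]+:(?:\\[^\\]+)+\.\w+")

# Shortened, illustrative input (raw string so the backslashes stay literal).
args = r"-i - -qp 18 C:\Users\User\test 124\Resampled_vid.mp4"

print(TOKEN.findall(args))    # path alternative tried first: the path stays in one piece
print(SWAPPED.findall(args))  # \S+ tried first: the path is split at the space
```

With the path alternative first, the path comes back as a single item. With `\S+` first, it always wins at every position, so the path is split at the space exactly as in the question. One caveat: because a path segment is allowed to contain spaces, the path alternative can swallow a following argument if that argument also ends in something like `.mp4`, so this approach is safest when the path is the last item in the string.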