
Return Only Specific Values With Python Script


I am a total Python beginner attempting to write a script that looks for black video and silent audio in a file and returns only the times at which they occur.

I have the following code working with the ffmpeg-python wrapper to capture ffmpeg's output, but I can't figure out an efficient way to parse stdout or stderr to return only the instances of black_start, black_end, black_duration, silence_start, silence_end, silence_duration.

Putting ffmpeg aside for those who are not experts, how can I define a regex for re.findall (or similar) that returns only the above values?

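import ffmpeg

# source is the path to the media file (defined elsewhere)
stream = ffmpeg.input(source)  # renamed from input to avoid shadowing the built-in
video = stream.video.filter('blackdetect', d=0, pix_th=0.00)
audio = stream.audio.filter('silencedetect', d=0.1, n='-60dB')
out = ffmpeg.output(audio, video, 'out.null', format='null')

# communicate() returns a (stdout, stderr) tuple of bytes
run = out.run_async(pipe_stdout=True, pipe_stderr=True)
result = run.communicate()

print(result)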

This produces the ffmpeg output, which contains the results I need. Here is the output (edited for brevity):

(b'', b"ffmpeg version 4.2.2 Copyright (c) 2000-2019 the FFmpeg developers
  built with Apple clang version 11.0.0 (clang-1100.0.33.17)
  configuration: --prefix=/usr/local/Cellar/ffmpeg/4.2.2_3 --enable-shared --enable-pthreads --...
[silencedetect @ 0x7fdd82d011c0] silence_start: 0
frame=  112 fps=0.0 q=-0.0 size=N/A time=00:00:05.00 bitrate=N/A speed=9.96x    
[blackdetect @ 0x7fdd82e06580] black_start:0 black_end:5 black_duration:5
[silencedetect @ 0x7fdd82d011c0] silence_end: 5.06285 | silence_duration: 5.06285
frame=  211 fps=210 q=-0.0 size=N/A time=00:00:09.00 bitrate=N/A speed=8.97x    
frame=  319 fps=212 q=-0.0 size=N/A time=00:00:13.00 bitrate=N/A speed=8.63x    
frame=  427 fps=213 q=-0.0 size=N/A time=00:00:17.08 bitrate=N/A speed=8.51x    
frame=  537 fps=214 q=-0.0 size=N/A time=00:00:22.00 bitrate=N/A speed=8.77x    
frame=  650 fps=216 q=-0.0 size=N/A time=00:00:26.00 bitrate=N/A speed=8.63x    
frame=  761 fps=217 q=-0.0 size=N/A time=00:00:31.00 bitrate=N/A speed=8.82x    
frame=  874 fps=218 q=-0.0 size=N/A time=00:00:35.00 bitrate=N/A speed=8.71x    
frame=  980 fps=217 q=-0.0 size=N/A time=00:00:39.20 bitrate=N/A speed=8.67x    
...  
frame= 5680 fps=213 q=-0.0 size=N/A time=00:03:47.20 bitrate=N/A speed=8.53x    
[silencedetect @ 0x7fdd82d011c0] silence_start: 227.733
[silencedetect @ 0x7fdd82d011c0] silence_end: 229.051 | silence_duration: 1.3184
[silencedetect @ 0x7fdd82d011c0] silence_start: 229.051
[blackdetect @ 0x7fdd82e06580] black_start:229.28 black_end:230.24 black_duration:0.96
frame= 5757 fps=214 q=-0.0 Lsize=N/A time=00:03:50.28 bitrate=N/A speed=8.54x    
video:3013kB audio:43178kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
[silencedetect @ 0x7fdd82d011c0] silence_end: 230.28 | silence_duration: 1.22856
\n")

What is the most efficient way to parse the output data to find/return only those result values so I can build further logic from them in my code? In this case, I would want only the following values returned:

silence_start: 0
silence_end: 5.06285
silence_duration: 5.06285

black_start:0
black_end:5
black_duration:5

silence_start: 227.733
silence_end: 229.051
silence_duration: 1.3184

black_start:229.28
black_end:230.24
black_duration:0.96

silence_start: 229.051
silence_end: 230.28
silence_duration: 1.22856

I've tried a bunch of different re.findall() options with regex, but the closest I've gotten is returning just the names of the values. For example, if I add this to the above:

found = re.findall('\\b' + 'silence_end' + '\\b', str(result))

print(found)

All I get are the names:

['silence_end', 'silence_end', 'silence_end']


Solution

  • You could use two alternations to combine all the possibilities, followed by a match for 1+ digits with an optional dot and 1+ digits:

    \b(?:silence|black)_(?:start|end|duration):\s*\d+(?:\.\d+)?\b
    

    The pattern will match:

    • \b Word boundary
    • (?:silence|black)_ Match silence or black and an underscore
    • (?:start|end|duration):\s* Match start or end or duration, : and 0+ whitespace chars
    • \d+(?:\.\d+)? Match 1+ digits and an optional dot followed by 1+ digits
    • \b Word boundary

    For example

    import re
    test_str = "your string"  # replace with the ffmpeg log text, e.g. str(result)
    regex = r"\b(?:silence|black)_(?:start|end|duration):\s*\d+(?:\.\d+)?\b"
    print(re.findall(regex, test_str))
    

    Output (with the ffmpeg log from the question as test_str)

    ['silence_start: 0', 'black_start:0', 'black_end:5', 'black_duration:5', 'silence_end: 5.06285', 'silence_duration: 5.06285', 'silence_start: 227.733', 'silence_end: 229.051', 'silence_duration: 1.3184', 'silence_start: 229.051', 'black_start:229.28', 'black_end:230.24', 'black_duration:0.96', 'silence_end: 230.28', 'silence_duration: 1.22856']
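
Since the goal is to build further logic on these values, it may be more convenient to get names and numbers separately rather than as combined strings. The following is a minimal sketch, not part of the original answer: it adds two capture groups to the same pattern so that re.findall returns (name, value) tuples, and it decodes the stderr bytes instead of calling str() on the whole tuple. The function name parse_detections and the shortened sample log are illustrative assumptions.

```python
import re

# Same pattern as above, with two capture groups added:
# group 1 is the key name, group 2 is the numeric value.
PATTERN = re.compile(
    r"\b((?:silence|black)_(?:start|end|duration)):\s*(\d+(?:\.\d+)?)\b"
)


def parse_detections(stderr_bytes):
    """Return (name, value) pairs in the order they appear in the log.

    parse_detections is a hypothetical helper name used for illustration.
    """
    text = stderr_bytes.decode("utf-8", errors="replace")
    # With two groups, findall yields one (name, value) tuple per match.
    return [(name, float(value)) for name, value in PATTERN.findall(text)]


# Shortened sample copied from the log in the question.
sample = (
    b"[silencedetect @ 0x7fdd82d011c0] silence_start: 0\n"
    b"frame=  112 fps=0.0 q=-0.0 size=N/A time=00:00:05.00 bitrate=N/A speed=9.96x\n"
    b"[blackdetect @ 0x7fdd82e06580] black_start:0 black_end:5 black_duration:5\n"
    b"[silencedetect @ 0x7fdd82d011c0] silence_end: 5.06285 | silence_duration: 5.06285\n"
)

pairs = parse_detections(sample)
for name, value in pairs:
    print(name, value)
```

With the code from the question, the second element of the tuple returned by communicate() is the stderr bytes, so the call would be parse_detections(result[1]).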