
How to utilize multiprocessing and multithreading efficiently to convert thousands of video files to audio in parallel using Python


I converted video files to audio using the moviepy Python package, and it works perfectly fine. However, I have 1,500 videos of about 100 MB each, and I want to convert all of them to audio files. With the standard one-at-a-time approach this takes a long time.

Code to convert one video file to audio:

import moviepy.editor as mp
clip = mp.VideoFileClip('file.mp4') 
clip.audio.write_audiofile(r"file.mp3")

I can also use threading to convert multiple files at the same time, but I want to utilize both multiprocessing and multithreading to achieve the result most efficiently, in the least time.

Algo using threading only:

clips1...clips10 = make 10 lists of 150 file names each from os.listdir()
spawn 10 threads to process 10 files at a time (one list per thread)

t1 = Thread(target=convert, args=(clips1,))
.
.
.
t10 = Thread(target=convert, args=(clips10,))
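To make the plan concrete, here is a runnable sketch of that threading-only approach. The chunking helper, the `convert` wrapper, and the thread count of 10 are illustrative choices, not part of the original snippet:

```python
import os
from threading import Thread

def chunk_round_robin(items, n):
    """Split items into n roughly equal chunks, round-robin."""
    return [items[i::n] for i in range(n)]

def convert(filenames):
    # moviepy is imported here so the rest of the module loads without it.
    import moviepy.editor as mp
    for name in filenames:
        clip = mp.VideoFileClip(name)
        clip.audio.write_audiofile(os.path.splitext(name)[0] + '.mp3')

def run_in_threads(num_threads=10):
    files = [f for f in os.listdir() if f.endswith('.mp4')]
    threads = [Thread(target=convert, args=(chunk,))
               for chunk in chunk_round_robin(files, num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```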

Any Ideas?


Solution

  • There is one situation where a combination of multithreading and multiprocessing can be advantageous: when each task consists of neatly delineated parts, one of which is primarily I/O-bound (or at least relinquishes the Global Interpreter Lock frequently, allowing other threads to run) while the other is CPU-intensive. An example would be a set of tasks each consisting of two parts: (1) retrieve a piece of information from a website and (2) do some non-trivial calculation using that information. Part 1 is clearly well-suited to multithreading, since after the request for the URL is issued the thread goes into a wait state, allowing other threads to run. If part 2 were a trivial calculation, you would, for simplicity's sake, just compute it within the thread. But since we are saying it is non-trivial, performing the calculation in a separate process, where there is no contention for the Global Interpreter Lock (GIL), is preferable.

    The model for this type of processing is to create both a thread pool and a multiprocessing pool. "Jobs" are submitted to the thread pool's worker function with two arguments: the URL of the website from which the info needs to be retrieved, and the multiprocessing pool. The thread pool worker function first retrieves the needed info from the passed URL and then submits the CPU-intensive calculation on that info to a second worker function, which runs in the passed multiprocessing pool.
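    A minimal sketch of that two-pool model, with a simulated `fetch` standing in for a real network request (an assumption; substitute e.g. urllib.request in real code):

```python
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool

def fetch(url):
    # Stand-in for the I/O-bound part: a real request would block on the
    # network while releasing the GIL, letting other threads run.
    return len(url)  # pretend this is the retrieved payload

def heavy_calculation(data):
    # CPU-bound part: runs in a worker process, outside the GIL.
    return data * data

def thread_worker(url, process_pool):
    data = fetch(url)                                      # I/O in this thread
    return process_pool.apply(heavy_calculation, (data,))  # CPU in a process

def run(urls):
    # One shared multiprocessing pool; each thread hands its CPU work to it.
    with Pool() as process_pool, ThreadPoolExecutor(max_workers=8) as tp:
        futures = [tp.submit(thread_worker, url, process_pool) for url in urls]
        return [f.result() for f in futures]
```

    Note that `Pool.apply` blocks the calling thread until the worker process returns, which is exactly what we want here: the thread is idle (GIL released) while the process does the heavy lifting.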

    That said, I don't quite see how your situation neatly divides into a purely I/O-bound part and a purely CPU-bound part. The call clip = mp.VideoFileClip('file.mp4') is clearly doing both I/O and processing of the data. Likewise, clip.audio.write_audiofile(r"file.mp3") does CPU processing to convert the video clip to an audio clip, which I would think is primarily CPU-bound, and then writes out the file, which is clearly I/O-bound.

    If the API had been designed differently, where the reading and writing of the files were separate methods, then I think utilizing both threading and multiprocessing would be more viable. For example:

    with open('file.mp4', 'rb') as f:
        mp4_file = f.read() # I/O
    clip = mp.VideoClipFromMemory(mp4_file) # CPU
    clip.convertToAudio() # CPU
    clip.writeFile('file.mp3') # I/O
    

    So the big question is: is your "job" of converting from video to audio more CPU-bound or more I/O-bound? If the former, you should use a multiprocessing pool, which might even benefit from a pool size larger than the number of CPU cores you have, because the processes will go into wait states whenever they are waiting for I/O to complete (the jobs are not purely CPU-bound). If the latter, you should use multithreading, since threads have less creation overhead. But I suspect you will do better with multiprocessing. The code below can use either, with a couple of small changes:

    import moviepy.editor as mp
    import glob
    import os
    from concurrent.futures import ProcessPoolExecutor as Executor
    # To use multithreading:
    # from concurrent.futures import ThreadPoolExecutor as Executor
    
    def converter(filename):
        clip = mp.VideoFileClip(f'{filename}.mp4')
        clip.audio.write_audiofile(f'{filename}.mp3')
    
    def main():
        # strip the extension; os.path.splitext also handles names containing dots
        mp4_filenames = map(lambda x: os.path.splitext(x)[0], glob.iglob('*.mp4'))
        POOL_SIZE = os.cpu_count() # number of cores
        # You might want to try a larger size, especially if you are using a thread pool:
        with Executor(max_workers=POOL_SIZE) as executor:
            executor.map(converter, mp4_filenames)
    
    # required for multiprocessing under Windows
    if __name__ == '__main__':
        main()
    

    Additional Comment/Suggestion

    My suggestion would be to try both approaches (ProcessPoolExecutor and ThreadPoolExecutor) on a small sample, say 100 files, using the same pool size, os.cpu_count(), and running against the same 100 files, just to see which one completes in less time. If it is the ProcessPoolExecutor run, you can then see whether increasing the pool size helps overlap the I/O and improves throughput. If it is the ThreadPoolExecutor run, you can greatly increase the thread pool size until you see a decrease in performance. A thread pool size of 100 (or more, when you are processing all the files) is not unreasonable.
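    That comparison can be sketched with a small timing harness; `converter` and the list `sample` of ~100 base names are assumed to come from the code above:

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def time_pool(executor_cls, func, items, pool_size):
    """Return the wall-clock seconds taken to map func over items."""
    start = time.perf_counter()
    with executor_cls(max_workers=pool_size) as executor:
        # Drain the iterator so we wait for every job to finish.
        list(executor.map(func, items))
    return time.perf_counter() - start

# Usage (assuming `converter` and `sample` as described above):
# import os
# for cls in (ProcessPoolExecutor, ThreadPoolExecutor):
#     print(cls.__name__, time_pool(cls, converter, sample, os.cpu_count()))
```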