I have a custom PyTorch model that bottlenecks my application due to how it is currently used.
The application is a web server built in Flask that receives job submissions for the PyTorch model to process. Due to the processing time of each job, I use Celery to handle the computation, where Flask queues the tasks for Celery to execute.
Each job consists of loading the PyTorch model from disk, moving the model and data to a GPU, and making a prediction on the submitted data. However, loading the model takes around 6 seconds, which in many cases is an order of magnitude or two longer than the prediction itself.
Thus, is it possible to load the model and move it to a GPU once at server startup (specifically, when the Celery worker starts), so that each job avoids reloading the model and copying it to the GPU? Ideally, I'd load the model and copy it to every available GPU at startup, leaving each Celery job to pick an available GPU and copy only its data over. Currently I only have one GPU, so a multi-GPU solution isn't a requirement at the moment, but I'm planning ahead.
Further, the memory constraints of the model and data allow for only one job per GPU at a time, so I have a single Celery worker that processes jobs sequentially. This could reduce the complexity of the solution, since it avoids multiple jobs trying to use the model in shared memory at the same time, so I figured I'd mention it.
As of this moment, I am using PyTorch's multiprocessing package with the forkserver start method, but I've had trouble determining exactly how it works and whether it behaves the way I want; a rough sketch of my current setup is below. If you have any input on my configuration or suggestions for a solution, please leave a comment! I'm open to efficiency suggestions, as I intend to scale this solution. Thank you!
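For reference, here is roughly what my current task looks like (the broker URL, model path, and task body are simplified placeholders, not my real code):

```python
import torch
import torch.multiprocessing as mp
from celery import Celery

# Set the start method once at import time, per the torch.multiprocessing docs.
mp.set_start_method("forkserver", force=True)

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def predict(job_data):
    # Every job currently pays the ~6 s cost of loading the model
    # from disk and moving it to the GPU.
    model = torch.load("model.pt", map_location="cpu")
    model.eval()
    model.to("cuda:0")

    inputs = torch.tensor(job_data, device="cuda:0")
    with torch.no_grad():
        output = model(inputs)
    return output.cpu().tolist()
```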
Yes, there are ways to share a PyTorch model across multiple processes without creating copies.
torch.multiprocessing and model.share_memory():
This method uses PyTorch's torch.multiprocessing module. You can call model.share_memory() on your model (the nn.Module method; the underlying tensor method is share_memory_()) to move its parameters and buffers into shared memory. All processes then access the same parameter storage, avoiding redundant copies. This approach is commonly used for training a model in parallel across multiple CPU cores.
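Here is a minimal, self-contained sketch of the pattern; the Net class, worker function, and tensor sizes are illustrative, not taken from your setup:

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn

class Net(nn.Module):
    """Stand-in for the real model."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 4)

    def forward(self, x):
        return self.fc(x)

def worker(model, batch):
    # The child process sees the same parameter storage; nothing is re-copied.
    with torch.no_grad():
        out = model(batch)
    print("worker output shape:", tuple(out.shape))

if __name__ == "__main__":
    # fork also works for CPU shared memory; spawn or forkserver is required
    # if CUDA tensors are involved, since the CUDA runtime doesn't support fork.
    mp.set_start_method("spawn", force=True)

    model = Net()
    model.share_memory()  # move parameters and buffers into shared memory

    processes = []
    for _ in range(2):
        p = mp.Process(target=worker, args=(model, torch.randn(8, 16)))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```

In your Celery setup, the analogous idea would be to do the load-and-share step once when the worker process starts, rather than inside each task.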
A resource for further exploration: https://www.geeksforgeeks.org/getting-started-with-pytorch/