pandas python-multiprocessing google-cloud-datalab

Is it possible to use Datalab with multiprocessing as a way to scale Pandas transformations?


I am trying to use Google Cloud Datalab to scale up data transformations in Pandas.

On my machine, everything works fine with small files (keeping only the first 100,000 rows of my file), but working with the full 8 GB input CSV file leads to a MemoryError.

I thought that a Datalab VM would help me. I first tried a high-memory VM, going up to 120 GB of memory. There, I keep getting this error: "The kernel appears to have died. It will restart automatically." I found a related question here: https://serverfault.com/questions/900052/datalab-crashing-despite-high-memory-and-cpu but I am not using TensorFlow, so it didn't help much.

So I tried a different approach: processing the file in chunks and parallelizing across more cores. It works well on my machine (4 cores, 12 GB RAM), but still requires hours of computation.

So I wanted to use a Datalab VM with 32 cores to speed things up, but there, after 5 hours, the first threads still had not finished, whereas on my local machine 10 had already completed.

So very simply:

Is it possible to use Datalab as a way to scale Pandas transformations? Why do I get worse results with a theoretically much better VM than with my local machine?

Some code:

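import pandas as pd
import numpy as np
from OOS_Case.create_features_v2 import process
# multiprocessing.dummy exposes the Pool API backed by threads, not processes
from multiprocessing.dummy import Pool as ThreadPool

# Load the full CSV, then slice it into chunks
df_pb = pd.read_csv('---.csv')
list_df = []
for i in range(-):
    df = df_pb.loc[---]
    list_df.append(df)

# Run process() on each chunk with 4 worker threads
pool = ThreadPool(4)
pool.map(process, list_df)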

All the operations in my process function are pure Pandas and NumPy operations.
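
The snippet above reads the entire CSV into memory before slicing it, which may be part of the memory pressure. As a rough sketch only, `pd.read_csv` can instead stream the file with its `chunksize` parameter so that one chunk is held at a time. The in-memory CSV, the `value` column and the `summarize` function below are hypothetical stand-ins, not taken from the question.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV; replace with the actual file path.
csv_source = io.StringIO(
    "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))
)


def summarize(chunk: pd.DataFrame) -> float:
    """Hypothetical placeholder for the real per-chunk transformation."""
    return float(chunk["value"].sum())


# With chunksize set, read_csv returns an iterator of DataFrames,
# so only one chunk of rows is materialized at a time.
partial_sums = [summarize(chunk) for chunk in pd.read_csv(csv_source, chunksize=4)]

# Combine the per-chunk results into a single answer.
total = sum(partial_sums)
```

This only helps when the transformation can be expressed chunk by chunk and the per-chunk results are small enough to combine afterwards.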

Thanks for any tips, alternatives, or best-practice advice you could give me!


Solution

  • It seems that GCP Datalab does not support multithreading:

    Each kernel is single threaded. Unless you are running multiple notebooks at the same time, multiple cores may not provide significant benefit.

    You can find more information here
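
Separately from the Datalab limitation quoted above, note that `multiprocessing.dummy.Pool` in the question is a thread pool, so all workers share a single Python interpreter. Many NumPy operations release the interpreter lock, but pure-Python parts of a transformation do not run in parallel across threads. A minimal sketch of a process-based variant follows, assuming the per-chunk function can be pickled. `split_frame`, `process_chunk` and `run_in_processes` are hypothetical names, and whether this actually speeds things up on a Datalab VM is not confirmed by the source.

```python
from multiprocessing import Pool

import numpy as np
import pandas as pd


def split_frame(df: pd.DataFrame, n_parts: int) -> list:
    """Split df into n_parts consecutive row blocks of roughly equal size."""
    bounds = np.linspace(0, len(df), n_parts + 1, dtype=int)
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the question's `process` function."""
    return chunk.assign(doubled=chunk["value"] * 2)


def run_in_processes(df: pd.DataFrame, n_workers: int = 4) -> pd.DataFrame:
    """Apply process_chunk to each block in a separate worker process."""
    chunks = split_frame(df, n_workers)
    # multiprocessing.Pool starts real processes, each with its own interpreter.
    with Pool(processes=n_workers) as pool:
        results = pool.map(process_chunk, chunks)
    return pd.concat(results)
```

In a script, the call to `run_in_processes(df)` should sit under an `if __name__ == "__main__":` guard. Each chunk is pickled and copied to its worker, so memory use and transfer time grow with the number of processes.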