I am trying to train completely independent tasks using multiprocessing pooling in Python, with LightGBM for the training (I am not sure if this is relevant to the problem). Here is the code:
from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
import lightgbm as lgb
import numpy as np
def functionToParallize(splitnumber=2):
    # Each call is an independent task: run LightGBM cross-validation on the toy data.
    data = load_breast_cancer()
    X = pd.DataFrame(data.data, columns=data.feature_names)
    y = pd.Series(data.target)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    folds = KFold(splitnumber)
    results = lgb.cv({}, lgb.Dataset(X_train, y_train), folds=folds, metrics=['rmse'])
    return results
To parallelize, I use a multiprocessing pool with, say, 2 workers. But it's really inefficient: finishing two tasks with a pool of 2 workers takes orders of magnitude longer than finishing a single task does. For example:
from multiprocessing import Pool
import time
import psutil
print(psutil.cpu_count())
output
4
starttime = time.time()
pool = Pool(2)
multiple_results = [pool.apply_async(functionToParallize) for i in [3]]
p = [res.get() for res in multiple_results]
print((time.time() - starttime) / 60)
output
0.007067755858103434
But with two tasks submitted to the pool:
starttime = time.time()
pool = Pool(2)
multiple_results = [pool.apply_async(functionToParallize) for i in [2, 3]]
p = [res.get() for res in multiple_results]
print((time.time() - starttime) / 60)
output
1.026989181836446
This is not actually my original task, but I am doing something similar there. In that case a single task takes about a minute, and with a pool of 2 the processing never finishes at all. Am I doing something wrong here? I am doing this in a Jupyter notebook, if that's relevant.
Any help is appreciated! Thanks!
I found the reason: LightGBM's internal threading conflicts with the custom process pool. Forcing LightGBM not to use multiple threads fixes it:
results = lgb.cv({'num_threads': 1}, lgb.Dataset(X_train, y_train), folds=folds, metrics=['rmse'])
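For completeness, here is a rough sketch of the whole thing with the fix applied (same toy example as above; the __main__ guard only matters if you run it as a script rather than in the notebook):

from multiprocessing import Pool
import time

import lightgbm as lgb
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, KFold

def functionToParallize(splitnumber=2):
    # Same toy task as before, but LightGBM is pinned to a single thread
    # so the worker processes don't compete for the same cores.
    data = load_breast_cancer()
    X = pd.DataFrame(data.data, columns=data.feature_names)
    y = pd.Series(data.target)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    folds = KFold(splitnumber)
    return lgb.cv({'num_threads': 1}, lgb.Dataset(X_train, y_train), folds=folds, metrics=['rmse'])

if __name__ == '__main__':
    starttime = time.time()
    with Pool(2) as pool:
        # Two independent CV runs, one per worker process.
        multiple_results = [pool.apply_async(functionToParallize, (s,)) for s in [2, 3]]
        p = [res.get() for res in multiple_results]
    print((time.time() - starttime) / 60)

Since LightGBM's parallelism is OpenMP-based, setting the OMP_NUM_THREADS environment variable to 1 before training should have a similar effect; the key point is that each worker process should not spawn a full set of its own threads on top of the pool.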
Thanks Manzi, Cheers