Tags: python, machine-learning, multiprocessing, data-science, lightgbm

Multiprocessing Pooling and LightGBM


I am trying to train completely independent tasks in parallel using a multiprocessing pool in Python, with LightGBM doing the training (I am not sure whether that is relevant to the problem). Here is the code:

from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
import lightgbm as lgb
import numpy as np

def functionToParallize(splitnumber=2):

    data = load_breast_cancer()
    X = pd.DataFrame(data.data, columns=data.feature_names)
    y = pd.Series(data.target)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    folds = KFold(splitnumber)
    results = lgb.cv({}, lgb.Dataset(X_train, y_train), folds=folds,metrics=['rmse'])
    return results

To parallelize, I use a multiprocessing pool with, say, 2 workers. But it is really inefficient: in the timings below (printed in minutes), running two tasks at once takes roughly 145 times as long as running a single task. For example:

import time
from multiprocessing import Pool
import psutil
print(psutil.cpu_count())

output

4

starttime=time.time()
pool = Pool(2)
multiple_results = [pool.apply_async(functionToParallize) for i in [3]]
p=[res.get() for res in multiple_results]
print((time.time()-starttime)/60)

output

0.007067755858103434

But with two tasks submitted to the same two-worker pool:

starttime=time.time()
pool = Pool(2)
multiple_results = [pool.apply_async(functionToParallize) for i in [2,3]]
p=[res.get() for res in multiple_results]
print((time.time()-starttime)/60)

output

1.026989181836446

This is not my actual task, but the real one is similar. There, a single task takes about a minute, and with a pool of 2 processes it never finishes at all. Am I doing something wrong here? I am running this in a Jupyter notebook, if that's relevant.

Any help is appreciated! Thanks!


Solution

  • I found the reason: LightGBM's internal multithreading conflicts with the multiprocessing pool. Forcing LightGBM to use a single thread helps:

    results = lgb.cv({'num_threads':1}, lgb.Dataset(X_train, y_train), folds=folds,metrics=['rmse'])
    

    Thanks Manzi, Cheers
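
For anyone who wants the fix in context, here is a minimal sketch of the whole pattern, assuming the same breast-cancer example as in the question. The names `single_thread_params`, `cv_task` and `run_parallel` are hypothetical helpers introduced only for illustration. Besides setting `num_threads` to 1, the sketch forwards the split count to the worker through the argument tuple of `apply_async`, which the calls in the question leave at the default.

```python
from multiprocessing import Pool


def single_thread_params(params=None):
    """Return a copy of params with LightGBM limited to one thread."""
    merged = dict(params or {})
    merged["num_threads"] = 1  # the setting the answer above reports as the fix
    return merged


def cv_task(splitnumber=2):
    """Run one independent cross-validation job (hypothetical worker)."""
    # Imports live inside the task; the parent process never needs to
    # load LightGBM itself.
    import lightgbm as lgb
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import KFold, train_test_split

    data = load_breast_cancer()
    X = pd.DataFrame(data.data, columns=data.feature_names)
    y = pd.Series(data.target)
    X_train, _, y_train, _ = train_test_split(X, y)
    return lgb.cv(
        single_thread_params(),
        lgb.Dataset(X_train, y_train),
        folds=KFold(splitnumber),
        metrics=["rmse"],
    )


def run_parallel(split_numbers, workers=2):
    """Submit one cv_task per split count and collect the results in order."""
    with Pool(workers) as pool:
        pending = [pool.apply_async(cv_task, (n,)) for n in split_numbers]
        return [res.get() for res in pending]
```

Calling `run_parallel([2, 3])` would mirror the two-task run from the question. On platforms where worker processes are started with spawn rather than fork, these functions may need to live in an importable module instead of a notebook cell.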