python machine-learning parallel-processing linear-regression python-scoop

Python Linear Regression in parallel - Scoop

I am trying to run in parallel a Linear Regression over 10000000 data point (4 features, 1 target variable) randomly generated from a Normal Distribution using Python's Scoop library. Here is the code:

import pandas as pd
import numpy as np
import random
from scoop import futures
import statsmodels.api as sm
from time import time

def linreg(vals):
    global model
    model = sm.OLS(y_vals,X_vals).fit()
    return model
    print(model.summary())    

if __name__ == '__main__':

random.seed(42)
vals = pd.DataFrame(np.random.normal(loc = 3, scale = 100, size =(10000000,5)))
vals.columns = ['dep', 'ind1', 'ind2', 'ind3', 'ind4']
y_vals = vals['dep']
X_vals = vals[['ind1', 'ind2', 'ind3', 'ind4']]

bt = time()
model_vals = list(map(linreg, [1,2,3]))
mval = model_vals[0]
print(mval.summary())
serial_time = time() - bt

bt1 = time()
model_vals_1 = list(futures.map(linreg, [1,2,3]))
mval_1 = model_vals_1[0]
print(mval_1.summary())
parallel_time = time() - bt1

print(serial_time, parallel_time)`

However, after that the regression summary is indeed produced in serial - via the Python's standard map function - an error:

Traceback (most recent call last): File "C:\Users\niccolo.gentile\AppData\Local\Continuum\anaconda3\envs\tensorenviron\lib\runpy.py", line 193, in _run_module_as_main "main", mod_spec) File "C:\Users\niccolo.gentile\AppData\Local\Continuum\anaconda3\envs\tensorenviron\lib\runpy.py", line 85, in _run_code exec(code, run_globals) File "C:\Users\niccolo.gentile\AppData\Local\Continuum\anaconda3\envs\tensorenviron\lib\site-packages\scoop\bootstrap__main__.py", line 302, in b.main() File "C:\Users\niccolo.gentile\AppData\Local\Continuum\anaconda3\envs\tensorenviron\lib\site-packages\scoop\bootstrap__main__.py", line 92, in main self.run() File "C:\Users\niccolo.gentile\AppData\Local\Continuum\anaconda3\envs\tensorenviron\lib\site-packages\scoop\bootstrap__main__.py", line 290, in run futures_startup() File "C:\Users\niccolo.gentile\AppData\Local\Continuum\anaconda3\envs\tensorenviron\lib\site-packages\scoop\bootstrap__main__.py", line 271, in futures_startup run_name="main" File "C:\Users\niccolo.gentile\AppData\Local\Continuum\anaconda3\envs\tensorenviron\lib\site-packages\scoop\futures.py", line 64, in _startup result = _controller.switch(rootFuture, *args, **kargs) File "C:\Users\niccolo.gentile\AppData\Local\Continuum\anaconda3\envs\tensorenviron\lib\site-packages\scoop_control.py", line 253, in runController raise future.exceptionValue File "C:\Users\niccolo.gentile\AppData\Local\Continuum\anaconda3\envs\tensorenviron\lib\site-packages\scoop_control.py", line 127, in runFuture future.resultValue = future.callable(*future.args, **future.kargs) File "C:\Users\niccolo.gentile\AppData\Local\Continuum\anaconda3\envs\tensorenviron\lib\runpy.py", line 263, in run_path pkg_name=pkg_name, script_name=fname) File "C:\Users\niccolo.gentile\AppData\Local\Continuum\anaconda3\envs\tensorenviron\lib\runpy.py", line 96, in _run_module_code mod_name, mod_spec, pkg_name, script_name) File "C:\Users\niccolo.gentile\AppData\Local\Continuum\anaconda3\envs\tensorenviron\lib\runpy.py", line 85, in _run_code exec(code, run_globals) File "Scoop_map_linear_regression1.py", line 33, in model_vals_1 = list(futures.map(linreg, [1,2,3])) File "C:\Users\niccolo.gentile\AppData\Local\Continuum\anaconda3\envs\tensorenviron\lib\site-packages\scoop\futures.py", line 102, in _mapGenerator for future in _waitAll(*futures): File "C:\Users\niccolo.gentile\AppData\Local\Continuum\anaconda3\envs\tensorenviron\lib\site-packages\scoop\futures.py", line 358, in _waitAll for f in _waitAny(future): File "C:\Users\niccolo.gentile\AppData\Local\Continuum\anaconda3\envs\tensorenviron\lib\site-packages\scoop\futures.py", line 335, in _waitAny raise childFuture.exceptionValue NameError: name 'y_vals' is not defined

is produced afterwards. This means that the code stops at model_vals_1 = list(futures.map(linreg, [1,2,3]))

Please carefully note that in order to be able to run the code in parallel, it has to be launched from the command line specifying the -m scoop parameter, like this:

python -m scoop Scoop_map_linear_regression1.py

Indeed, should it be launched without the -m scoop parameter, it would not be parallelized and would indeed actually run, but just using two times the built in Python's map function (hence, running two times in serial), as how you would get reported in the Warnings. That is, without specifying the -m scoop parameter when launching it, futures.map would be replaced by map, while the goal is instead to indeed run it in parallel using futures.map.

This clarification is done so to avoid people answering that they solved the problem by simply launching the code without the -m scoop parameter, as already has happened here:

Python Parallel Computing - Scoop

where, as a consequence of this, the question was wrongly put on hold as off topic because no more reproducible.

Many thanks in advance and any comment is highly appreciated and welcome.

Solution

The solution is to pass, as second argument of futures.map (but not necessarily of map), only [1].

Indeed, even though the linreg function doesn't use the second argument passed to map, it still determines how many times the linreg function will be run. As an example, consider the following basic example:

def welcome(x):
    print('Hello world!')

if __name__ == '__main__':
    a = list(map(welcome, [1,2]))

The function welcome doesn't actually need any argument, but still the output will be

Hello world!
Hello world!

repeated two times, that is the length of the list passed as second argument.

In this specific case, this implies that the linear regression will be run 3 times by map, despite the fact that the regression output will appear just once as the summary is called outside the map.

The point is that, instead, it is not possible to run multiple times the linear regression with futures.map. The problem with is that, apparently, after the first run it actually deletes the used datasets, from which the impossibility to continue with the second and third run, and the consequent

NameError: name 'y_vals' is not defined

thrown at the end of the Trace. This should be visible by navigating over: scoop.futures source code

Didn't go over all of it, but I guess the problem should be related with greenlet switchers.