Tags: python, multiprocessing

Need to understand multiprocessing


I am trying to apply multiprocessing to some code of mine. I will try to explain it as follows:

def convert_data(df):
    # A function to convert categorical/object datatypes to numerical format by performing OHE or vectorization.
    ...


def impute_data(df):
    # A function that imputes missing data using various methods.
    # Some of the models used for this are DecisionTreeClassifier, LogisticRegression, etc.
    # Uses the above convert_data() function on the training set.
    # Returns dataframe without any missing data
    ...

try:
    df = impute_data(df)
except Exception as e:
    print(e)
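For context, a hedged sketch of what convert_data() might look like. This is an assumption on my part: it one-hot encodes object/categorical columns with pandas get_dummies; the real function may also do text vectorization, which is omitted here.

```python
import pandas as pd

def convert_data(df):
    # One-hot encode object/categorical columns; numeric columns pass through.
    cat_cols = df.select_dtypes(include=["object", "category"]).columns
    return pd.get_dummies(df, columns=list(cat_cols))
```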

Now, inside the impute_data(), I try this:

import numpy as np
import pandas as pd
from multiprocessing import Pool

def impute_data(df):
    ...
    X = df.drop([col], axis=1)
    data_chunks = np.array_split(X, 100)  # the data has ~5 million rows
    with Pool() as pool:
        res = pool.map(convert_data, data_chunks)
    X = pd.concat(res)
    ...
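For reference, here is a minimal, self-contained sketch of the same chunk-and-map pattern with the guard in place. Plain lists stand in for the DataFrame, and convert_chunk is a stand-in for convert_data(); the chunking mimics np.array_split and the final flatten mimics pd.concat.

```python
from multiprocessing import Pool

def convert_chunk(chunk):
    # Stand-in for convert_data(): double every value in the chunk.
    return [x * 2 for x in chunk]

def run(data, n_chunks=4):
    # Split into roughly equal chunks (like np.array_split) ...
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # ... map them across worker processes ...
    with Pool() as pool:
        res = pool.map(convert_chunk, chunks)
    # ... and stitch the results back together (like pd.concat).
    return [x for chunk in res for x in chunk]

if __name__ == "__main__":
    # The guard is what prevents the bootstrapping RuntimeError when the
    # child interpreter re-imports this module under the spawn start method.
    print(run(list(range(8))))
```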

This results in an error:

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

Now I know this is happening because Pool should only be used under if __name__ == "__main__":, but I have to import this file from another file. How should I proceed from here?


EDIT

If I run the above code like this:

def impute_data(pool, df):
    ...
    X = df.drop([col], axis=1)
    data_chunks = np.array_split(X, 100)  # the data has ~5 million rows
    res = pool.map(convert_data, data_chunks)
    X = pd.concat(res)
    ...


if __name__ == "__main__":
    try:
        with Pool() as _pool:
            df = impute_data(_pool, df)
    except Exception as e:
        print(e)

Will this work? This assumes that this file is now the main script and is not imported anywhere else.
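The pattern above, creating the pool only under the guard and passing it down, can be sketched minimally like this (double and impute_like are hypothetical stand-ins for convert_data and impute_data):

```python
from multiprocessing import Pool

def double(x):
    return x * 2

def impute_like(pool, data):
    # The caller owns the pool, so this function stays safely importable:
    # no process is ever started at import time.
    return pool.map(double, data)

if __name__ == "__main__":
    # Pool creation happens only when this file is run as a script.
    with Pool(2) as _pool:
        print(impute_like(_pool, [1, 2, 3]))
```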


Solution

  • The other file that imports this one needs to use the same guard. If it doesn't, it is inherently incompatible with multiprocessing. Change the "some other file" (the main module mentioned in the error) so that it properly guards its script-like behavior; there is no portable way to solve this anywhere else (e.g. on Windows, where fork is simply unavailable).
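This works because __name__ is "__main__" only in the script actually being executed; in an imported module it is the module's own name, so a guarded block is skipped on import. A small demonstration, using a hypothetical module name helper_mod written to a temporary directory:

```python
import importlib
import pathlib
import sys
import tempfile

with tempfile.TemporaryDirectory() as d:
    # A one-line module that records whether it ran as the main script.
    path = pathlib.Path(d) / "helper_mod.py"
    path.write_text("ran_as_main = (__name__ == '__main__')\n")
    sys.path.insert(0, d)
    try:
        mod = importlib.import_module("helper_mod")
    finally:
        sys.path.remove(d)

# On import, __name__ was "helper_mod", not "__main__",
# so a guarded Pool would never have been created here.
print(mod.ran_as_main)
```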