I am building and fitting an HDBSCAN model on my data. When I run the script directly from its own file it works well and quickly, but when I import the file and call it from another script, it gets stuck in a loop that I can't explain: the same print statements repeat over and over, and I get the following error:
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information
Here is an excerpt of the code:
df_pos_raw, df_pos_training = pre_process_data(df_pos)
df_pos_training_std = standardize_df(df_pos_training) # Standardized data, column-wise
print "generating model"
pos_cls = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True)
print "fitting model to data"
pos_cls.fit(df_pos_training_std)
print 'done fitting model'
# sns.distplot(pos_cls.labels_, bins=len(set(pos_cls.labels_)))
df_filtered = filter_cons_types(df, [3, 5])
print "Done. returning variables"
return pos_cls, df_filtered
Here is the output when running from 'outside' the file:
Traceback (most recent call last):
File "<string>", line 1, in <module>
generating model
File "C:\ProgramData\Anaconda2\Lib\multiprocessing\forking.py", line 380, in main
fitting model to data
prepare(preparation_data)
File "C:\ProgramData\Anaconda2\Lib\multiprocessing\forking.py", line 510, in prepare
'__parents_main__', file, path_name, etc
File "C:\Users\sareetn\PycharmProjects\Arad\DataImputation\ClusteringExtrapolation\Dev\run_clustering_based_prediction.py", line 4, in <module>
model, raw_df = clustering()
File "C:\Users\sareetn\PycharmProjects\Arad\DataImputation\ClusteringExtrapolation\Dev\clustering_model_constype_3_5.py", line 86, in main
pos_cls.fit(df_pos_training_std)
File "C:\Users\sareetn\PycharmProjects\Arad\venv\lib\site-packages\hdbscan\hdbscan_.py", line 816, in fit
self._min_spanning_tree) = hdbscan(X, **kwargs)
File "C:\Users\sareetn\PycharmProjects\Arad\venv\lib\site-packages\hdbscan\hdbscan_.py", line 543, in hdbscan
core_dist_n_jobs, **kwargs)
File "C:\Users\sareetn\PycharmProjects\Arad\venv\lib\site-packages\sklearn\externals\joblib\memory.py", line 362, in __call__
return self.func(*args, **kwargs)
File "C:\Users\sareetn\PycharmProjects\Arad\venv\lib\site-packages\hdbscan\hdbscan_.py", line 239, in _hdbscan_boruvka_kdtree
n_jobs=core_dist_n_jobs, **kwargs)
File "hdbscan/_hdbscan_boruvka.pyx", line 375, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__ (hdbscan/_hdbscan_boruvka.c:5195)
File "hdbscan/_hdbscan_boruvka.pyx", line 411, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds (hdbscan/_hdbscan_boruvka.c:5915)
File "C:\Users\sareetn\PycharmProjects\Arad\venv\lib\site-packages\sklearn\externals\joblib\parallel.py", line 749, in __call__
n_jobs = self._initialize_backend()
File "C:\Users\sareetn\PycharmProjects\Arad\venv\lib\site-packages\sklearn\externals\joblib\parallel.py", line 547, in _initialize_backend
**self._backend_args)
File "C:\Users\sareetn\PycharmProjects\Arad\venv\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 305, in configure
'[joblib] Attempting to do parallel computing '
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information
generating model
fitting model to data
generating model
fitting model to data
generating model
fitting model to data
Thank you very much in advance!!
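A friend helped me figure it out.
HDBSCAN uses a library called joblib to split the work into parallel processes. Windows does not support forking, so each worker process starts by re-importing the main script. Any top-level code that is not guarded runs again in every worker, which is why "generating model" and "fitting model to data" kept repeating. To protect the code and allow the parallel processing to work, the script that is run needs
if __name__ == '__main__':
around its top-level code. After adding
if __name__ == '__main__':
to the script I run and placing its code (including the call to clustering()) under it, the clustering ran smoothly and quickly.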
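For anyone who wants to see the pattern in isolation, here is a minimal sketch of the guard. It uses the standard-library multiprocessing module as a stand-in for what joblib does inside hdbscan, so it is not the actual clustering code. The names square and run_parallel are made up for illustration.

```python
from multiprocessing import Pool


def square(x):
    # Worker function. It lives at module level so that worker
    # processes can find it when they re-import this file.
    return x * x


def run_parallel(values, n_workers=2):
    # Hypothetical stand-in for the parallel step (in the question,
    # the pos_cls.fit(...) call that joblib parallelises).
    pool = Pool(n_workers)
    try:
        return pool.map(square, values)
    finally:
        pool.close()
        pool.join()


if __name__ == '__main__':
    # Only the process that was started directly runs this block.
    # Worker processes that re-import the file skip it, so the
    # parallel step is not launched again inside each worker.
    print(run_parallel([1, 2, 3]))
```

If the last two lines were moved to the top level without the guard, every worker on Windows would try to start its own pool while importing the file, which is the kind of loop shown in the output above.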