I am trying to apply a function (preprocessing.scale) to a list of data. I am new to map/reduce and parallelism in Python - I would like to process a large list of data this way to improve performance.
Example:
X = [1,2,3,4]
Using the syntax:
list(map(preprocessing.scale, X))
I get this error:
TypeError: Singleton array array(1.0) cannot be considered a valid collection.
I think that is because of the return type of the function, but I am not sure how to fix this. Any help would be greatly appreciated!
You don't need (or want) to use map here: it applies scale to each element individually, so scale receives a single number rather than a collection, which is exactly what triggers the "Singleton array ... cannot be considered a valid collection" error. It also just runs a Python-level for loop under the hood.
Almost all sklearn methods are vectorized: they accept array-like objects (lists, NumPy arrays, etc.), and a single vectorized call is much faster than the map(...) approach (see the rough timing sketch below).
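To give a rough sense of that difference, here is a minimal timing sketch (the array size and variable names are just for illustration; it assumes numpy and scikit-learn are installed):

import time
import numpy as np
from sklearn.preprocessing import scale

x = np.random.rand(1_000_000)  # a reasonably large array, size chosen arbitrarily

t0 = time.perf_counter()
vectorized = scale(x)  # one vectorized call over the whole array
print('vectorized scale:', time.perf_counter() - t0, 'seconds')

t0 = time.perf_counter()
mean, std = x.mean(), x.std()
looped = np.array([(v - mean) / std for v in x])  # Python-level loop doing the same math
print('python loop     :', time.perf_counter() - t0, 'seconds')

print(np.allclose(vectorized, looped))  # both give the same result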
Demo:
In [121]: from sklearn.preprocessing import scale
In [122]: X = [1,2,3,4]
In [123]: scale(X)
Out[123]: array([-1.34164079, -0.4472136 , 0.4472136 , 1.34164079])
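For reference, the output is simply the standard score (x - mean) / std computed over the whole list; a minimal check (assuming NumPy is available):

import numpy as np
from sklearn.preprocessing import scale

X = [1, 2, 3, 4]
x = np.array(X, dtype=float)
manual = (x - x.mean()) / x.std()      # standard score using the population std (ddof=0)
print(np.allclose(scale(X), manual))   # True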
The same demo using a NumPy array:
In [38]: import numpy as np
In [39]: x = np.array(X)
In [40]: x
Out[40]: array([1, 2, 3, 4])
In [41]: scale(x)
DataConversionWarning: Data with input dtype int32 was converted to float64 by the scale function.
warnings.warn(msg, _DataConversionWarning)
Out[41]: array([-1.34164079, -0.4472136 , 0.4472136 , 1.34164079])
It expects a float dtype, so we can easily convert our NumPy array to float64 on the fly:
In [42]: scale(x.astype('float64'))
Out[42]: array([-1.34164079, -0.4472136 , 0.4472136 , 1.34164079])
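Alternatively (a small variant, not from the session above), you can build the array with a float dtype up front, so scale() performs no dtype conversion and the warning never appears:

import numpy as np
from sklearn.preprocessing import scale

X = [1, 2, 3, 4]
# Constructing the array as float64 directly avoids the DataConversionWarning.
print(scale(np.array(X, dtype='float64')))
# [-1.34164079 -0.4472136   0.4472136   1.34164079]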