python · mapreduce · scikit-learn · sklearn-pandas

How do I use Python's map function with sklearn's preprocessing.scale?


I am trying to apply a function (preprocessing.scale) to a list of data. I am new to map/reduce and parallelism in Python - I would like to process a large list of data this way to improve performance.

Example:

X = [1,2,3,4]

Using the syntax:

list(map(preprocessing.scale, X))

I get this error:

TypeError: Singleton array array(1.0) cannot be considered a valid collection.

I think that is because of the return type of the function, but I am not sure how to fix this. Any help would be greatly appreciated!


Solution

  • You don't need (or want) to use the map function here, since it just runs a Python-level for loop under the hood.

    Almost all sklearn methods are vectorized: they accept array-like objects (lists, NumPy arrays, etc.) directly, and this works much, much faster than the map(...) approach.

    Demo:

    In [121]: from sklearn.preprocessing import scale
    
    In [122]: X = [1,2,3,4]
    
    In [123]: scale(X)
    Out[123]: array([-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079])
    

    the same demo using numpy array:

    In [39]: x = np.array(X)
    
    In [40]: x
    Out[40]: array([1, 2, 3, 4])
    
    In [41]: scale(x)
    DataConversionWarning: Data with input dtype int32 was converted to float64 by the scale function.
      warnings.warn(msg, _DataConversionWarning)
    Out[41]: array([-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079])
    

    scale expects a float dtype, so we can easily convert the NumPy array to float64 on the fly:

    In [42]: scale(x.astype('float64'))
    Out[42]: array([-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079])