python · mapreduce · scikit-learn · sklearn-pandas

How do I use Python's map function with sklearn's preprocessing.scale?


I am trying to apply a function (preprocessing.scale) to a list of data. I am new to map/reduce and parallelism in Python - I would like to process a large list of data this way to improve performance.

Example:

X = [1,2,3,4]

Using the syntax:

list(map(preprocessing.scale, X))

I get this error:

TypeError: Singleton array array(1.0) cannot be considered a valid collection.

I think that is because of the return type of the function, but I am not sure how to fix this. Any help would be greatly appreciated!


Solution

  • You don't need (or want) to use the map function here, since it just runs a Python-level for loop under the hood.

    Almost all sklearn methods are vectorized: they accept array-like objects (lists, NumPy arrays, etc.) directly, and this works much, much faster than the map(...) approach.

    Demo:

    In [121]: from sklearn.preprocessing import scale
    
    In [122]: X = [1,2,3,4]
    
    In [123]: scale(X)
    Out[123]: array([-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079])
    

    the same demo using numpy array:

    In [39]: x = np.array(X)
    
    In [40]: x
    Out[40]: array([1, 2, 3, 4])
    
    In [41]: scale(x)
    DataConversionWarning: Data with input dtype int32 was converted to float64 by the scale function.
      warnings.warn(msg, _DataConversionWarning)
    Out[41]: array([-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079])
    

    scale expects a float dtype, so we can easily convert the NumPy array to float64 on the fly:

    In [42]: scale(x.astype('float64'))
    Out[42]: array([-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079])