Tags: python, machine-learning, scipy, scikit-learn, sparse-matrix

Scale (apply function?) sparse matrix logarithmically


I am using scikit-learn preprocessing scaling for sparse matrices.

My goal is to "scale" each feature column by taking logarithms whose base is derived from the column's maximum value. My wording may be inexact, so let me explain.

Say a feature column has the values 0, 8, 2:

  • Max value = 8
  • Log-8 of feature value 0 should be 0.0 = math.log(0+1, 8+1) (the +1 is to cope with zeros; so yes, we are actually taking log-base 9)
  • Log-8 of feature value 8 should be 1.0 = math.log(8+1, 8+1)
  • Log-8 of feature value 2 should be 0.5 = math.log(2+1, 8+1)
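The three values above can be checked directly with `math.log`, which takes an optional base argument (a quick sketch):

```python
import math

column = [0, 8, 2]
col_max = max(column)  # 8

# log base (max + 1) of (value + 1); the +1 shift maps 0 -> 0 and max -> 1
scaled = [math.log(v + 1, col_max + 1) for v in column]
print(scaled)  # approximately [0.0, 1.0, 0.5]
```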

Yes, I can easily apply an arbitrary function with FunctionTransformer, but I want the base of the log to change with each column (specifically, to depend on the column's maximum value). That is, I want to do something like MaxAbsScaler, only taking logarithms.

I see that MaxAbsScaler first computes a vector (scale) of the maximum absolute values of each column (code) and then multiplies the original matrix by 1 / scale (code).
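That two-step mechanism can be sketched by hand; one way to keep the scaling sparse is to multiply by a diagonal matrix on the right (a sketch, not MaxAbsScaler's actual implementation):

```python
import numpy as np
from scipy.sparse import csc_matrix, diags

X = csc_matrix([[1., 0, 8],
                [2., 0, 0],
                [0, 1., 2]])

# Step 1: per-column maximum absolute value (MaxAbsScaler's scale_ vector)
scale = abs(X).max(axis=0).toarray().ravel()

# Step 2: dividing each column by its scale is a right-multiplication by a
# diagonal matrix, which keeps the result sparse
X_scaled = X @ diags(1.0 / scale)

print(X_scaled.toarray())
```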

However, I don't know what to do if I want to take logarithms based on the scale vector. Is it even possible to turn the logarithm operation into a multiplication, or are there other efficient scipy sparse operations I could use?

I hope my intent is clear (and possible).


Solution

  • The logarithm of x in base b is log(x)/log(b), where the logs are natural. So the process you describe amounts to first applying the log(x+1) transformation to everything, and then scaling by the maximum absolute value. Conveniently, log(x+1) is a built-in NumPy function, log1p. Example:

    from sklearn.preprocessing import FunctionTransformer, maxabs_scale
    from scipy.sparse import csc_matrix
    import numpy as np

    # log1p preserves sparsity, since log1p(0) == 0
    logtran = FunctionTransformer(np.log1p, accept_sparse=True)
    X = csc_matrix([[ 1., 0, 8], [ 2., 0,  0], [ 0,  1., 2]])
    Y = maxabs_scale(logtran.transform(X))
    

    Output (sparse matrix Y):

      (0, 0)        0.630929753571
      (1, 0)        1.0
      (2, 1)        1.0
      (0, 2)        1.0
      (2, 2)        0.5
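As a sanity check, Y should agree element-wise with the desired log in base (column max + 1); a sketch reproducing the example densely:

```python
import numpy as np
from scipy.sparse import csc_matrix
from sklearn.preprocessing import FunctionTransformer, maxabs_scale

X = csc_matrix([[1., 0, 8], [2., 0, 0], [0, 1., 2]])
logtran = FunctionTransformer(np.log1p, accept_sparse=True)
Y = maxabs_scale(logtran.transform(X))

# Direct dense computation: log base (col_max + 1) of (x + 1)
dense = X.toarray()
col_max = dense.max(axis=0)
expected = np.log1p(dense) / np.log1p(col_max)  # log(x+1) / log(max+1)
print(np.allclose(Y.toarray(), expected))  # prints: True
```

This works because log1p is monotonic and the entries are non-negative, so the maximum of the log1p-transformed column is exactly log1p of the column maximum.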