Tags: python, machine-learning, scipy, scikit-learn, sparse-matrix

Scale (apply function?) sparse matrix logarithmically


I am using scikit-learn preprocessing scaling for sparse matrices.

My goal is to "scale" each feature column by taking logarithms whose base is derived from the column's maximum value. My wording may be inexact, so let me explain.

Say a feature column has the values 0, 8, 2:

  • Max value = 8
  • Log-8 of feature value 0 should be 0.0 = math.log(0+1, 8+1) (the +1 is to cope with zeros; so yes, we are actually taking log-base 9)
  • Log-8 of feature value 8 should be 1.0 = math.log(8+1, 8+1)
  • Log-8 of feature value 2 should be 0.5 = math.log(2+1, 8+1)
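The three values above can be checked directly with `math.log`, which takes an optional base argument (a quick sketch):

```python
import math

column = [0, 8, 2]
col_max = max(column)  # 8

# log base (max + 1) of (value + 1); the +1 shift maps 0 -> 0 and max -> 1
scaled = [math.log(v + 1, col_max + 1) for v in column]
print(scaled)  # approximately [0.0, 1.0, 0.5]
```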

Yes, I can easily apply an arbitrary function with FunctionTransformer, but I want the base of the log to change with each column (specifically, to depend on the column's maximum value). That is, I want to do something like MaxAbsScaler, only taking logarithms.

I see that MaxAbsScaler first computes a vector (scale) of the maximum absolute values of each column (code) and then multiplies the original matrix by 1 / scale (code).
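That two-step mechanism can be sketched by hand; one way to keep the scaling sparse is to multiply by a diagonal matrix on the right (a sketch, not MaxAbsScaler's actual implementation):

```python
import numpy as np
from scipy.sparse import csc_matrix, diags

X = csc_matrix([[1., 0, 8],
                [2., 0, 0],
                [0, 1., 2]])

# Step 1: per-column maximum absolute value (MaxAbsScaler's scale_ vector)
scale = abs(X).max(axis=0).toarray().ravel()

# Step 2: dividing each column by its scale is a right-multiplication by a
# diagonal matrix, which keeps the result sparse
X_scaled = X @ diags(1.0 / scale)

print(X_scaled.toarray())
```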

However, I don't know what to do if I want to take logarithms based on the scale vector. Is it even possible to turn the logarithm operation into a multiplication, or are there other efficient scipy sparse operations I could use?

I hope my intent is clear (and possible).


Solution

  • The logarithm of x in base b is log(x)/log(b), where the logs are natural. So the process you describe amounts to first applying the log(x+1) transformation to everything, and then scaling by the maximum absolute value. Conveniently, log(x+1) is a built-in NumPy function, log1p. Example:

    from sklearn.preprocessing import FunctionTransformer, maxabs_scale
    from scipy.sparse import csc_matrix
    import numpy as np

    # log1p preserves sparsity, since log1p(0) == 0
    logtran = FunctionTransformer(np.log1p, accept_sparse=True)
    X = csc_matrix([[ 1., 0, 8], [ 2., 0,  0], [ 0,  1., 2]])
    Y = maxabs_scale(logtran.transform(X))
    

    Output (sparse matrix Y):

      (0, 0)        0.630929753571
      (1, 0)        1.0
      (2, 1)        1.0
      (0, 2)        1.0
      (2, 2)        0.5
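As a sanity check, Y should agree element-wise with the desired log in base (column max + 1); a sketch reproducing the example densely:

```python
import numpy as np
from scipy.sparse import csc_matrix
from sklearn.preprocessing import FunctionTransformer, maxabs_scale

X = csc_matrix([[1., 0, 8], [2., 0, 0], [0, 1., 2]])
logtran = FunctionTransformer(np.log1p, accept_sparse=True)
Y = maxabs_scale(logtran.transform(X))

# Direct dense computation: log base (col_max + 1) of (x + 1)
dense = X.toarray()
col_max = dense.max(axis=0)
expected = np.log1p(dense) / np.log1p(col_max)  # log(x+1) / log(max+1)
print(np.allclose(Y.toarray(), expected))  # prints: True
```

This works because log1p is monotonic and the entries are non-negative, so the maximum of the log1p-transformed column is exactly log1p of the column maximum.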