Search code examples
pythonmachine-learningscikit-learncollaborative-filteringmatrix-factorization

Python Non negative Matrix Factorization that handles both zeros and missing data?


I look for a NMF implementation that has a python interface, and handles both missing data and zeros.

I don't want to impute my missing values before starting the factorization, I want them to be ignored in the minimized function.

It seems that neither scikit-learn, nor nimfa, nor graphlab, nor mahout propose such an option.

Thanks!


Solution

  • Using this Matlab to python code conversion sheet I was able to rewrite NMF from Matlab toolbox library.
    I had to decompose a 40k X 1k matrix with sparsity of 0.7%. Using 500 latent features my machine took 20 minutes for 100 iteration.

    Here is the method:

    import numpy as np
    from scipy import linalg
    from numpy import dot
    
    def nmf(X, latent_features, max_iter=100, error_limit=1e-6, fit_error_limit=1e-6):
        """
        Decompose X to A*Y
        """
        eps = 1e-5
        print 'Starting NMF decomposition with {} latent features and {} iterations.'.format(latent_features, max_iter)
        X = X.toarray()  # I am passing in a scipy sparse matrix
    
        # mask
        mask = np.sign(X)
    
        # initial matrices. A is random [0,1] and Y is A\X.
        rows, columns = X.shape
        A = np.random.rand(rows, latent_features)
        A = np.maximum(A, eps)
    
        Y = linalg.lstsq(A, X)[0]
        Y = np.maximum(Y, eps)
    
        masked_X = mask * X
        X_est_prev = dot(A, Y)
        for i in range(1, max_iter + 1):
            # ===== updates =====
            # Matlab: A=A.*(((W.*X)*Y')./((W.*(A*Y))*Y'));
            top = dot(masked_X, Y.T)
            bottom = (dot((mask * dot(A, Y)), Y.T)) + eps
            A *= top / bottom
    
            A = np.maximum(A, eps)
            # print 'A',  np.round(A, 2)
    
            # Matlab: Y=Y.*((A'*(W.*X))./(A'*(W.*(A*Y))));
            top = dot(A.T, masked_X)
            bottom = dot(A.T, mask * dot(A, Y)) + eps
            Y *= top / bottom
            Y = np.maximum(Y, eps)
            # print 'Y', np.round(Y, 2)
    
    
            # ==== evaluation ====
            if i % 5 == 0 or i == 1 or i == max_iter:
                print 'Iteration {}:'.format(i),
                X_est = dot(A, Y)
                err = mask * (X_est_prev - X_est)
                fit_residual = np.sqrt(np.sum(err ** 2))
                X_est_prev = X_est
    
                curRes = linalg.norm(mask * (X - X_est), ord='fro')
                print 'fit residual', np.round(fit_residual, 4),
                print 'total residual', np.round(curRes, 4)
                if curRes < error_limit or fit_residual < fit_error_limit:
                    break
    
    return A, Y
    

    Here I was using Scipy sparse matrix as input and missing values were converted to 0 using toarray() method. Therefore, the mask was created using numpy.sign() function. However, if you have nan values you could get same results by using numpy.isnan() function.