Search code examples
pythonbigdatasparse-matrixmatrix-factorizationnmf

Very Large and Very Sparse Non Negative Matrix factorization


I have a very large and also sparse matrix (531K x 315K), the number of total cells is ~167 Billion. The non-zero values are only 1s. Total number of non-zero values are around 45K. Is there an efficient NMF package to solve my problem? I know there are couple of packages for that and they are working well only for small size of data matrix. Any idea helps. Thanks in advance.


Solution

  • scikit-learn will handle this easily!

    Code:

    from time import perf_counter as pc
    import numpy as np
    import scipy.sparse as sps
    from sklearn.decomposition import NMF
    
    """ Create sparse data """
    nnz_i, nnz_j, nnz_val = np.random.choice(531000, size=45000), \
                            np.random.choice(315000, size=45000), \
                            np.random.random(size=45000)
    X =  sps.csr_matrix((nnz_val, (nnz_i, nnz_j)), shape=(531000, 315000))
    print('X-shape: ', X.shape, ' X nnzs: ', X.nnz)
    print('type(X): ', type(X))
    # <class 'scipy.sparse.csr.csr_matrix'> #                          !!!!!!!!!!
    
    """ NMF """
    model = NMF(n_components=50, init='random', random_state=0, verbose=True)
    
    start_time = pc()
    W = model.fit_transform(X)
    end_time = pc()
    
    print('Used (secs): ', end_time - start_time)
    print(model.reconstruction_err_)
    print(model.n_iter_)
    

    Output:

    X-shape:  (531000, 315000)  X nnzs:  45000
    type(X):  <class 'scipy.sparse.csr.csr_matrix'>
    violation: 1.0
    violation: 0.2318929397542804
    violation: 0.11045394409727402
    violation: 0.08104138988253409
    ...
    violation: 9.659665625799714e-05
    Converged at iteration 71
    Used (secs):  247.94092973091756
    122.27109041
    70
    

    Remarks:

    • Make sure you use sparse-matrices as input or you can't exploit sparsity
    • I'm using version 0.19.1, so the multiplicative-update solver is used (>= 0.19)
      • But the older CD-based solver should handle this too!
    • The above is using < 800 MB of memory

    Additional Constraints

    As mentioned in the comments, OP wants to add additional constraints, while still not specifying these formally.

    This will need a whole new implementation of some optimization-procedure including some theory-footwork (depending on the constraints).

    As an alternative, this can be solved by general-purpose Convex-Programming solvers. E.g. formulated by cvxpy and solved by SCS. Of course the alternating-minimization procedure needs to be done too (as the joint-problem is non-convex) and it will scale worse than this specialized sklearn-implementation. But it might work for OPs data.