I have to factorize a big sparse matrix ( 6.5mln rows representing users* 6.5mln columns representing items) to find users and items latent vectors. I chose the als algorithm in spark framework(pyspark). To boost the quality I have to reduce the sparsity of my matrix till 98%. (current value is 99.99% because I have inly 356mln of filled entries). I can do it by dropping rows or columns, but I must find the optimal solution maximizing number of rows(users). The main problem is that I must find some subsets of users and items sets, and dropping some row can drop some columns and vice versa, the second problem is that function that evaluates sparsity is not linear. Which way I can solve this problem? which libraries in python can help me with it? Thank you.
This is a combinatorial problem. There is no easy way to drop an optimal set of columns to achieve max number of users while reducing sparsity. A formal approach would be to formulate it as a mixed-integer program. Consider the following 0-1 matrix, derived from your original matrix C.
A(i,j) = 1 if C(i,j) is nonzero,
A(i,j) = 0 if C(i,j) is zero
Parameters:
M : a sufficiently big numeric value, e.g. number of columns of A (NCOLS)
N : total number of nonzeros in C (aka in A)
Decision vars are
x(j) : 0 or 1 implying whether column j is dropped or not
nr(i): number of nonzeros covered in row i
y(i) : 0 or 1 implying whether row i is dropped or not
Constraints:
A(i,:) x = nr(i) for i = 1..NROWS
nr(i) <= y(i) * M for i = 1..NROWS
@sum(nr(i)) + e = 0.98 * N # explicit slack 'e' to be penalized in the objective
y(i) and x(j) are 0-1 variables (binary variables) for i,j
Objective:
maximize @sum(y(i)) - N.e
Such a model would be extremely difficult to solve as an integer model. However, barrier methods should be able to solve the linear programming relaxations (LP) Possible solvers are Coin/CLP (open-source), Lindo (commercial) etc... It may then be possible to use the LP solution to compute approximate integer solutions by simple rounding.
In the end, you will definitely require an iterative approach which will require solving MF problem several times each time factoring a different submatrix of C, computed with above approach, until you are satisfied with the solution.