python, matrix, dataset, sparse-matrix, dimensionality-reduction

Eliminate zeros in a sparse matrix dataset


I have a dataset like this:

User    Movie
        0 1 2 3 4 
      0 2 0 5 0 0
      1 0 1 0 0 0
      2 0 5 5 5 0

Values from 1 to 5 are a user's rating of a movie; zero means the user has not reviewed it.

No column is full: the data are sparse, with at least one zero in every column.

I have seen that this introduces noise into the data, because many of the values are not really needed. What methods are there to remove this noise? I remember that instead of zero I can use an average value and then simplify the data in some way, but I'm not sure.

Any suggestions?


Solution

  • One idea when you have missing data (in your case, the zeros) is to use the known data to fill in the missing values. In other words, given a partial vector of features for an individual, we want to infer the remaining values. A trivial way to do this is to use the mean of the known values in the missing column (of course, the inferred value then does not depend on the known values for that person or on the values known for people like them). You could also, for example, cluster users (comparing two users only on the values known for both) and compute mean values for missing columns within each cluster.

    A very relevant body of literature to look into is matrix completion for recommender systems (which looks like what you are basically trying to do) and collaborative filtering. Imputation has been used but is rather expensive for large-scale datasets. Check out Koren et al., Matrix factorization techniques for recommender systems, for some of the techniques used.

    Another approach is to use semi-supervised probabilistic representation learning methods. Basically, you learn a generative model of the data, so that you can partially specify a representation and automatically infer the remaining values. One caveat is that this may be expensive, as you need to define a stochastic node per feature in this case. Consider, e.g., Siddharth et al., Learning Disentangled Representations with Semi-Supervised Deep Generative Models.
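
To make the first two ideas concrete, here is a minimal sketch in Python with NumPy, assuming that a zero always means "no review". The function names `impute_column_means` and `factorize` are hypothetical, and the rank, learning rate, regularization and step count are arbitrary choices for this toy matrix. The factorization is a plain full-batch gradient descent on the observed entries only; it illustrates the general idea of matrix factorization, not the specific algorithms described by Koren et al.

```python
import numpy as np


def impute_column_means(ratings):
    """Replace zeros (treated as missing) with the mean of the known ratings
    in the same column. Columns with no known rating fall back to the mean
    of all known ratings."""
    ratings = np.asarray(ratings, dtype=float)
    observed = ratings > 0
    counts = observed.sum(axis=0)
    sums = np.where(observed, ratings, 0.0).sum(axis=0)
    global_mean = ratings[observed].mean() if observed.any() else 0.0
    col_means = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)
    # col_means has one value per column and broadcasts across the rows.
    return np.where(observed, ratings, col_means)


def factorize(ratings, rank=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Fit ratings ~ P @ Q.T using only the known (non-zero) entries,
    then return the dense reconstruction, which also fills the gaps."""
    ratings = np.asarray(ratings, dtype=float)
    observed = ratings > 0
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    P = 0.1 * rng.standard_normal((n_users, rank))  # user factors
    Q = 0.1 * rng.standard_normal((n_items, rank))  # movie factors
    for _ in range(steps):
        # Error is computed on known entries only; missing ones contribute 0.
        err = np.where(observed, ratings - P @ Q.T, 0.0)
        P_step = err @ Q - reg * P
        Q_step = err.T @ P - reg * Q
        P += lr * P_step
        Q += lr * Q_step
    return P @ Q.T


# The example matrix from the question: rows are users, columns are movies.
ratings = np.array([
    [2, 0, 5, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 5, 5, 5, 0],
])

filled = impute_column_means(ratings)
approx = factorize(ratings)
print(filled)
print(np.round(approx, 2))
```

Column-mean imputation gives every user the same guess for a given movie, while the factorization lets the guess depend on each user's known ratings. A movie with no ratings at all (the last column here) carries no information either way, so any value predicted for it should not be trusted.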