Tags: algorithm, machine-learning, classification, sparse-matrix, knn

Handling Incomplete Data (Data Sparsity) in kNN


I am trying to create a simple recommender system using kNN.

Let's say I have a table:

User | Book1 | Book2 | Book3 | Book4 | Book5 | Book6 | Book7 |
1    | 5     | ?     | 3     | ?     | 4     | 3     | 2     |
2    | 3     | 4     | ?     | 2     | 3     | 4     | 2     |
3    | 4     | 2     | 1     | ?     | ?     | 3     | 3     |
4    | 2     | 5     | 3     | ?     | 4     | 1     | 1     |
5    | 1     | 1     | 4     | 3     | 1     | ?     | 1     |
6    | 5     | 2     | 5     | 4     | 4     | 2     | ?     |

So to find the possible scores for User 1, I was thinking I would just take the absolute difference between User 1's ratings and the other users' ratings on the books they have in common, and then use that difference to find out which user in the list is "closest" to User 1. But in a real-world situation there would be many more ?/unknown scores. So how do I deal with those unknown scores when using kNN?

I don't have any code, as I've yet to really understand how to implement this.

Any help is appreciated!


Solution

  • You don't have "unknown features"; you have incomplete data points.

    This is actually a well-known problem in kNN, and there is a thoroughly validated pattern for dealing with it.

    Although the problem is actually an "incomplete data" problem, in the kNN context it's often (usually?) referred to as the sparsity problem.

    In practice, the sparsity problem is, with the possible exception of efficient storage/retrieval of the data that comprise the model, the crux of building kNN models.

    For instance, consider Amazon.com's recommendation engine, in which product ratings are the features comprising the columns and users comprise the rows. For this matrix to be 100% complete, every Amazon customer would have to have purchased and reviewed every single product Amazon sells. The actual sparsity of this matrix must be > 95%.

    The most common technique (and still state-of-the-art as far as I know) is known as NNMA, or non-negative matrix approximation. This technique is also often referred to, incorrectly, as NNMF, in which F stands for factorization. (NNMA is based on a factorization technique, but the result is not factors of the original data matrix.) I mention this because this alternate term, though incorrect, is widely used, so I would include it in my search engine queries.

    In essence, this technique can be used to remove sparsity from a matrix, or put another way, to populate the missing cells (i.e., the cell for the customer at row R who has not reviewed the product in column C). The idea is to approximate the data matrix D as the product of two non-negative matrices, W and H, chosen to minimize the reconstruction error between D and W·H; because the product W·H is fully populated, it supplies estimates for the missing cells.

    You can find a complete implementation of nnma, including an accompanying tutorial (in python + numpy) in Albert Au Yeung Ching-man's blog.

    Alternatively, there are several python packages (available via PyPI) that contain packaged code for NNMA. I have only used one of these, PyMF, which you can find at Google Code.
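
    If you prefer the packaged route, here is a minimal sketch of the same idea using scikit-learn's NMF class rather than PyMF (scikit-learn is not mentioned above, and the rank and iteration count below are arbitrary choices, so treat them as assumptions to tune for your data):

    import numpy as NP
    from sklearn.decomposition import NMF
    
    # the ratings table from the question, with 0 in each missing ("?") cell
    d = NP.array([[5, 0, 3, 0, 4, 3, 2],
                  [3, 4, 0, 2, 3, 4, 2],
                  [4, 2, 1, 0, 0, 3, 3],
                  [2, 5, 3, 0, 4, 1, 1],
                  [1, 1, 4, 3, 1, 0, 1],
                  [5, 2, 5, 4, 4, 2, 0]], dtype=float)
    
    # n_components is the rank of the approximation; 4 is an arbitrary choice here
    model = NMF(n_components=4, init='random', random_state=0, max_iter=500)
    w = model.fit_transform(d)    # users x components
    h = model.components_         # components x books
    d_approx = NP.dot(w, h)       # dense approximation; read predictions off the former 0 cells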

    So that you can see how NNMA works its magic, here is my simple but complete implementation of NNMA in python + NumPy:

    import numpy as NP
    
    def cf(q, v):
        """ the cost function: squared Frobenius distance between q and v """
        qv = (q - v)**2
        return NP.sum(NP.sum(qv, axis=0))
    
    
    def nnma(d, max_iter=100):
        x, y = d.shape
        z = y                            # rank of the approximation (here, the full width of d)
        w = NP.random.rand(x, y)         # left factor, random non-negative initialization
        h = NP.random.rand(y, z)         # right factor, random non-negative initialization
        for i in range(max_iter):
            wh = NP.dot(w, h)
            cost = cf(d, wh)             # distance between d and the current approximation
            if cost == 0:
                break
            # multiplicative update for h: h *= (w.T d) / (w.T w h)
            hn = NP.dot(w.T, d)
            hd = NP.dot(NP.dot(w.T, w), h)
            h *= hn/hd
            # multiplicative update for w: w *= (d h.T) / (w h h.T)
            wn = NP.dot(d, h.T)
            wd = NP.dot(NP.dot(w, h), h.T)
            w *= wn/wd
        return NP.dot(w, h)              # the dense approximation of d
    

    To use this NNMA function, just pass in a 2D array (matrix) with a "0" for each missing cell (in other words, your data matrix, with a "0" inserted for each missing value):

    >>> d    # the original (sparse) data matrix with missing cells denoted by "0"s
    
      array([[ 7.,  0.,  4.,  7.,  0.,  1.],
             [ 3.,  9.,  7.,  3.,  1.,  7.],
             [ 4.,  4.,  3.,  7.,  3.,  9.],
             [ 4.,  8.,  0.,  9.,  2.,  1.],
             [ 6.,  3.,  9.,  5.,  9.,  3.],
             [ 6.,  1.,  4.,  4.,  1.,  0.],
             [ 0.,  4.,  8.,  6.,  0.,  5.],
             [ 9.,  0.,  6.,  0.,  5.,  2.],
             [ 6.,  8.,  4.,  6.,  3.,  7.],
             [ 3.,  6.,  3.,  8.,  7.,  2.]])
    
    >>> d1 = nnma(d)     # call nnma, passing in the original data matrix
    
    >>> d1    # the approximated data matrix with all missing values populated
    
       array([[ 6.998,  0.29 ,  3.987,  7.008,  0.292,  0.796],
              [ 2.989,  8.92 ,  6.994,  3.02 ,  1.277,  7.053],
              [ 4.007,  4.496,  2.999,  7.01 ,  3.107,  8.695],
              [ 4.005,  8.019,  0.254,  9.002,  1.917,  0.89 ],
              [ 5.998,  3.014,  9.001,  4.991,  8.983,  3.052],
              [ 5.992,  1.077,  4.007,  3.976,  0.753,  0.464],
              [ 0.346,  3.436,  7.993,  5.988,  0.194,  5.355],
              [ 9.001,  0.124,  5.997,  0.375,  5.02 ,  1.867],
              [ 6.   ,  7.994,  3.998,  6.   ,  2.999,  7.009],
              [ 2.995,  6.022,  3.001,  7.987,  6.939,  2.185]])
    

    So as you can see, the results aren't too bad, particularly for a very simple implementation. All of the missing items are populated, and the remainder of the values are pretty close to the corresponding values in the original data matrix; e.g., column 0, row 0 is 7.0 in the original data matrix and 6.998 in the approximated one.
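
    Once you have the dense matrix, you can return to the original kNN question: choose a distance measure, find the k users closest to the target user, and use their ratings to score the books that user hasn't read. Below is a minimal sketch of that step in plain NumPy; the Euclidean distance, the simple averaging of the neighbors' ratings, k=2, and the helper name knn_scores are all my own assumptions, not part of NNMA:

    import numpy as NP
    
    def knn_scores(ratings, target_row, k=2):
        """ predict scores for one user from the k nearest users in a dense ratings matrix """
        target = ratings[target_row]
        others = NP.delete(NP.arange(ratings.shape[0]), target_row)
        # Euclidean distance from the target user to every other user
        dists = NP.sqrt(((ratings[others] - target)**2).sum(axis=1))
        nearest = others[NP.argsort(dists)[:k]]    # row indices of the k closest users
        return ratings[nearest].mean(axis=0)       # average the neighbors' ratings per book
    
    For example, knn_scores(d1, 0) would give one predicted score per book for the first user; you would then look only at the positions that were "?" in the original ratings table.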