python numpy scipy sparse-matrix collaborative-filtering

How can I create a sparse matrix from coordinate information?

First of all, I want to summarize how I arrived at this particular problem. I wanted to create a song recommender using collaborative filtering method. But the problem is that I have a very large dataset at hand, 1m rows x 2.2m columns. If my understanding is correct, I needed to create a sparse matrix in order to move forward with my idea, since I do not know of anything that can hold a matrix with the size of 1m x 2.2m.* Hence, sparse matrix.

Now, since this matrix will only contain 1s or 0s in the cells, I've somehow mapped out which cells should have 1 if I were to create a hypothetical monstrous matrix. The information I have looks like this;

rows	locations
row1	[56110, 78999, 1508886, 2090010]
row2	[1123, 976554]
...	...
row1000000	[334555, 2200100]

The problem is that I don't know how to create a sparse matrix using this information. I've checked many sources but couldn't find any viable solution. If you could help me, I would very much appreciate it. Also, if you have any notes on collaborative filtering methods that utilize sparse matrices I would also be very grateful.

Solution

There are several ways you could do this. Here is one that creates a csr_matrix, since the data that you show is close to this format. (That docstring has a terse explanation of the csr_matrix attributes data, indices and indptr.) Whether or not this is the best method (for some definition of "best") depends on the actual "raw" form of your data (among other things).

I assume you can put the data that you show in the locations column into a list of lists, called locations. It is important that there is an entry in locations for each row, even if the list is empty. I also assume that the values given in locations are 0-based indices that correspond to the column of the matrix. Here's an example, for an array that has shape (5, 8).

In [23]: locations = [[2, 3], [], [1, 3, 5], [0, 1, 7], [7]]

To form indptr, we compute the cumulative sum of the lengths of the lists, and prepend a 0:

In [28]: lengths = np.array([len(t) for t in locations])

In [29]: lengths
Out[29]: array([2, 0, 3, 3, 1])

In [30]: indptr = np.concatenate(([0], lengths.cumsum()))

In [31]: indptr
Out[31]: array([0, 2, 2, 5, 8, 9])

indices is just the flattened version of locations. Note that sum() in the following is the Python builtin sum() function, not np.sum. That function call concatenates all the lists in locations.

In [32]: indices = sum(locations, start=[])

In [33]: indices
Out[33]: [2, 3, 1, 3, 5, 0, 1, 7, 7]

The data for the array is an array of 1s that is the same length as indices:

In [38]: data = np.ones_like(indices)

We now have all the pieces we need to create a SciPy csr_matrix:

In [39]: from scipy.sparse import csr_matrix

In [40]: A = csr_matrix((data, indices, indptr))

In [41]: A
Out[41]: 
<5x8 sparse matrix of type '<class 'numpy.int64'>'
    with 9 stored elements in Compressed Sparse Row format>

In [42]: A.toarray()
Out[42]: 
array([[0, 0, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 1, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1]])