Tags: python, machine-learning, scipy, pca, dimensionality-reduction

Dimension Reduction (TSNE/PCA) on Sparse Matrix


I want to apply a dimension reduction (DR) technique to visualize my data and see how the items relate to each other. I am planning to use Barnes-Hut t-SNE, but I can't figure out how to provide the input, because the sample application described in the user guide expects data in regular (dense) matrix form. I have around 12 million records with 5000 distinct values, and I can't store a dense matrix of that size in main memory. I want to perform dimension reduction so that I can visualize those distinct values on a 2-D scatter plot. My data is in adjacency-list form (as it is too sparse).

Let's say I have the following records:

2 3 10
4 6
7
7 9 10
2
5 6

These are my first 6 records, and in this example there are only 10 distinct values. The table above means that in the 1st record, columns 2, 3, and 10 are 1 while all other columns are 0 (an adjacency list).

These distinct values are mapped to the words (labels) present in the document (record).
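
For concreteness, here is a minimal Python/scipy sketch of loading such records into a compressed sparse row (CSR) matrix without ever building the dense table (the variable names and the 1-based-to-0-based index shift are assumptions, not part of the original data):

    import numpy as np
    from scipy.sparse import csr_matrix

    # The six toy records above; column ids are 1-based in the listing.
    records = [[2, 3, 10], [4, 6], [7], [7, 9, 10], [2], [5, 6]]
    n_cols = 10  # number of distinct values (5000 in the real data)

    indptr = np.cumsum([0] + [len(r) for r in records])      # row boundaries
    indices = np.array([c - 1 for r in records for c in r])  # shift ids to 0-based columns
    data = np.ones(len(indices), dtype=np.float32)           # every listed entry is a 1

    X = csr_matrix((data, indices, indptr), shape=(len(records), n_cols))
    print(X.toarray())  # only sensible for this tiny example, never for 12M x 5000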

How can I run Barnes-Hut (fast) t-SNE on such data? Or how can I convert my data into the format that t-SNE requires? Which language should I prefer?

I prefer to use Python or Matlab, but anything else is fine as well. Let me know your suggestions.

P.S. I have a very powerful machine available for this task.


Solution

  • The Barnes-Hut t-SNE code doesn't support this out of the box, but it should be a relatively straightforward change to add support for it. In particular, see the following line of code: https://github.com/lvdmaaten/bhtsne/blob/master/tsne.cpp#L123

    This line fills row_P, col_P, and val_P with an N×N similarity matrix in compressed sparse row (CSR) format. That is, row_P has N+1 elements that contain indices into col_P and val_P, which both have nnz elements (N is the number of rows and nnz the number of nonzero entries of the sparse matrix). The elements of val_P are assumed to be non-negative (for instance, Gaussian kernel values).

    I think the easiest thing you can do is replace this function call with a call to a new function that computes the similarity matrix from your own input data (operating on whatever sparse format is most convenient for you). You could even implement the similarity computation in Matlab and then write a bit of MEX code that takes the resulting sparse matrix and copies it into row_P, col_P, and val_P. This should be easy because Matlab also stores sparse matrices in a compressed format internally; have a look at the mxGetIr and mxGetJc MEX functions. (A rough Python sketch of the same row_P/col_P/val_P mapping follows after this answer.)

    The rest of the Barnes-Hut t-SNE code is agnostic of how the input similarities were computed, so you should not have to make any other changes.
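
For example, if the similarities are computed in Python rather than Matlab, a scipy CSR matrix already exposes the three arrays that correspond to row_P, col_P, and val_P. The sketch below is a rough, non-authoritative illustration: the k-nearest-neighbour search with cosine distance and the exp(-d) kernel are assumptions chosen for the binary toy data above, not the exact computation that tsne.cpp performs internally.

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.neighbors import NearestNeighbors

    # Binary CSR matrix built from the toy adjacency lists in the question.
    records = [[2, 3, 10], [4, 6], [7], [7, 9, 10], [2], [5, 6]]
    indptr = np.cumsum([0] + [len(r) for r in records])
    indices = np.array([c - 1 for r in records for c in r])
    X = csr_matrix((np.ones(len(indices)), indices, indptr), shape=(len(records), 10))

    def knn_similarities(X, k=3):
        """Sparse kNN similarity matrix; indptr/indices/data map onto row_P/col_P/val_P."""
        nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(X)
        dist, idx = nn.kneighbors(X)          # each row's k+1 nearest rows (incl. itself)
        dist, idx = dist[:, 1:], idx[:, 1:]   # drop the self-neighbour
        sims = np.exp(-dist)                  # non-negative values, as the code expects
        n = X.shape[0]
        return csr_matrix((sims.ravel(), idx.ravel(), np.arange(0, n * k + 1, k)),
                          shape=(n, n))

    P = knn_similarities(X)
    print(P.indptr)   # N+1 entries  -> row_P
    print(P.indices)  # nnz entries  -> col_P
    print(P.data)     # nnz entries  -> val_P

The resulting indptr, indices, and data arrays could then be copied into row_P, col_P, and val_P in place of the call at the line linked above.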