
How can I merge together two co-occurrence matrices with overlapping but not identical vocabularies?


I'm looking at word co-occurrence in a number of documents. For each set of documents, I find a vocabulary of the N most frequent words. I then make an NxN matrix for each document representing whether the words occur together in the same context window (sequence of k words). This is a sparse matrix, so if I have M documents, I have an NxNxM sparse matrix. Because Matlab cannot store sparse matrices with more than 2 dimensions, I flatten this matrix into a (NxN)xM sparse matrix.
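The flattening described above could be sketched as follows. This is only an illustration: the sizes and the `word_i`, `word_j`, `doc` triplets are made-up placeholders, not the asker's data.

```matlab
% Hypothetical sizes and co-occurrence triplets, for illustration only
N = 4;                      % vocabulary size
M = 3;                      % number of documents
word_i = [1; 2; 4; 1];      % row word index of each co-occurring pair
word_j = [2; 3; 4; 2];      % column word index of each pair
doc    = [1; 1; 2; 3];      % document in which each pair was observed

% Column-major linear index of cell (i, j) within one N-by-N slice
flat_row = sub2ind([N, N], word_i, word_j);

% One column per document; each column is a flattened N-by-N slice
C = sparse(flat_row, doc, ones(size(doc)), N*N, M);

% Recover the dense N-by-N matrix for document 1
C1 = reshape(full(C(:, 1)), N, N);
```

Because the flattening is column-major, `reshape` on a single column recovers that document's N-by-N matrix.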

I face the problem that I generated 2 of these co-occurrence matrices for different sets of documents. Because the sets were different, the vocabularies are different. Instead of merging the sets of documents together and recalculating the co-occurrence matrix, I'd like to merge the two existing matrices together.

For example,

N = 5; % Size of vocabulary
M = 5; % Number of documents
A = ones(N*N, M); % A is a flattened (N, N, M) matrix
B = 2*ones(N*N, M); % B is a flattened (N, N, M) matrix
A_ind = {'A', 'B', 'C', 'D', 'E'}; % The vocabulary labels for A
B_ind = {'A', 'F', 'B', 'C', 'G'}; % The vocabulary labels for B

Should merge to produce a (49, 5) matrix, where each (49, 1) slice can be reshaped into a (7, 7) matrix with the following structure:

     A     B     C     D     E     F     G
 __________________________________________
 A|   3     3     3     1     1     2     2
 B|   3     3     3     1     1     2     2
 C|   3     3     3     1     1     2     2
 D|   1     1     1     1     1     0     0
 E|   1     1     1     1     1     0     0
 F|   2     2     2     0     0     2     2
 G|   2     2     2     0     0     2     2

Where A and B overlap, the co-occurrence counts should be added together. Otherwise, the elements should be the counts from A or the counts from B. There will be some elements (0's in the example) where I don't have count statistics because some of the vocabulary is exclusively in A and some is exclusively in B.


Solution

  • The key is that a 2-D logical mask can be flattened with (:) and then used to index the rows of the flattened array.

    A = ones(25, 5);
    B = 2*ones(25,5);
    A_ind = {'A', 'B', 'C', 'D', 'E'};
    B_ind = {'A', 'F', 'B', 'C', 'G'};
    
    new_ind = [A_ind, B_ind(~ismember(B_ind, A_ind))];
    new_size = length(new_ind)^2; 
    new_array = zeros(new_size, 5); 
    
    % Find the indices that correspond to elements of A
    A_overlap = double(ismember(new_ind, A_ind)); 
    A_mask = (A_overlap'*A_overlap)==1;
    
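    % Find the indices that correspond to elements of B
    B_overlap = double(ismember(new_ind, B_ind)); 
    B_mask = (B_overlap'*B_overlap)==1;
    
    % The mask fills cells in new_ind order, so reorder B's rows to match
    % (B_ind lists its labels in a different order than new_ind does)
    [~, B_order] = ismember(new_ind(B_overlap==1), B_ind);
    B_pos = reshape(1:numel(B_ind)^2, numel(B_ind), numel(B_ind));
    B_pos = B_pos(B_order, B_order);
    
    % Flatten the logical indices to assign the elements to the new array
    % (A needs no reordering because new_ind starts with A_ind)
    new_array(A_mask(:), :) = A;
    new_array(B_mask(:), :) = new_array(B_mask(:), :) + B(B_pos(:), :);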
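
The question's matrices are sparse, whereas `zeros(new_size, 5)` allocates a dense array, so a sparse variant may be worth considering. The following is a minimal sketch, not the answer's method: it maps every label to its position in the merged vocabulary explicitly, so the order of labels in either vocabulary does not matter, and it relies on `sparse` summing values that share the same (row, column) pair. The small inputs and the names `A_map`, `B_map` and `merged` are hypothetical.

```matlab
% Hypothetical small inputs; labels and sizes are for illustration only
A_ind = {'A', 'B', 'C'};
B_ind = {'C', 'D', 'A'};                 % overlaps A_ind, in a different order
M = 2;                                   % number of documents
A = sparse(reshape(1:9*M, 9, M));        % flattened (3, 3, M) counts
B = sparse(10 * reshape(1:9*M, 9, M));   % flattened (3, 3, M) counts

% Merged vocabulary: A's labels first, then the labels only B has
new_ind = [A_ind, B_ind(~ismember(B_ind, A_ind))];
K = numel(new_ind);

% Position of every old label inside the merged vocabulary
[~, A_loc] = ismember(A_ind, new_ind);
[~, B_loc] = ismember(B_ind, new_ind);

% Map each flattened (row, col) cell of A and B to its merged linear index
[rA, cA] = ndgrid(A_loc, A_loc);
[rB, cB] = ndgrid(B_loc, B_loc);
A_map = sub2ind([K, K], rA(:), cA(:));
B_map = sub2ind([K, K], rB(:), cB(:));

% Rebuild as sparse; sparse() adds up entries that land on the same cell
[iA, jA, vA] = find(A);
[iB, jB, vB] = find(B);
merged = sparse([A_map(iA); B_map(iB)], [jA; jB], [vA; vB], K*K, M);

% Inspect the merged K-by-K matrix for document 1
doc1 = reshape(full(merged(:, 1)), K, K);
```

Cells where neither vocabulary has statistics are never stored, so they read back as zeros, which matches the 0 entries in the expected table.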