I'm looking at word co-occurrence across a number of documents. For each set of documents, I find a vocabulary of the N most frequent words. I then build an NxN matrix for each document that records whether two words occur together in the same context window (a sequence of k words). Each of these matrices is sparse, so with M documents I have an NxNxM sparse array. Because MATLAB cannot store sparse arrays with more than 2 dimensions, I flatten it into an (N*N)xM sparse matrix.
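To make the flattened layout concrete, here is a minimal sketch of how per-document NxN matrices might be stacked as columns of an (N*N)xM sparse matrix and recovered again. The sizes, the co-occurring word pairs, and the names C1, C2, flat and C2_back are illustrative assumptions, not my real data.

```matlab
N = 3;  % vocabulary size (illustrative)
M = 2;  % number of documents (illustrative)

% Hypothetical per-document co-occurrence matrices (symmetric, sparse)
C1 = sparse([1 2], [2 1], 1, N, N);  % words 1 and 2 co-occur in document 1
C2 = sparse([2 3], [3 2], 1, N, N);  % words 2 and 3 co-occur in document 2

% Flatten each NxN matrix column-major with (:) and store it as one column
flat = [C1(:), C2(:)];               % (N*N) x M, still sparse

% Recover document 2's NxN matrix from its column
C2_back = reshape(flat(:, 2), N, N);
```

Because (:) and reshape both use column-major order, the round trip returns the original matrix.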
My problem is that I generated two of these co-occurrence matrices, one for each of two different sets of documents. Because the sets were different, the vocabularies are different too. Instead of merging the sets of documents and recalculating the co-occurrence matrix, I'd like to merge the two existing matrices.
For example,
N = 5; % Size of vocabulary
M = 5; % Number of documents
A = ones(N*N, M); % A is a flattened (N, N, M) matrix
B = 2*ones(N*N, M); % B is a flattened (N, N, M) matrix
A_ind = {'A', 'B', 'C', 'D', 'E'}; % The vocabulary labels for A
B_ind = {'A', 'F', 'B', 'C', 'G'}; % The vocabulary labels for B
A and B should merge to produce a (49, 5) matrix, where each (49, 1) slice can be reshaped into a (7, 7) matrix with the following structure:
     A  B  C  D  E  F  G
   _____________________
A |  3  3  3  1  1  2  2
B |  3  3  3  1  1  2  2
C |  3  3  3  1  1  2  2
D |  1  1  1  1  1  0  0
E |  1  1  1  1  1  0  0
F |  2  2  2  0  0  2  2
G |  2  2  2  0  0  2  2
Where A and B overlap, the co-occurrence counts should be added together. Otherwise, the elements should be the counts from A or the counts from B. There will be some elements (0's in the example) where I don't have count statistics because some of the vocabulary is exclusively in A and some is exclusively in B.
The key is that a 2-D logical mask can be flattened with (:) and then used to index the rows of the flattened array.
A = ones(25, 5);
B = 2*ones(25,5);
A_ind = {'A', 'B', 'C', 'D', 'E'};
B_ind = {'A', 'F', 'B', 'C', 'G'};
% Merged vocabulary: all of A's words, followed by the words unique to B
new_ind = [A_ind, B_ind(~ismember(B_ind, A_ind))];
new_size = length(new_ind)^2;
new_array = zeros(new_size, 5);
% Find the indices that correspond to elements of A
A_overlap = double(ismember(new_ind, A_ind));
A_mask = (A_overlap'*A_overlap)==1;
% Find the indices that correspond to elements of B
B_overlap = double(ismember(new_ind, B_ind));
B_mask = (B_overlap'*B_overlap)==1;
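% Flatten the logical indices to assign the elements to the new array
% A_ind is a prefix of new_ind, so A is already in merged-vocabulary order
new_array(A_mask(:), :) = A;
% B_ind is ordered differently from new_ind, so permute B's rows into
% merged-vocabulary order before adding (invisible when B is all 2s)
[~, B_order] = ismember(new_ind(B_overlap == 1), B_ind);
B_rows = reshape(1:size(B, 1), numel(B_ind), numel(B_ind));
B_rows = B_rows(B_order, B_order);
new_array(B_mask(:), :) = new_array(B_mask(:), :) + B(B_rows(:), :);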
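Since constant matrices cannot show whether a row or column lands in the wrong place, it may help to check a merge with entries that encode their own position. The following is a sketch of an alternative that uses linear indices from sub2ind instead of logical masks, so it does not depend on how either vocabulary is ordered. The test values and the names merged, K, A_pos, B_pos, A_lin, B_lin and doc1 are assumptions made for illustration.

```matlab
A_ind = {'A', 'B', 'C', 'D', 'E'};
B_ind = {'A', 'F', 'B', 'C', 'G'};
N = 5;  % vocabulary size of each input
M = 2;  % number of documents (illustrative)

% Illustrative data: each entry encodes its own (row, column) position
[ri, ci] = ndgrid(1:N, 1:N);
A = repmat(10*ri(:) + ci(:), 1, M);        % A(i,j) = 10*i + j
B = repmat(1000*ri(:) + 100*ci(:), 1, M);  % B(i,j) = 1000*i + 100*j

% Merged vocabulary: A's words, then the words unique to B
new_ind = [A_ind, B_ind(~ismember(B_ind, A_ind))];
K = numel(new_ind);
merged = zeros(K*K, M);

% Position of each A word in the merged vocabulary, then the linear
% index in the K-by-K layout for every (row, column) pair of A
[~, A_pos] = ismember(A_ind, new_ind);
[rA, cA] = ndgrid(A_pos, A_pos);
A_lin = sub2ind([K K], rA(:), cA(:));
merged(A_lin, :) = merged(A_lin, :) + A;

% Same for B; its own word order is handled by B_pos
[~, B_pos] = ismember(B_ind, new_ind);
[rB, cB] = ndgrid(B_pos, B_pos);
B_lin = sub2ind([K K], rB(:), cB(:));
merged(B_lin, :) = merged(B_lin, :) + B;

% Inspect the merged matrix for the first document
doc1 = reshape(merged(:, 1), K, K);
```

Here doc1(2, 6), the ('B', 'F') entry, comes only from B, where 'B' is word 3 and 'F' is word 2, so it holds 3200. The ('A', 'B') entry doc1(1, 2) is the sum 12 + 1300 = 1312, and ('D', 'F') stays 0 because neither input has counts for that pair.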