Tags: string, performance, file, octave, sparse-matrix

octave very slow when constructing word-sentence matrix


I have a vocabulary (vector of strings) and a file full of sentences. I want to construct a matrix that shows how often each sentence contains each word. My current implementation is dreadfully slow and I believe this can be much faster. It takes almost a minute for a single sentence of about ten words.

Can you explain why this is and how to speed it up?

Notes: I use a sparse matrix, as the full matrix would not otherwise fit into memory. The vocabulary contains roughly 10,000 words. Running the program does not exhaust my working memory, so that can't be the problem.

Here is the relevant code. Variables not shown here, such as totalLineCount, vocab and vocabCount, are initialized earlier.

% initialize the word-sentence matrix
wordSentenceMatrix = sparse(vocabCount, totalLineCount);
% fill the matrix from the sentence file
fid = fopen(fileLocation, 'r');
lineCount = 0;
while ~feof(fid),
    line = fgetl(fid);
    lineCount = lineCount + 1;
    line = strsplit(line, " ");
    % go through each word and increase the corresponding value in the matrix
    for j=1:size(line,2),
        for k=1:vocabCount,
            w1 = line(j);
            w2 = vocab(k);
            if strcmp(w1, w2),
                wordSentenceMatrix(k, lineCount) = wordSentenceMatrix(k, lineCount) + 1;
            end;
        end;
    end;
end;

Solution

  • A sparse matrix is actually stored as three arrays in memory. In simplified terms, you can picture its storage as one array of row indices, one array of column indices, and one array of nonzero values. (The slightly more involved scheme used in practice is called compressed sparse column.)
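
    As a quick illustration (a sketch, not part of the original answer), the triplet view of a sparse matrix can be inspected with find():

    % a small sparse matrix and its triplet representation
    S = sparse([1 3 2], [2 1 3], [10 20 30]);
    [i, j, v] = find(S)    % displays row indices, column indices, nonzero values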

    By growing the sparse matrix element by element, your code changes the structure of that matrix (its sparsity pattern) on every insertion. This is not recommended, because each such change involves a lot of memory copying.
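
    To see the difference, here is a minimal sketch (the sizes are made up for demonstration and are not from the original answer) comparing element-wise insertion with building the matrix in a single call from index/value triplets:

    % hypothetical sizes, purely for demonstration
    nRows = 10000; nCols = 500; nEntries = 5000;
    i = randi(nRows, nEntries, 1);
    j = randi(nCols, nEntries, 1);

    % slow: each new nonzero forces the stored arrays to be restructured
    tic;
    S1 = sparse(nRows, nCols);
    for t = 1:nEntries
        S1(i(t), j(t)) = S1(i(t), j(t)) + 1;
    end
    toc;

    % fast: one call; duplicate (i, j) pairs are summed automatically
    tic;
    S2 = sparse(i, j, ones(nEntries, 1), nRows, nCols);
    toc;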

    Your way of looking up the index of a word in the vocabulary is also very slow, because for every word of a sentence you scan the entire vocabulary. A better way is to use a Java HashMap, which MATLAB can call directly.

    I modified your code to the following:

    rowIdx = [];
    colIdx = [];
    % map each vocabulary word to its row index for constant-time lookup
    vocabHashMap = java.util.HashMap;
    for k = 1 : vocabCount
        vocabHashMap.put(vocab{k}, k);
    end
    
    fid = fopen(fileLocation, 'r');
    lineCount = 0;
    while ~feof(fid),
        line = fgetl(fid);
        lineCount = lineCount + 1;
        line = strsplit(line, " ");
        % record a (row, column) pair for every word occurrence;
        % this assumes every word is in vocab (get() returns [] for unknown words)
        for j = 1 : length(line)
            rowIdx = [rowIdx; vocabHashMap.get(line{j})];
            colIdx = [colIdx; lineCount];
        end
    end
    fclose(fid);
    assert(length(rowIdx) == length(colIdx));
    nonzeros = length(rowIdx);
    % sparse() sums the values of repeated (row, column) pairs, so a word that
    % occurs several times in one sentence is counted correctly; passing the
    % dimensions keeps the matrix at its full vocabCount-by-totalLineCount size
    wordSentenceMatrix = sparse(rowIdx, colIdx, ones(nonzeros, 1), vocabCount, totalLineCount);
    

    Of course, if you know the total number of word occurrences in your text collection (the number of nonzeros) a priori, you should preallocate rowIdx and colIdx:

    rowIdx = zeros(nonzeros, 1);
    colIdx = zeros(nonzeros, 1);
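
    If that count is not known in advance, one common pattern (a sketch, not from the original answer; maxEntries is an assumed upper bound) is to preallocate generously and keep a running write position, trimming the unused tail at the end:

    % assumed generous upper bound on the number of word occurrences
    maxEntries = 1000000;
    rowIdx = zeros(maxEntries, 1);
    colIdx = zeros(maxEntries, 1);
    pos = 0;

    fid = fopen(fileLocation, 'r');
    lineCount = 0;
    while ~feof(fid),
        line = strsplit(fgetl(fid), " ");
        lineCount = lineCount + 1;
        for j = 1 : length(line)
            pos = pos + 1;
            rowIdx(pos) = vocabHashMap.get(line{j});
            colIdx(pos) = lineCount;
        end
    end
    fclose(fid);

    % keep only the filled part, then build the matrix in one call
    rowIdx = rowIdx(1:pos);
    colIdx = colIdx(1:pos);
    wordSentenceMatrix = sparse(rowIdx, colIdx, ones(pos, 1), vocabCount, totalLineCount);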
    

    Please port it to Octave if you can.
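
    For an Octave-native port, one option (a sketch, not part of the original answer, assuming vocab is a cell array of words) is to replace the Java HashMap with a containers.Map, which Octave supports; only the lookup changes:

    % Octave-native lookup table: maps each vocabulary word to its row index
    vocabMap = containers.Map(vocab, 1:vocabCount);

    % example lookups
    idx = vocabMap(vocab{1});              % row index of a known word
    known = isKey(vocabMap, 'someword');   % guard against out-of-vocabulary words

    Inside the sentence loop, vocabMap(line{j}) then takes the place of vocabHashMap.get(line{j}).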