Updating an N-gram two-dimensional cell array in MATLAB


I am trying to extract bigrams from a set of words and store them in a matrix. What I want is to put each word in the first row of its column, followed by all the bigrams of that word in the rows below it.

For example, if I have the string 'database file there', my output should be:

database   file  there
da         fi    th
at         il    he
ta         le    er
ab               re
.. 

I have tried the following, but it gives me only the bigrams, without the original word:

collection = fileread('e:\m.txt');
collection = regexprep(collection,'<.*?>','');
collection = lower(collection);
collection = regexprep(collection,'\W',' ');
collection = strtrim(regexprep(collection,'\s*',' '));
temp = regexprep(collection,' ',''',''');
eval(['words = {''',temp,'''};']);

word = char(words(1));
word2 =  regexp(word, sprintf('\\w{1,%d}', 1), 'match');     
bi = cellfun(@(x,y) [x '' y], word2(1:end-1)', word2(2:end)','un',0);

This handles only the first word; however, I want to do the same for every word in the 1x1000 cell array "words".

Is there an efficient way to accomplish this, as I will be dealing with around 1 million words?

I am new to MATLAB, so any resource that explains how to work with matrices (updating elements, deleting them, ...) would be helpful.

regards, Ashraf


Solution

  • If you are looking for a cell array as the output, this might work for you:

    input_str = 'database file there' % input (no semicolon, so it is echoed)
    
    str1_split = regexp(input_str,'\s','Split'); % split words into cells
    NW = numel(str1_split); % number of words
    char_arr1 = char(str1_split'); % NW-row char array, words padded with spaces
    % linear indices: column k covers characters k and k+1 of every word
    ind1 = bsxfun(@plus,[1:NW*2]',[0:size(char_arr1,2)-2]*NW);
    t1 = reshape(char_arr1(ind1),NW,2,[]); % word x position-in-pair x pair
    t2 = reshape(permute(t1,[2 1 3]),2,[])'; % char array with one row per pair
    
    out = reshape(mat2cell(t2,ones(1,size(t2,1)),2),NW,[])'; % one column per word
    out(reshape(any(t2==' ',2),NW,[])')={''}; % blank out pairs that contain padding
    out = [str1_split ; out] % prepend the words as the first row
    

    Code Output -

    input_str =
    database file there
    
    out = 
        'database'    'file'    'there'
        'da'          'fi'      'th'   
        'at'          'il'      'he'   
        'ta'          'le'      'er'   
        'ab'          ''        're'   
        'ba'          ''        ''     
        'as'          ''        ''     
        'se'          ''        ''
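
If the index arithmetic above is hard to follow, here is a minimal sketch of a per-word alternative that builds the same table. It is not the method used above: it swaps the padded char array for `arrayfun`/`cellfun`, and it assumes `strsplit` is available (R2013a or later). The names `words`, `bigrams_of`, `grams` and `max_len` are made up for illustration, and the input string stands in for the text read from the file in the question.

```matlab
% Hypothetical input; in the question this would be the cleaned-up file text.
input_str = 'database file there';

% Split on whitespace; strsplit collapses repeated delimiters by default,
% so no eval is needed to build the cell array of words.
words = strsplit(strtrim(input_str));

% Bigrams of one word as a column cell array (characters k and k+1).
% A one-character word gives an empty 0x1 cell.
bigrams_of = @(w) arrayfun(@(k) w(k:k+1), (1:numel(w)-1)', ...
                           'UniformOutput', false);

% One cell per word, each holding that word's bigram list.
grams = cellfun(bigrams_of, words, 'UniformOutput', false);

% Build the padded table: words in row 1, bigrams below, '' as filler.
max_len = max(cellfun(@numel, grams));
out = repmat({''}, max_len + 1, numel(words));
out(1, :) = words;
for k = 1:numel(words)
    out(2:numel(grams{k}) + 1, k) = grams{k};
end
```

For the example string this should give the same 8x3 cell array as shown above. With around a million words, the padded table is probably the expensive part, since every column is as long as the longest word; keeping `grams` as it is (one bigram list per word) and skipping the padding step may be the more practical layout, though that depends on what you do with the bigrams afterwards.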