
Compare two cell arrays for identical rows - MATLAB


I have one cell array of strings with 40,000 rows and another with 400. I need to find the rows of the first array that match a row of the second. Note that rows can be repeated many times.

The 40,000 rows look like this:

Anna Frank  
Anna George  
Jane Peter  
Anna George  
Jane Peter    
etc.

Among these I need to find the rows that match:

Anna George  
Jane Peter  

The only way I have found so far uses two nested for loops with an if in between, but it is quite slow:

for i=2:size(bigTable,1)
    for j = 1: size(smallTable,1)
        if sum(ismember(bigTable(i,1:2),smallTable(j,1:2))) == 2
            Total_R(size(Total_R,1)+1,1)= i;
        end
    end
end

Solution

  • I am assuming your input is set up like this -

    bigTable = 
        'Anna'    'Frank' 
        'Anna'    'George'
        'Jane'    'Peter' 
        'Anna'    'George'
        'Jane'    'Peter' 
    smallTable = 
        'Anna'    'George'
        'Jane'    'Peter' 
    

    Two approaches can be suggested for this case.

    Approach #1

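    ismember based approach. Note that ismember does not support the 'rows' option for cell arrays of strings, so each cell is tested on its own: a row is reported when both of its cells occur anywhere in smallTable, in any row or column -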

    Total_R = find(sum(ismember(bigTable,smallTable,'rows'),2)==2)
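
    If each row has to match as a whole pair, and not just have both names occur somewhere in smallTable, one possible sketch is to join the two columns of every row into a single key and run ismember on the keys. The names bigKeys and smallKeys and the '|' separator are assumptions made for illustration; the separator must be a character that never occurs in the data.

    ```matlab
    % Assumed input, same as the example above.
    bigTable   = {'Anna','Frank'; 'Anna','George'; 'Jane','Peter'; ...
                  'Anna','George'; 'Jane','Peter'};
    smallTable = {'Anna','George'; 'Jane','Peter'};

    % Join the two columns of every row into one key, e.g. 'Anna|George'.
    % '|' is an assumed separator that must not appear in the names.
    bigKeys   = strcat(bigTable(:,1),   '|', bigTable(:,2));
    smallKeys = strcat(smallTable(:,1), '|', smallTable(:,2));

    % ismember on cell arrays of strings compares whole keys,
    % so a row counts only when both columns match the same small row.
    Total_R = find(ismember(bigKeys, smallKeys));
    ```

    With this version a row such as 'Anna' 'Peter' would not be reported, because the key 'Anna|Peter' does not exist in smallKeys.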
    

    Approach #2

    %// Assign unique labels to each cell for both small and big cell arrays, so that
    %// later on you would be dealing with numeric arrays only and 
    %// do not have to mess with cell arrays that were slowing you down
    [unqbig,matches1,idx] = unique([bigTable(:) ; smallTable(:)]);
    big_labels = reshape(idx(1:numel(bigTable)),size(bigTable));
    small_labels = reshape(idx(numel(bigTable)+1:end),size(smallTable));
    
    %// Detect which rows from small_labels exactly match with those from big_labels
    Total_R  = find(ismember(big_labels,small_labels,'rows'))
    

    Or replace that ismember from the last line with a bsxfun based implementation -

    Total_R = find(any(all(bsxfun(@eq,big_labels,permute(small_labels,[3 2 1])),2),3))
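
    On MATLAB R2016b and later, == expands singleton dimensions implicitly, so the same comparison can be sketched without bsxfun. The label matrices below are written out by hand as the values unique would assign for the assumed input (Anna=1, Frank=2, George=3, Jane=4, Peter=5); the variable name matches is an assumption for illustration.

    ```matlab
    % Numeric labels for the assumed input, one row per table row.
    big_labels   = [1 2; 1 3; 4 5; 1 3; 4 5];
    small_labels = [1 3; 4 5];

    % Move the small rows to the third dimension (1-by-2-by-M) and compare.
    % Implicit expansion yields an N-by-2-by-M logical array.
    matches = big_labels == permute(small_labels,[3 2 1]);

    % A big row matches if all columns agree (dim 2) for any small row (dim 3).
    Total_R = find(any(all(matches,2),3));
    ```

    The intermediate array has N*2*M elements, so for 40,000 by 400 rows it holds 32 million logicals; the ismember 'rows' version on the numeric labels avoids that allocation.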
    

    Output from these approaches for the assumed input case -

    Total_R =
         2
         3
         4
         5