Search code examples
matlabperformancepermutationcell-array

Find unique rows of a cell array considering all possible permutations on each row


I have cell array A of dimension m * k.

I want to keep the rows of A unique up to an order of the k cells.

The "tricky" part is "up to an order of the k cells": consider the k cells in the ith row of A, A(i,:); there could be a row j of A, A(j,:), that is equivalent to A(i,:) up to a re-ordering of its k cells, meaning that for example if k=4it could be that:

A{i,1}=A{j,2}
A{i,2}=A{j,3}
A{i,3}=A{j,1}
A{i,4}=A{j,4}

What I am doing at the moment is:

G=[0 -1 1; 0 -1 2; 0 -1 3; 0 -1 4; 0 -1 5; 1 -1 6; 1 0 6; 1 1 6; 2 -1 6; 2 0 6; 2 1 6; 3 -1 6; 3 0 6; 3 1 6]; 
h=7;
M=reshape(G(nchoosek(1:size(G,1),h),:),[],h,size(G,2));
A=cell(size(M,1),2);
for p=1:size(M,1)
    A{p,1}=squeeze(M(p,:,:)); 
    left=~ismember(G, A{p,1}, 'rows');
    A{p,2}=G(left,:); 
end

%To find equivalent rows up to order I use a double loop (VERY slow).
indices=[]; 
for j=1:size(A,1)
    if ismember(j,indices)==0 %if we have not already identified j as a duplicate
        for i=1:size(A,1)
            if i~=j
               if (isequal(A{j,1},A{i,1}) || isequal(A{j,1},A{i,2}))...
                  &&...
                  (isequal(A{j,2},A{i,1}) || isequal(A{j,2},A{i,2}))...
                  indices=[indices;i]; 
               end
            end
        end
    end
end
A(indices,:)=[];

It works but it is too slow. I am hoping that there is something quicker that I can use.


Solution

  • I'd like to propose another idea, which has some conceptual resemblance to erfan's. My idea uses hash functions, and specifically, the GetMD5 FEX submission.

    The main task is how to "reduce" each row in A to a single representative value (such as a character vector) and then find unique entries of this vector.

    Judging by the benchmark vs. the other suggestions, my answer doesn't perform as well as one of the alternatives, but I think its raison d'être lies in the fact that it is completely data-type agnostic (within the limitations of the GetMD51), that the algorithm is very straightforward to understand, it's a drop-in replacement as it operates on A, and that the resulting array is exactly equal to the one obtained by the original method. Of course this requires a compiler to get working and has a risk of hash collisions (which might affect the result in VERY VERY rare cases).

    Here are the results from a typical run on my computer, followed by the code:

    Original method timing:     8.764601s
    Dev-iL's method timing:     0.053672s
    erfan's method timing:      0.481716s
    rahnema1's method timing:   0.009771s
    

    function q39955559
    G=[0 -1 1; 0 -1 2; 0 -1 3; 0 -1 4; 0 -1 5; 1 -1 6; 1 0 6; 1 1 6; 2 -1 6; 2 0 6; 2 1 6; 3 -1 6; 3 0 6; 3 1 6]; 
    h=7;
    M=reshape(G(nchoosek(1:size(G,1),h),:),[],h,size(G,2));
    A=cell(size(M,1),2);
    for p=1:size(M,1)
        A{p,1}=squeeze(M(p,:,:)); 
        left=~ismember(G, A{p,1}, 'rows');
        A{p,2}=G(left,:); 
    end
    
    %% Benchmark:
    tic
    A1 = orig_sort(A);
    fprintf(1,'Original method timing:\t\t%fs\n',toc);
    
    tic
    A2 = hash_sort(A);
    fprintf(1,'Dev-iL''s method timing:\t\t%fs\n',toc);
    
    tic
    A3 = erfan_sort(A);
    fprintf(1,'erfan''s method timing:\t\t%fs\n',toc);
    
    tic
    A4 = rahnema1_sort(G,h);
    fprintf(1,'rahnema1''s method timing:\t%fs\n',toc);
    
    assert(isequal(A1,A2))
    assert(isequal(A1,A3))
    assert(isequal(numel(A1),numel(A4)))  % This is the best test I could come up with...
    
    function out = hash_sort(A)
    % Hash the contents:
    A_hashed = cellfun(@GetMD5,A,'UniformOutput',false);
    % Sort hashes of each row:
    A_hashed_sorted = A_hashed;
    for ind1 = 1:size(A_hashed,1)
      A_hashed_sorted(ind1,:) = sort(A_hashed(ind1,:));
    end
    A_hashed_sorted = cellstr(cell2mat(A_hashed_sorted));
    % Find unique rows:
    [~,ia,~] = unique(A_hashed_sorted,'stable');
    % Extract relevant rows of A:
    out = A(ia,:);
    
    function A = orig_sort(A)
    %To find equivalent rows up to order I use a double loop (VERY slow).
    indices=[]; 
    for j=1:size(A,1)
        if ismember(j,indices)==0 %if we have not already identified j as a duplicate
            for i=1:size(A,1)
                if i~=j
                   if (isequal(A{j,1},A{i,1}) || isequal(A{j,1},A{i,2}))...
                      &&...
                      (isequal(A{j,2},A{i,1}) || isequal(A{j,2},A{i,2}))...
                      indices=[indices;i]; 
                   end
                end
            end
        end
    end
    A(indices,:)=[];
    
    function C = erfan_sort(A)
    STR = cellfun(@(x) num2str((x(:)).'), A, 'UniformOutput', false);
    [~, ~, id] = unique(STR);
    IC = sort(reshape(id, [], size(STR, 2)), 2);
    [~, col] = unique(IC, 'rows');
    C = A(sort(col), :); % 'sort' makes the outputs exactly the same.
    
    function A1 = rahnema1_sort(G,h)
    idx = nchoosek(1:size(G,1),h);
    %concatenate complements
    M = [G(idx(1:size(idx,1)/2,:),:), G(idx(end:-1:size(idx,1)/2+1,:),:)];
    %convert to cell so A1 is unique rows of A
    A1 = mat2cell(M,repmat(h,size(idx,1)/2,1),repmat(size(G,2),2,1));
    

    1 - If more complicated data types need to be hashed, one can use the DataHash FEX submission instead, which is somewhat slower.