Search code examples
arraysmatlabvectorizationreverse-lookup

Reverse lookup with non-unique values


What I'm trying to do

I have an array of numbers:

>> A = [2 2 2 2 1 3 4 4];

And I want to find the array indices where each number can be found:

>> B = arrayfun(@(x) {find(A==x)}, 1:4);

In other words, this B should tell me:

>> for ii=1:4, fprintf('Item %d in location %s\n',ii,num2str(B{ii})); end
Item 1 in location 5
Item 2 in location 1  2  3  4
Item 3 in location 6
Item 4 in location 7  8

It's like the 2nd output argument of unique, but instead of the first (or last) occurrence, I want all the occurrences. I think this is called a reverse lookup (where the original key is the array index), but please correct me if I'm wrong.

How can I do it faster?

What I have above gives the correct answer, but it scales terribly with the number of unique values. For a real problem (where A has 10M elements with 100k unique values), even this stupid for loop is 100x faster:

>> B = cell(max(A),1);
>> for ii=1:numel(A), B{A(ii)}(end+1)=ii; end

But I feel like this can't possibly be the best way to do it.

We can assume that A contains only integers from 1 to the max (because if it doesn't, I can always pass it through unique to make it so).


Solution

  • That's a simple task for accumarray:

    out = accumarray(A(:),(1:numel(A)).',[],@(x) {x})  %'
    
    out{1} = 5
    out{2} = 3 4 2 1
    out{3} = 6
    out{4} = 8 7  
    

    However accumarray suffers from not being stable (in the sense of unique's feature), so you might want to have a look here for a stable version of accumarray, if that's a problem.


    Above solution also assumes A to be filled with integers, preferably with no gaps in between. If that is not the case, there is no way around a call of unique in advance:

    A = [2.1 2.1 2.1 2.1 1.1 3.1 4.1 4.1];
    
    [~,~,subs] = unique(A)
    out = accumarray(subs(:),(1:numel(A)).',[],@(x) {x})
    

    To sum up, the most generic solution, working with floats and returning a sorted output could be:

    [~,~,subs] = unique(A)
    [subs(:,end:-1:1), I] = sortrows(subs(:,end:-1:1));  %// optional
    vals = 1:numel(A);                                   
    vals = vals(I);                                      %// optional
    out = accumarray(subs, vals , [],@(x) {x});
    
    out{1} = 5
    out{2} = 1 2 3 4
    out{3} = 6
    out{4} = 7 8  
    

    Benchmark

    function [t] = bench()
        %// data
        a = rand(100);
        b = repmat(a,100);
        A = b(randperm(10000));
    
        %// functions to compare
        fcns = {
            @() thewaywewalk(A(:).');
            @() cst(A(:).');
        }; 
    
        % timeit
        t = zeros(2,1);
        for ii = 1:100;
            t = t + cellfun(@timeit, fcns);
        end
        format long
    end
    
    function out = thewaywewalk(A) 
        [~,~,subs] = unique(A);
        [subs(:,end:-1:1), I] = sortrows(subs(:,end:-1:1));
        idx = 1:numel(A);
        out = accumarray(subs, idx(I), [],@(x) {x});
    end
    function out = cst(A) 
        [B, IX] = sort(A);
        out  = mat2cell(IX, 1, diff(find(diff([-Inf,B,Inf])~=0)));
    end
    

    0.444075509687511  %// thewaywewalk
    0.221888202987325  %// CST-Link
    

    Surprisingly the version with stable accumarray is faster than the unstable one, due to the fact that Matlab prefers sorted arrays to work on.