arrays matlab indexing cell-array equivalence-classes

Form array of equivalence related classes

I have an array in Matlab. I numbered every entry in array with natural number. So I formed equivalence relation in array.

For example,

array   = [1 2 3 5 6 7]
classes = [1 2 1 1 3 3].

I want to get cell array: i-th cell array's position is connected with i-th entry of initial array and shows, which elements are in the one class with this entry. For the example above, I would get:

{[1 3 5], [2], [1 3 5], [1 3 5], [6 7], [6 7]}

It can be done easily with for-loop, but is there any other solution? It will be good if it works faster than O(n^2), where n is the size of initial array.

Edit. Problem will be solved, if I know the approach to split sorted array into cells with indeces of equal elements by O(n).

array  = [1 1 1 2 3 3]
groups = {[1 2 3], [4], [5 6]}

Solution

I don't know how to do this without for-loops entirely, but you can use a combination of sort, diff, and find to organize and partition the equivalence class identifiers. That'll give you a mostly vectorized solution, where the M-code level for-loop is O(n) where n is the number of classes, not the length of the whole input array. This should be pretty fast in practice.

Here's a rough example using some index munging. Be careful; there's probably an off-by-one edge case bug in there somewhere since I just banged this out.

function [eqVals,eqIx] = equivsets(a,x)
%EQUIVSETS Find indexes of equivalent values

[b,ix] = sort(x);
ixEdges = find(diff(b)); % identifies partitions between equiv classes
ix2 = [0 ixEdges numel(ix)];
eqVals = cell([1 numel(ix2)-1]);
eqIx = cell([1 numel(ix2)-1]);
% Map back to original input indexes and values
for i = 1:numel(ix2)-1
    eqIx{i} = ix((ix2(i)+1):ix2(i+1));
    eqVals{i} = a(eqIx{i}); 
end

I included the indexes in the output because they're often more useful than the values themselves. You'd call it like this.

% Get indexes of occurrences of each class
equivs = equivsets(array, classes)
% You can expand that to get equivalences for each input element
equivsByValue = equivs(classes)

It's a lot more efficient to build the lists for each class first and then expand them out to match the input indexes. Not only do you have to do the work just once, but when you use the b = a(ix) to expand a small cell array to a larger one, Matlab's copy-on-write optimization will end up reusing the memory for the underlying numeric mxArrays so you get a more compact representation in memory.

This transformation pops up a lot when working with unique() or databases. For decision support systems and data warehouse style things I've worked with, it happens all over the place. I wish it were built in to Matlab. (And maybe it's been added to one of the db or timeseries toolboxes in recent years; I'm a few versions behind.)

Realistically, if performance of this is critical for your code, you might also look at dropping down to Java or C MEX functions and implementing it there. But if your data sets are low cardinality - that is, have a small number of classes/distinct values, like numel(unique(classes)) / numel(array) tends to be less than 0.1 or so - the M-code implementation will probably be just fine.