Tags: string, matlab, cell-array, unique, index

Matlab cell arrays of strings: finding repeated strings and manipulating corresponding data


I have a large set of data (~1 million entries), stored as a cell array with a number of columns and many, many rows. My issue is that I need to identify entries that occur at the same time, and then manipulate the other columns so that I can remove the rows with repeated times without losing all of the information.

An example of a subset of such data could be initialized like so:

data = {'10:30', 100; '10:30', 110; '10:31', 115; '10:32', 110}

That is, I have a cell array with one column of strings (representing times) and another column of doubles (many such columns in the real data).

My code should notice the repeated 10:30 (there could be many such repeats), take the corresponding doubles (100 and 110) as inputs for some function, f(100,110), and then remove the repeated row from the data.

I.e. if the function were, say, to average, I should have an output that looks something like

data =
       '10:30' [105]
       '10:31' [115]
       '10:32' [110]

This would be fairly simple if loops were fast enough, but with my data set there is no point in even attempting a looping solution.

I have gotten as far as

[uniqueElements, firstUniquePosition, commonSets] = unique(data(:,1));

after much fiddling around, which yields some information that appears useful,

uniqueElements = 

    '10:30'
    '10:31'
    '10:32'

firstUniquePosition =

     1
     3
     4

commonSets =

     1
     1
     2
     3
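
If I understand unique's three outputs correctly, they relate back to the original data as follows (both of these checks should return true on the example above):

isequal(uniqueElements, data(firstUniquePosition,1))  % the unique values are the first occurrences
isequal(uniqueElements(commonSets), data(:,1))        % commonSets maps every row to its unique value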

But I can't quite figure out how to make a vectorised statement that allows me to manipulate the elements with common times.

I imagine it will involve cellfun at some point, but I don't know enough of MATLAB's functionality to implement it yet without a push in the right direction.


Solution

  • That is a job for accumarray:

    [times,~,subs] = unique(data(:,1));  % subs maps every row to its unique time
    idx = 1:size(data,1);                % row indices, handed to accumarray below
    meanOfCommonTimes = accumarray(subs(:),idx(:),[],@(x) mean( [data{x,2}] ))
    
    output = [times num2cell(meanOfCommonTimes)]
    

    output = 
    
        '10:30'    [105]
        '10:31'    [115]
        '10:32'    [110]
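
    Since mean is only an example here, the same accumarray pattern works with any reduction over the grouped values, and because the real data has several value columns it can be applied column by column. A minimal sketch, assuming the times sit in column 1 and every remaining column holds doubles:

    valCols = 2:size(data,2);  % assumed layout: values in columns 2..end
    aggregated = cell(numel(times), numel(valCols));
    for k = 1:numel(valCols)
        col = valCols(k);
        aggregated(:,k) = num2cell(accumarray(subs(:), idx(:), [], ...
            @(x) mean([data{x,col}])));
    end
    output = [times aggregated];

    The inner loop here runs over the handful of columns, not over the million rows, so it stays cheap.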
    

    Talking about 1 million elements and performance: consider storing your time data as numeric values using the datenum function.

    times = datenum(data(:,1),'HH:MM');  % HH = hours, MM = minutes (lowercase mm would mean months)
    

    and also keep your data in double arrays:

    vals = cell2mat(data(:,2));
    

    The calculations will then be up to 10 times faster!

    [~,~, subs] = unique(times);
    meanOfCommonTimes = accumarray(subs(:),vals(:),[],@mean);
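
    If you still need the readable time labels afterwards, one possible sketch formats the numeric times back into strings (unique(times) is sorted, which matches the group order produced by subs):

    output = [cellstr(datestr(unique(times),'HH:MM')) num2cell(meanOfCommonTimes)];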
    

    But be aware that the conversion itself also takes quite some time. If you do a lot of calculations later on, it could be worth it.


    Benchmark

    function [t] = bench()
        data = {'10:30', 100; '10:30', 110; '10:31', 115; '10:32', 110};
        data = repmat(data, 200000, 1); % replicate the 4-row example cell array to 800000 rows
    
        % functions to compare
        fcns = {
            @() thewaywewalk(data);
            @() Cecilia(data);
        };
    
        thewayw = timeit(fcns{1})
        Ceci = timeit(fcns{2})
    end
    
    function Z = Cecilia(data)
        [uniqueElements, ~, commonSets] = unique(data(:,1));
    
        num_unique = length(uniqueElements);
        Z = zeros(num_unique, 1);
        for ii = 1:num_unique
            Z(ii) = mean([data{commonSets==ii, 2}]);
        end
    end
    function Z = thewaywewalk(data)
        [~,~,subs] = unique(data(:,1));
        idx = 1:size(data,1);
        Z = accumarray(subs(:),idx(:),[],@(x) mean( [data{x,2}] ));
    end
    

    The results are almost equal for an array with 800000 rows.

    thewayw =  1.1483
    Ceci = 1.0957
    

    But again, accumarray would profit greatly from a prior conversion to double, while the performance of the loop should stay about the same in this case.
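
    For completeness, the numeric variant could be benchmarked with a helper like the following (hypothetical, not part of the original benchmark; the datenum/cell2mat conversion would be done once, outside the timed call):

    function Z = numericVersion(times, vals)
        % same accumarray reduction as above, but on plain double inputs
        [~,~,subs] = unique(times);
        Z = accumarray(subs(:), vals(:), [], @mean);
    end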