
Passing argument to cellfun matlab


Hello, I have a cell array of char vectors (numbers separated by underscores) that I would like to convert to double. I currently do it in a for loop, but because the array is very large, it takes a long time. I would like to use cellfun, but I don't know how to pass the delimiter to it.

Can you help me?

listofwords = {'02_04_04_52';'02_24_34_02'};
for i = 1 : size(listofwords,1)
    listofwords_double(i,:) = str2double(strsplit(listofwords{i},'_'))./1000;
end

listofwords_double2= cellfun(@strsplit , listofwords);

Benchmark

As requested by Divakar

>> benchmark1
Speedup with EVAL over NO-LOOP-SSCANF = -46.3398%
>> benchmark1
Speedup with EVAL over NO-LOOP-SSCANF = -46.4068%
>> benchmark1
Speedup with EVAL over NO-LOOP-SSCANF = -47.1129%
>> benchmark1
Speedup with EVAL over NO-LOOP-SSCANF = -46.2882%
>> benchmark1
Speedup with EVAL over NO-LOOP-SSCANF = -46.2325%
>> benchmark1
Speedup with EVAL over NO-LOOP-SSCANF = -46.0161%
>> benchmark1
Speedup with EVAL over NO-LOOP-SSCANF = -46.9728%
>> benchmark1
Speedup with EVAL over NO-LOOP-SSCANF = -46.4267%
>> benchmark1
Speedup with EVAL over NO-LOOP-SSCANF = -46.2867%
>> benchmark1
Speedup with EVAL over NO-LOOP-SSCANF = -46.3031%

Solution

  • You can use an anonymous function like this -

    listofwords_double2= cellfun(@(x) strsplit(x,'_') , listofwords,'uni',0)
    

    Another approach, a one-liner with regexp that also converts to double -

    cell2mat(cellfun(@(x) str2double(regexp(x,'_','Split'))./1000 , listofwords,'uni',0))
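
    The anonymous function is what carries the extra argument: cellfun only ever passes the cell contents, while the delimiter is captured when the handle is created. A minimal sketch of that idea on the sample data follows; the variable names delim, splitter and rows are illustrative only, and strsplit is assumed to be available (R2013a or later) -

    ```matlab
    listofwords = {'02_04_04_52'; '02_24_34_02'};
    delim = '_';   % illustrative name; its value is captured by the handle below

    % cellfun calls splitter(x) once per cell; delim is fixed at handle creation.
    splitter = @(x) str2double(strsplit(x, delim)) ./ 1000;

    % Each call returns a 1-by-4 row, so the results are collected in a cell ...
    rows = cellfun(splitter, listofwords, 'UniformOutput', false);

    % ... and then stacked into a 2-by-4 numeric matrix.
    listofwords_double2 = vertcat(rows{:});
    ```

    'UniformOutput', false is the spelled-out form of the 'uni',0 shorthand used above.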
    

    Performance oriented solutions

    Approach #1

    N = 4; %// Edit this to 10 in your actual case
    cat_cell = strcat(listofwords,'_');
    one_str = [cat_cell{:}];
    one_str(end)=[];
    sep_cells = regexp(one_str,'_','Split');
    out = reshape(str2double(sep_cells),N,[]).'./1000; %// desired output
    

    Approach #2

    Benchmarking the above solution suggests that strcat could be the bottleneck. To avoid it, the joined string can be built with a cumsum-based indexing approach instead. This is listed next -

    N = 4; %// Edit this to 10 in your actual case
    
    lens = cellfun(@numel,listofwords);
    tlens = sum(lens);
    idx = zeros(1,tlens); %// Edit this to "idx(1,tlens)=0;" for more performance
    idx(cumsum(lens(1:end-1))+1)=1;
    idx2 = (1:tlens) + cumsum(idx);
    
    one_str(1:max(idx2))='_';
    one_str(idx2) = [listofwords{:}];
    
    sep_cells = regexp(one_str,'_','Split');
    out = reshape(str2double(sep_cells),N,[]).'./1000; %// desired output
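
    To see what the index bookkeeping in Approach #2 does, here is a trace on the two-word sample input. It is only a sketch for illustration; the one difference from the code above is that one_str is preallocated explicitly with repmat, so the snippet does not depend on any one_str left over in the workspace -

    ```matlab
    listofwords = {'02_04_04_52'; '02_24_34_02'};

    lens  = cellfun(@numel, listofwords);    % [11; 11]
    tlens = sum(lens);                       % 22 characters in total
    idx   = zeros(1, tlens);
    idx(cumsum(lens(1:end-1)) + 1) = 1;      % marks position 12, the start of word 2
    idx2  = (1:tlens) + cumsum(idx);         % 1..11, then 13..23: slot 12 is skipped

    one_str = repmat('_', 1, max(idx2));     % 23 underscores
    one_str(idx2) = [listofwords{:}];        % slot 12 keeps its '_' as the separator
    disp(one_str)                            % 02_04_04_52_02_24_34_02
    ```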
    

    Approach #3

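    This one uses sscanf and appears to be really fast. Note that one_str gets one extra trailing '_' so that every number, including the last, matches the '%d_' format. Here's the code -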

    N = 4; %// Edit this to 10 in your actual case
    lens = cellfun(@numel,listofwords);
    tlens = sum(lens);
    idx(1,tlens)=0;
    idx(cumsum(lens(1:end-1))+1)=1;
    idx2 = (1:tlens) + cumsum(idx);
    
    one_str(1:max(idx2)+1)='_';
    one_str(idx2) = [listofwords{:}];
    delim = repmat('%d_',1,N*numel(lens));
    out = reshape(sscanf(one_str, delim),N,[])'./1000; %// desired output
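
    As a side note, sscanf reapplies its format until the input is exhausted, so the repeated format string is not strictly required. A shorter variant is sketched below under two assumptions: strjoin is available (R2013a or later), and the joining step is fast enough for your data. This variant has not been benchmarked here -

    ```matlab
    listofwords = {'02_04_04_52'; '02_24_34_02'};
    N = 4;                                   % numbers per word, as in the approaches above

    % Join all words with '_' (strjoin expects a row cell, hence the reshape).
    one_str = strjoin(reshape(listofwords, 1, []), '_');

    % sscanf reapplies '%d_' until the text runs out; the last number has no
    % trailing '_', and sscanf simply stops there.
    vals = sscanf(one_str, '%d_');
    out  = reshape(vals, N, []).' ./ 1000;
    ```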
    

    Benchmarking

    As requested by @CST-Link, here's the benchmark comparing his "Kraken" eval against approach #3. The benchmarking code would look something like this -

    clear all
    
    listofwords = repmat({'02_04_04_52_23_14_54_672_0'},100000,1);
    for k = 1:50000
        tic(); elapsed = toc(); %// Warm up tic/toc
    end
    
    tic
    N = 9; %// Edit this to 10 in your actual case
    lens = cellfun(@numel,listofwords);
    tlens = sum(lens);
    idx(1,tlens)=0;
    idx(cumsum(lens(1:end-1))+1)=1;
    idx2 = (1:tlens) + cumsum(idx);
    
    one_str(1:max(idx2)+1)='_';
    one_str(idx2) = [listofwords{:}];
    delim = repmat('%d_',1,N*numel(lens));
    out = reshape(sscanf(one_str, delim),N,[])'./1000; %// desired output
    time1 = toc;
    clear out delim one_str idx2 idx tlens lens N
    
    tic
    n_numbers = 1+sum(listofwords{1}=='_');
    n_words   = numel(listofwords);
    listofwords_double = zeros(n_numbers, n_words);
    for i = 1:numel(listofwords)
            temp = ['[', listofwords{i}, ']'];
            temp(temp=='_') = ';';
            listofwords_double(:,i) = eval(temp);
    end;
    listofwords_double = (listofwords_double / 1000).';
    time2 = toc;
    speedup = ((time1-time2)/time2)*100;
    disp(['Speedup with EVAL over NO-LOOP-SSCANF = ' num2str(speedup) '%'])
    

    And here are the benchmark results from running the code a few times -

    >> benchmark1
    Speedup with EVAL over NO-LOOP-SSCANF = 0.30609%
    >> benchmark1
    Speedup with EVAL over NO-LOOP-SSCANF = 0.012241%
    >> benchmark1
    Speedup with EVAL over NO-LOOP-SSCANF = -2.3146%
    >> benchmark1
    Speedup with EVAL over NO-LOOP-SSCANF = 0.33678%
    >> benchmark1
    Speedup with EVAL over NO-LOOP-SSCANF = -1.8189%
    >> benchmark1
    Speedup with EVAL over NO-LOOP-SSCANF = -0.12254%
    

    The results show a mix of small positive and negative speedups (negative meaning sscanf was faster in that run), all within a few percent, so my opinion would be to stick with sscanf.
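
    Since single tic/toc runs scatter by a few percent, a steadier comparison could use timeit, which calls each function several times and reports a typical time. The following is a sketch only, assuming timeit is available (R2013b or later) and that the file is saved as benchmark_timeit.m; the function names benchmark_timeit, sscanf_version and eval_version and the n_words parameter are hypothetical, and the two helpers just wrap the code from the tic/toc script above -

    ```matlab
    function speedup = benchmark_timeit(n_words)
    % Hypothetical helper: compares both variants with timeit instead of tic/toc.
    if nargin < 1
        n_words = 100000;                    % same size as in the script above
    end
    listofwords = repmat({'02_04_04_52_23_14_54_672_0'}, n_words, 1);
    N = 9;                                   % numbers per word in this test string

    t_sscanf = timeit(@() sscanf_version(listofwords, N));
    t_eval   = timeit(@() eval_version(listofwords));

    % Same definition as in the script above: positive means eval is faster.
    speedup = (t_sscanf - t_eval) / t_eval * 100;
    fprintf('Speedup with EVAL over NO-LOOP-SSCANF = %g%%\n', speedup);
    end

    function out = sscanf_version(listofwords, N)
    % Approach #3, with explicit preallocation of idx and one_str.
    lens  = cellfun(@numel, listofwords);
    tlens = sum(lens);
    idx   = zeros(1, tlens);
    idx(cumsum(lens(1:end-1)) + 1) = 1;
    idx2  = (1:tlens) + cumsum(idx);
    one_str = repmat('_', 1, max(idx2) + 1);
    one_str(idx2) = [listofwords{:}];
    delim = repmat('%d_', 1, N * numel(lens));
    out = reshape(sscanf(one_str, delim), N, []).' ./ 1000;
    end

    function out = eval_version(listofwords)
    % Per-word eval loop, as in the second half of the script above.
    n_numbers = 1 + sum(listofwords{1} == '_');
    out = zeros(n_numbers, numel(listofwords));
    for i = 1:numel(listofwords)
        temp = ['[', listofwords{i}, ']'];
        temp(temp == '_') = ';';
        out(:, i) = eval(temp);
    end
    out = (out / 1000).';
    end
    ```

    Calling benchmark_timeit() with no argument uses the same 100000 words as the script above; a smaller value such as benchmark_timeit(1000) gives a quicker, rougher check.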