Search code examples
matlaboptimizationcastingtype-conversionmatlab-table

Fastest way to convert numerical table values to character arrays


I have a table which contains various data types. For instance,

chars = {'a'; 'bc'; 'de'};
nums = [1; 20; 300];
tbl = table(chars, nums);

I want to convert all numerical columns to character array variables. By this, I mean the result should be equivalent to if I had initialised the table using:

nums = {'1'; '20', '300'};
tbl = table(chars, nums); 

I want to do this as fast as possible because I have a table with potentially millions of rows.

Current working solutions:

Firstly, I get the numerical columns, in the case of this example this will be column 2...

% Get the columns numbers which are numerical. This works fine, although a bit clunky.
numcols = find(varfun(@isnumeric, tbl(1,:), 'output', 'uniform'));    
colnames = tbl.Properties.VariableNames(numcols); % Get the corresponding column names

Then, I have a couple of methods which convert those numerical columns to character array types, they both involve looping over the numerical columns and using some other loop-in-disguise ___fun function...

  1. arrayfun

    for ii = 1:numel(colnames)
        % This arrayfun is *slow* for large tables
        tbl.(colnames{ii}) = arrayfun( @num2str, tbl.(colnames{ii}), 'uniformoutput', 0 );
    end
    
  2. num2cell and cellfun

    for ii = 1:numel(colnames)
        % num2cell is relatively quick
        tbl.(colnames{ii}) = num2cell( tbl.(colnames{ii}) ); 
        % cellfun is about as slow as arrayfun, as might be expected
        tbl.(colnames{ii}) = cellfun( @num2str, tbl.(colnames{ii}), 'uniformoutput', 0 );
    end
    

Speed test below, note I'm only doing one column and 1e5 elements, whilst in reality I want to do many columns and potentially ten times more rows, so you can clearly see the speed issues...

% Setup
nums = (1:1e5).'; tbl = table(nums);
% Functions
f1 = @() arrayfun( @num2str, tbl.nums, 'uni', 0 );
f2 = @() cellfun( @num2str, num2cell(tbl.nums), 'uni', 0 );
% Timing
timeit(f1) % >> 5.15 sec
timeit(f2) % >> 5.16 sec

You can see the methods are basically equivalent, as probably expected from their similarity.

Does anyone know a faster way to convert all data in a table to character array variable type? I had thought to go via categoricals, but not sure how to proceed there.

I'd prefer a solution compatible with R2015b.

Note: It is relevant that my data contains mixed types, since I cannot (and wouldn't want to) use varfun( ... ) on the entire table.


Solution

  • You can use sprintf to convert to large char array and then read that back using strread to put in a cell array:

    % Setup
    nums = (1:1e5).'; tbl = table(nums);
    % Functions
    f1 = @() arrayfun( @num2str, tbl.nums, 'uni', 0 );
    f2 = @() cellfun( @num2str, num2cell(tbl.nums), 'uni', 0 );
    f3 = @() strread ( sprintf ( '%i\n', tbl.nums ), '%s', 'delimiter', '\n' );
    f4 = @() textscan ( sprintf ( '%i\n', tbl.nums ), '%s', 'delimiter', '\n' );
    % Timing
    timeit(f1) % 
    timeit(f2) % 
    timeit(f3) % 
    timeit(f4) %
    
    
    r1 = feval(f1);
    r2 = feval(f2);
    r3 = feval(f3);
    r4 = feval(f4);
    
    % check they are equal
    isequal ( r1, r2 )
    isequal ( r3, r1 )
    isequal ( r4{1}, r1 )
    

    On my computer with R2015b I get:

    f1 -> 3.78 seconds
    f2 -> 3.79 seconds
    f3 -> 0.10 seconds
    f4 -> 0.07 seconds
    

    And as per the isequal checks they are all the same.

    If your data is non integer -> you will need to change the sprintf statement, I put it as integer to ensure that the isequal verified that the results were the same.