Search code examples
arraysmatlabcsvparsingcells

How to convert strings in a cell array to numerical stand-in values?


In Matlab, I have a cell array of cell arrays created using textscan on a tab-separated data file. Some columns of the data set are strings. Here is an example array, 'data':

data{1,1} = {1; 2; 3; 4; 5};
data{1,2} = {11; 12; 13; 14; 15};
data{1,3} = {'Ringo'; 'Paul'; 'The Beatles'; 'John'; 'George'};

I want to convert the string columns to specific numerical stand-in values (i.e. 'The Beatles' = 1, 'John' = 2, 'Paul' = 3, 'George' = 4, 'Ringo' = 5). Also, it may be important to note that some cells contain more than one "word".

For some context, the entire data set will ultimately be converted from a cell array into one neatly packaged double array (data = cell2mat(data)) for easier manipulation.

The only solution I know is to loop through the string array and use a switch statement:

vec_0 = data{1,3};
for ii = 1:length(vec_0)
    switch vec_0{ii}
        case 'The Beatles'
            vec_f{ii,1} = 1;
        case 'John'
            vec_f{ii,1} = 2;
        case 'Paul'
            vec_f{ii,1} = 3;
        case 'George'
            vec_f{ii,1} = 4;
        case 'Ringo'
            vec_f{ii,1} = 5;
    end
end

% Replace string column with numerical stand-ins
data{1,3} = vec_f;

% Convert cell array to double array
data = cell2mat(data);

Now we have:

data = [1 11 5; 2 12 3; 3 13 1; 4 14 2; 5 15 4];

What is a more optimal way to do this?


Solution

  • If you have a specific mapping of strings to numeric values...

    Let's say you have a mapping of strings to numeric values as defined in the following 5-by-2 cell array (one mapping per row):

    numMap = {'The Beatles', 1; 'John' , 2; 'Paul' , 3; 'George' , 4; 'Ringo' , 5};
    

    Then you can use ismember to convert the strings to their mapped numeric values and save it back in data like so:

    [~, index] = ismember(data{1, 3}, numMap(:, 1));
    data{1, 3} = numMap(index, 2);
    

    If you need to generate a mapping of strings to numeric values...

    You can generate an array of numerical stand-ins with unique, convert that to a cell array with num2cell, and save it back in data like so:

    [uniqueStrings, ~, numIndex] = unique(data{1, 3}, 'stable');
    data{1, 3} = num2cell(numIndex);
    

    And uniqueStrings will contain the unique multi-word strings from data{1, 3}.

    Converting to a double array:

    Once you've chosen one of the above options, you can then convert your sample data to a 5-by-3 double array like so:

    data = cell2mat([data{:}]);
    
    data =
         1    11     5
         2    12     3
         3    13     1
         4    14     2
         5    15     4