Tags: matlab, file-io, csv, utf-8, textscan

MATLAB: Convert comma separated single cell to multiple cell array whilst maintaining UTF-8 encoding using textscan


From the beginning.

I have data in a csv file like:

La Loi des rues,/m/0gw3lmk,/m/0gw1pvm
L'Étudiante,/m/0j9vjq5,/m/0h6hft_
The Kid From Borneo,/m/04lrdnn,/m/04lrdnt,/m/04lrdn5,/m/04lrdnh,/m/04lrdnb

etc.

The file is UTF-8 encoded. I import it as follows (code taken from elsewhere):

feature('DefaultCharacterSet','UTF-8');
fid = fopen(filename,'rt');             %# Open the file
lineArray = cell(100,1);                %# Preallocate a cell array (ideally slightly
                                        %#   larger than is needed)
lineIndex = 1;                          %# Index of cell to place the next line in
nextLine = fgetl(fid);                  %# Read the first line from the file
while ~isequal(nextLine,-1)             %# Loop while not at the end of the file
    lineArray{lineIndex} = nextLine;    %# Add the line to the cell array
    lineIndex = lineIndex+1;            %# Increment the line index
    nextLine = fgetl(fid);              %# Read the next line from the file
end
fclose(fid);                            %# Close the file

This produces a 3x1 cell array with the UTF-8 text intact:

'La Loi des rues,/m/0gw3lmk,/m/0gw1pvm'
'L''Étudiante,/m/0j9vjq5,/m/0h6hft_'
'The Kid From Borneo,/m/04lrdnn,/m/04lrdnt,/m/04lrdn5,/m/04lrdnh,/m/04lrdnb'

Now the next part splits each line into its separate values:

lineArray = lineArray(1:lineIndex-1);              %# Remove empty cells, if needed
for iLine = 1:lineIndex-1                          %# Loop over lines
    lineData = textscan(lineArray{iLine},'%s',...  %# Read strings
                        'Delimiter',',');
    lineData = lineData{1};                        %# Remove cell encapsulation
    if strcmp(lineArray{iLine}(end),',')           %# Account for when the line
        lineData{end+1} = '';                      %#   ends with a delimiter
    end
    lineArray(iLine,1:numel(lineData)) = lineData; %# Overwrite line data
end

This outputs:

'La Loi des rues'   '/m/0gw3lmk'    '/m/0gw1pvm'    []  []  []
'L''�tudiante'  '/m/0j9vjq5'    '/m/0h6hft_'    []  []  []
'The Kid From    Borneo'    '/m/04lrdnn'    '/m/04lrdnt'    '/m/04lrdn5'    '/m/04lrdnh'    '/m/04lrdnb'

The problem is that the UTF-8 encoding is lost in the textscan step (note the replacement character (�) I now get, whereas the text was fine in the previous array).

Question: How do I maintain the UTF-8 encoding when converting the 3x1 cell array into a 3xN cell array?

I can't find anything on how to keep UTF-8 encoding when running textscan on an array that is already in the workspace. Everything I can find deals with importing a text file, which I have no problem with; it is the second step that fails.
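
To show where the characters get mangled, here is a minimal diagnostic sketch (not part of my actual import code; it is run on lineArray right after the first read loop, before the splitting loop overwrites it, and it assumes the É sits in position 3 of the second example line). It compares the character codes before and after the textscan call:

%# inspect the character codes of the second line before textscan...
before = lineArray{2};
double(before(1:4))                    %# 201 at the 'É' position means the char is intact

%# ...and after textscan has split the line
parsed = textscan(lineArray{2}, '%s', 'Delimiter',',');
after  = parsed{1}{1};                 %# first value: the title
double(after(1:4))                     %# 65533 here would be the Unicode replacement character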

Thanks!


Solution

  • Try the following code:

    %# read whole file as a UTF-8 string
    fid = fopen('utf8.csv', 'rb');
    b = fread(fid, '*uint8')';
    str = native2unicode(b, 'UTF-8');
    fclose(fid);
    
    %# split into lines
    lines = textscan(str, '%s', 'Delimiter','', 'Whitespace','\n');
    lines = lines{1};
    
    %# split each line into values
    C = cell(numel(lines),6);
    for i=1:numel(lines)
        vals = textscan(lines{i}, '%s', 'Delimiter',',');
        vals = vals{1};
        C(i,1:numel(vals)) = vals;
    end
    

    The result:

    >> C
    C = 
        'La Loi des rues'        '/m/0gw3lmk'    '/m/0gw1pvm'              []              []              []
        'L'Étudiante'            '/m/0j9vjq5'    '/m/0h6hft_'              []              []              []
        'The Kid From Borneo'    '/m/04lrdnn'    '/m/04lrdnt'    '/m/04lrdn5'    '/m/04lrdnh'    '/m/04lrdnb'
    

    Note that when I tested this, I encoded the input CSV file as "UTF-8 without BOM" (I was using Notepad++ as the editor).
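
    As an aside, depending on your MATLAB release you may be able to let fopen do the decoding itself by passing the encoding as its fourth argument, instead of reading raw bytes and calling native2unicode. The following is only a sketch of that variant (untested against the setup in the question); it still splits each line with textscan as above:

    %# open the file with an explicit UTF-8 encoding ('n' = native byte ordering)
    fid = fopen('utf8.csv', 'rt', 'n', 'UTF-8');

    C = cell(0,6);                             %# grows row by row
    i = 0;
    nextLine = fgetl(fid);
    while ischar(nextLine)                     %# fgetl returns -1 at end of file
        i = i + 1;
        vals = textscan(nextLine, '%s', 'Delimiter',',');
        vals = vals{1};
        C(i,1:numel(vals)) = vals;             %# shorter rows are padded with []
        nextLine = fgetl(fid);
    end
    fclose(fid);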