Starting from the beginning: I have data in a CSV file like this:
La Loi des rues,/m/0gw3lmk,/m/0gw1pvm
L'Étudiante,/m/0j9vjq5,/m/0h6hft_
The Kid From Borneo,/m/04lrdnn,/m/04lrdnt,/m/04lrdn5,/m/04lrdnh,/m/04lrdnb
etc.
This is in UTF-8 format. I import this file as follows (taken from somewhere else):
feature('DefaultCharacterSet','UTF-8');
fid = fopen(filename,'rt'); %# Open the file
lineArray = cell(100,1); %# Preallocate a cell array (ideally slightly
%# larger than is needed)
lineIndex = 1; %# Index of cell to place the next line in
nextLine = fgetl(fid); %# Read the first line from the file
while ~isequal(nextLine,-1) %# Loop while not at the end of the file
    lineArray{lineIndex} = nextLine; %# Add the line to the cell array
    lineIndex = lineIndex+1; %# Increment the line index
    nextLine = fgetl(fid); %# Read the next line from the file
end
fclose(fid); %# Close the file
This produces a {3x1} cell array with the UTF-8 text intact:
'La Loi des rues,/m/0gw3lmk,/m/0gw1pvm'
'L''Étudiante,/m/0j9vjq5,/m/0h6hft_'
'The Kid From Borneo,/m/04lrdnn,/m/04lrdnt,/m/04lrdn5,/m/04lrdnh,/m/04lrdnb'
Now the next part separates each value into an array:
lineArray = lineArray(1:lineIndex-1); %# Remove empty cells, if needed
for iLine = 1:lineIndex-1 %# Loop over lines
    lineData = textscan(lineArray{iLine},'%s',... %# Read strings
        'Delimiter',',');
    lineData = lineData{1}; %# Remove cell encapsulation
    if strcmp(lineArray{iLine}(end),',') %# Account for when the line
        lineData{end+1} = ''; %# ends with a delimiter
    end
    lineArray(iLine,1:numel(lineData)) = lineData; %# Overwrite line data
end
This outputs:
'La Loi des rues' '/m/0gw3lmk' '/m/0gw1pvm' [] [] []
'L''�tudiante' '/m/0j9vjq5' '/m/0h6hft_' [] [] []
'The Kid From Borneo' '/m/04lrdnn' '/m/04lrdnt' '/m/04lrdn5' '/m/04lrdnh' '/m/04lrdnb'
The problem is that the UTF-8 encoding is lost in the textscan step (note the question mark character I now get in place of É, whereas it was fine in the previous array).
Question: how do I preserve the UTF-8 encoding when converting the {3x1} array into a 3xN array?
I can't find anything on how to keep the UTF-8 encoding when calling textscan on an array that is already in the workspace. Everything I find deals with importing a text file, which I have no problems with; it is the second step that fails.
Thanks!
Try the following code:
%# read whole file as a UTF-8 string
fid = fopen('utf8.csv', 'rb');
b = fread(fid, '*uint8')';
str = native2unicode(b, 'UTF-8');
fclose(fid);
%# split into lines
lines = textscan(str, '%s', 'Delimiter','', 'Whitespace','\n');
lines = lines{1};
%# split each line into values
C = cell(numel(lines),6);
for i=1:numel(lines)
    vals = textscan(lines{i}, '%s', 'Delimiter',',');
    vals = vals{1};
    C(i,1:numel(vals)) = vals;
end
The result:
>> C
C =
'La Loi des rues' '/m/0gw3lmk' '/m/0gw1pvm' [] [] []
'L'Étudiante' '/m/0j9vjq5' '/m/0h6hft_' [] [] []
'The Kid From Borneo' '/m/04lrdnn' '/m/04lrdnt' '/m/04lrdn5' '/m/04lrdnh' '/m/04lrdnb'
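If you want to confirm that the accented character really was decoded (rather than just displayed correctly by chance), one option is to look at the numeric character codes. The snippet below is only a small sketch of that check: the byte values are hand-written sample data standing in for what fread returns, and the variable name codes is made up for illustration.

```matlab
%# UTF-8 bytes for the text L'Ét (É is stored as the two bytes C3 89)
b = uint8([76 39 195 137 116]);

%# decode the raw bytes explicitly as UTF-8
str = native2unicode(b, 'UTF-8');

%# inspect the decoded characters as numbers; É should be one
%# character with code 201 (U+00C9), not two separate bytes
codes = double(str);
disp(codes)
```

Five bytes go in and four characters come out, because the two-byte sequence for É collapses into a single character.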
Note that when I tested this, the input CSV file was encoded as "UTF-8 without BOM" (I was using Notepad++ as the editor).
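If your file was saved with a BOM, the first three bytes will be EF BB BF, and I have not tested how the code above behaves in that case. A possible workaround, assuming the decoding step does not drop the BOM for you, is to strip those bytes before calling native2unicode. This is only a sketch: the sample bytes stand in for the output of fread, and the variable name bom is made up for illustration.

```matlab
%# the UTF-8 byte order mark: EF BB BF
bom = uint8([239 187 191]);

%# sample data standing in for the bytes returned by fread
b = [bom, uint8('a,b')];

%# drop the BOM if the data starts with it
if numel(b) >= 3 && isequal(b(1:3), bom)
    b = b(4:end);
end

%# decode the remaining bytes as UTF-8
str = native2unicode(b, 'UTF-8');
```

With a file that has no BOM the if-block does nothing, so it should be safe to leave in either way.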