
Reading a large amount of data stored in lines from a CSV file


I need to read in a lot of data (~10^6 data points) from a *.csv-file.

  • the data is stored in lines
  • I don't know how many data points there are per line, or how many lines there are, before I read the file in
  • the number of data points per line can be different for each line

So the *.csv-file could look like this:

x Header

x1,x2

y Header

y1,y2,y3, ...

z Header

z1,z2

...

Right now I read in every line as a string and split it at every comma. This is what my code looks like:

index = 1;
headerLine = textscan(csvFileHandle,'%s',1,'Delimiter','\n');

while ~isempty(headerLine{1})

    dummy = textscan(csvFileHandle,'%s',1,'Delimiter','\n', ...
                'BufSize',2^31 - 1);
    rawData(index) = textscan(dummy{1}{1},'%f','Delimiter',',');
    headerLine = textscan(csvFileHandle,'%s',1,'Delimiter','\n');

    index = index + 1;
end

It's working, but it's pretty slow. Most of the time (~95%) is spent splitting the string with textscan. I preallocated rawData with sample data, but that made almost no difference to the speed.

Is there a better way than mine to read in something like this?

If not, is there a faster way to split this string?


Solution

  • First suggestion: to read a single line as a string when looping over a file, just use fgetl (returns a nice single string so no faffing with cell arrays).

    Also, you might consider (if possible) reading everything in a single go rather than making repeated reads from the file:

    output = textscan(fid, '%*s%s','Delimiter','\n');  % skips headers with *
    

    If the file is so big that you can't do everything at once, try to read in blocks (e.g. tackle 1000 lines at a time, parsing data as you go).
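
    A minimal sketch of that block-wise approach, assuming the file strictly alternates header and data lines as in the question. The temporary sample file, the variable names and the block size of 2 are illustrative assumptions only (something like 1000 would be more sensible for a real file):

    ```matlab
    % Create a tiny sample file so the sketch runs on its own (hypothetical data).
    fileName = [tempname '.csv'];
    fid = fopen(fileName, 'w');
    fprintf(fid, 'x Header\n1,2\ny Header\n3,4,5\nz Header\n6,7\n');
    fclose(fid);

    blockSize = 2;   % header/data pairs per read (assumed value, tune as needed)
    data = {};       % one numeric column vector per data line

    fid = fopen(fileName, 'r');
    while true
        % Apply the format up to blockSize times; %*s skips each header line.
        block = textscan(fid, '%*s%s', blockSize, 'Delimiter', '\n');
        lines = block{1};
        if isempty(lines)
            break;   % nothing left to read
        end
        for n = 1:numel(lines)
            data{end+1} = sscanf(lines{n}, '%f,'); %#ok<AGROW>
        end
    end
    fclose(fid);
    delete(fileName);
    ```

    Stopping when a block comes back empty avoids relying on feof, which may not be set yet if the number of lines is an exact multiple of the block size.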

    For converting the string, there are the options of str2num or strsplit + str2double, but the only thing I can think of that might be slightly quicker than textscan is sscanf. Since sscanf doesn't accept the delimiter as a separate input, put it in the format string. (True, the last value isn't followed by a comma, but sscanf can handle that.)

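    lines = output{1};              % textscan returns a 1x1 cell holding the cell array of lines
    data = cell(numel(lines), 1);   % preallocate one cell per data line
    for n = 1:numel(lines)
        data{n} = sscanf(lines{n}, '%f,');
    end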
    

    Tests with a limited batch of test data suggest sscanf is a bit quicker (but this might depend on machine, MATLAB version and data size).
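
To check this on your own machine, a rough timing sketch along the following lines could be used. The test line (1e5 random values) and all variable names are made up for illustration, and a single tic/toc measurement is only a crude indicator, so treat the numbers as a hint rather than a benchmark:

```matlab
% Build one hypothetical data line: 1e5 comma-separated values.
values  = rand(1, 1e5);
csvLine = sprintf('%.6f,', values);
csvLine(end) = [];                  % drop the trailing comma

% Variant 1: sscanf with the delimiter inside the format string.
tic;
a = sscanf(csvLine, '%f,');
tSscanf = toc;

% Variant 2: textscan on the string, as in the question.
tic;
c = textscan(csvLine, '%f', 'Delimiter', ',');
b = c{1};
tTextscan = toc;

% Variant 3: strsplit + str2double (transposed to a column vector).
tic;
d = str2double(strsplit(csvLine, ',')).';
tSplit = toc;

fprintf('sscanf: %.4f s, textscan: %.4f s, strsplit+str2double: %.4f s\n', ...
        tSscanf, tTextscan, tSplit);
```

All three variants should return the same column vector, so whichever is fastest for your data can be dropped into the loop unchanged.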