Read and Parse a Text file into a Matrix in Octave

I have a text file with many thousands of rows which look like this

20120601 000000603,1.234610,1.234780,0

where the first two whitespace-separated columns are a date and time representation and the following three comma-separated columns are data. I want to read the text file into an Octave matrix whose columns are split up like this:

2012 06 01 00 00 00 603 1.234610 1.234780 0

I'm sure the textscan function is what I'll need to use, but I don't know the format string to separate things as I want.

Solution

You can use the function fscanf to read the data into a matrix (see "Formatted Input" in the docs, as well as the pages that follow it in the same chapter, "C-Style I/O Functions", for explanations of the format). Because a matrix holds a single numeric type, the integer values are converted to floating point values.

The line format implied by your example is probably the following (minor variations in interpretation are possible: for instance, is the last entry always one digit long, or can it be any decimal integer?):

"%4d%2d%2d %2d%2d%2d%3d,%f,%f,%d"

that reads:

%4d - a four-character decimal integer, followed by
%2d - a two-character decimal integer,
%2d - a two-character decimal integer,
%2d - a two-character decimal integer,
%2d - a two-character decimal integer,
%2d - a two-character decimal integer,
%3d - a three-character decimal integer,
, - a comma,
%f - a floating point number (any number of characters),
, - a comma,
%f - a floating point number (any number of characters),
, - a comma,
%d - a decimal integer (any number of characters)

Note that, unlike the commas (and most other characters), whitespace between fields is ignored in both the format and the matched input text, so one could also have used the format "%4d%2d%2d%2d%2d%2d%3d,%f,%f,%d" (without the space), or "%4d %2d %2d %2d %2d %2d %3d,%f,%f,%d" for better readability.
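As a quick check of the note on whitespace, here is a minimal sketch that parses the sample line with sscanf (the string counterpart of fscanf) using the three format variants mentioned above and compares the results. The variable names txt, a, b and c are only illustrative.

```octave
% One sample line from the question, kept in a string so no file is needed.
txt = '20120601 000000603,1.234610,1.234780,0';

% The same format written three ways: with the single space,
% with no spaces, and with a space between every date/time field.
a = sscanf(txt, '%4d%2d%2d %2d%2d%2d%3d,%f,%f,%d');
b = sscanf(txt, '%4d%2d%2d%2d%2d%2d%3d,%f,%f,%d');
c = sscanf(txt, '%4d %2d %2d %2d %2d %2d %3d,%f,%f,%d');

% isequal returns true only when all three parses agree.
disp(isequal(a, b, c));

% Each result is a 10-element column vector; show it as a row.
disp(a.');
```

Trying a format on a single line with sscanf first is a cheap way to settle the interpretive questions (field widths, the type of the last entry) before reading the whole file.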

If you have a test file input.txt with the following content:

20120601 000000603,1.234610,1.234780,0
20120602 010203604,2.234610,2.234780,11
20120603 000000605,3.234610,3.234780,22

using

fileID = fopen('input.txt','r');
sizeM = [10, Inf];
M = fscanf(fileID, "%4d%2d%2d %2d%2d%2d%3d,%f,%f,%d", sizeM);
fclose(fileID);

([10, Inf] means that the resulting matrix will have 10 rows and an unlimited number of columns) will produce the matrix M:

   2.0120e+03   2.0120e+03   2.0120e+03
   6.0000e+00   6.0000e+00   6.0000e+00
   1.0000e+00   2.0000e+00   3.0000e+00
            0   1.0000e+00            0
            0   2.0000e+00            0
            0   3.0000e+00            0
   6.0300e+02   6.0400e+02   6.0500e+02
   1.2346e+00   2.2346e+00   3.2346e+00
   1.2348e+00   2.2348e+00   3.2348e+00
            0   1.1000e+01   2.2000e+01

where each column contains the 10 values of the corresponding line of the input text, converted to floating point numbers and displayed in scientific notation (2.0120e+03 for 2012.0). Of course, the matrix can be transposed if one wants each input line to occupy a row.
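The transposition can be sketched as follows. To keep the snippet self-contained it writes the three sample lines to a temporary file first; the file name from tempname and the variable names fname, fid and rows are assumptions made for illustration, not part of the original setup.

```octave
% Write the three sample lines to a temporary file (illustrative path).
fname = [tempname() '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '20120601 000000603,1.234610,1.234780,0\n');
fprintf(fid, '20120602 010203604,2.234610,2.234780,11\n');
fprintf(fid, '20120603 000000605,3.234610,3.234780,22\n');
fclose(fid);

% Read it back exactly as above: 10 values per input line, one line per column.
fid = fopen(fname, 'r');
M = fscanf(fid, '%4d%2d%2d %2d%2d%2d%3d,%f,%f,%d', [10, Inf]);
fclose(fid);
delete(fname);

% Transpose so that each input line becomes one row with 10 columns.
rows = M.';
disp(size(rows));

% Print the two floating point columns with six decimals instead of
% relying on the default display format.
fprintf('%.6f %.6f\n', rows(:, 8:9).');
```

Reading with [10, Inf] and transposing afterwards is needed because fscanf fills the result column by column; asking for [Inf, 10] directly is not an option, since only the last dimension may be Inf.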

The function textscan produces a cell array, a heterogeneous structure honouring the different types of the input - in this case integers and floating point numbers, but there can also be strings for instance.

The format should be the same, so one just has to replace the fscanf line above with

M = textscan(fileID, "%4d%2d%2d %2d%2d%2d%3d,%f,%f,%d");