Search code examples
matlabcsvimporttextscan

Cannot import all values at once in Matlab using TextScan


I have a dataset of 1911518 values. I have used the textscan function. But the function returns only 1424458 values. I again created a new variable to get the remaining values but this time it gave me around 9000 values. Does anyone know what the reason for this can be?

n=9
N=1911518

file_id=fopen('CRSP.csv');

C=textscan(file_id,'%s',n,'delimiter', ','); %To get the headers
C_text=textscan(file_id,'%s %s %s %d %d %d %d %f %f',N, 'delimiter' , ','); 

%Returns 1424458

C_text1=textscan(file_id,'%s %s %s %d %d %d %d %f %f',N, 'delimiter' , ','); 

%Returns only 9000 values    

fclose(file_id);

Sample Data

DATE,COMNAM,TICKER,PERMNO,PERMCO,SHROUT,VOL,RET,sprtrn

01/02/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 23700,  0.000000,  0.007793
01/03/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 20800,  0.020000,  0.000950
01/04/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 65300, -0.026144, -0.005826
01/05/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 340600, 0.000000, -0.001587
01/08/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 3400,   0.000000,  0.002821
01/09/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 27200, -0.006711, -0.014568
01/10/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 25400, -0.033784, -0.018000
01/11/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 14000,  0.000000,  0.007034

Solution

  • I would assume, that there is some error in the data, or format pattern does not match the data. Try to extract these lines:

    file_id=fopen('CRSP.csv');
    for idx=1:1424456
        fgetl(file_id); %dump data
    end
    for idx=1:10
        fprintf('%s\n',fgetl(file_id));
    end
    

    If there is an error, it should be at the 2rd or 3nd printed line. Anything special there? Maybe a COMNAM with some special character?

    To read the file, i would use the following code to read line by line:

    file_id=fopen('CRSP.csv');
    line=fgetl(file_id);
    data={};
    int ix=1;
    while(ischar(line))
        [parsed,sindex,eindex] = regexpi(line,'(\d\d/\d\d/\d\d\d\d)\s*, ([\w ]+), ([\w ]+), ([\d]+), ([\d]+), ([\d]+), ([\d]+), ([\d \.]+), ([\d \.]+)','tokens')
        if ~isempty(sindex)&&numel(sindex)==1&&(sindex==1)&&(eindex==numel(x))
            data{end+1}=parsed{1};
        else
            fprintf('Unable to parse line %d with content: %S',ix,line);
        end
        line=fgetl(file_id);
        ix=ix+1;
    end
    

    Short summary of regular expressions:

    '(...)' Everything between is a "token" which is returned

    '([\d .]+)' Numbers, white space and "."

    '([\d .]+)' Numbers and white space

    '([\w ]+)' Word, including white space

    '(\d\d/\d\d/\d\d\d\d)' date

    This expression is a bit "lazy". It not only accepts "0.000" as a number but also "0.0 00." or some other combinations, but it should be enough to detect all errors. If not, the expression has to be improved.