I have a dataset of 1911518 values. I have used the textscan function. But the function returns only 1424458 values. I again created a new variable to get the remaining values but this time it gave me around 9000 values. Does anyone know what the reason for this can be?
n=9
N=1911518
file_id=fopen('CRSP.csv');
C=textscan(file_id,'%s',n,'delimiter', ','); %To get the headers
C_text=textscan(file_id,'%s %s %s %d %d %d %d %f %f',N, 'delimiter' , ',');
%Returns 1424458
C_text1=textscan(file_id,'%s %s %s %d %d %d %d %f %f',N, 'delimiter' , ',');
%Returns only 9000 values
fclose(file_id);
Sample Data
DATE,COMNAM,TICKER,PERMNO,PERMCO,SHROUT,VOL,RET,sprtrn
01/02/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 23700, 0.000000, 0.007793
01/03/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 20800, 0.020000, 0.000950
01/04/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 65300, -0.026144, -0.005826
01/05/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 340600, 0.000000, -0.001587
01/08/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 3400, 0.000000, 0.002821
01/09/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 27200, -0.006711, -0.014568
01/10/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 25400, -0.033784, -0.018000
01/11/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 14000, 0.000000, 0.007034
I would assume, that there is some error in the data, or format pattern does not match the data. Try to extract these lines:
file_id=fopen('CRSP.csv');
for idx=1:1424456
fgetl(file_id); %dump data
end
for idx=1:10
fprintf('%s\n',fgetl(file_id));
end
If there is an error, it should be at the 2rd or 3nd printed line. Anything special there? Maybe a COMNAM
with some special character?
To read the file, i would use the following code to read line by line:
file_id=fopen('CRSP.csv');
line=fgetl(file_id);
data={};
int ix=1;
while(ischar(line))
[parsed,sindex,eindex] = regexpi(line,'(\d\d/\d\d/\d\d\d\d)\s*, ([\w ]+), ([\w ]+), ([\d]+), ([\d]+), ([\d]+), ([\d]+), ([\d \.]+), ([\d \.]+)','tokens')
if ~isempty(sindex)&&numel(sindex)==1&&(sindex==1)&&(eindex==numel(x))
data{end+1}=parsed{1};
else
fprintf('Unable to parse line %d with content: %S',ix,line);
end
line=fgetl(file_id);
ix=ix+1;
end
Short summary of regular expressions:
'(...)' Everything between is a "token" which is returned
'([\d .]+)' Numbers, white space and "."
'([\d .]+)' Numbers and white space
'([\w ]+)' Word, including white space
'(\d\d/\d\d/\d\d\d\d)' date
This expression is a bit "lazy". It not only accepts "0.000" as a number but also "0.0 00." or some other combinations, but it should be enough to detect all errors. If not, the expression has to be improved.