
textscan doesn't work well with big files in Matlab?


I'm currently using the latest Matlab version on a Mac with 16 GB of RAM.

I tried to split a really big cube file (100 GB) into smaller cube files of only 210151 lines each using this code:

%% Splitting
% open the result.cube file
fid = fopen(cube) ;
if fid == -1
    error('File could not be opened.') ;
end

m = 1 ;

while ~feof(fid)
    % skip the alpha and beta density blocks
    fseek(fid,16596786,0) ;

    % copy the spin density block
    snap = textscan(fid,'%s',210150,'Delimiter','\n','Whitespace','') ;

    % print the cube snapshot to the subdirectory (name1 holds its path)
    name = string(step_nr(m)) + '.cube' ;
    full_path = fullfile(name1,name) ;
    fid_new = fopen(full_path,'w') ;
    if fid_new == -1
        error('Output file could not be opened.') ;
    end
    fprintf(fid_new,'%s\n', snap{1}{:}) ;
    fclose(fid_new) ;
    m = m + 1 ;
end

fclose(fid) ;

save('steps','step_nr')

My problem is: apparently, textscan is not suited to this kind of file. I also tried line-by-line copying with fgetl, which in turn takes ages for a 100 GB file. Is there a more efficient way to split the file?

I've read about fscanf and tried this:

tic;
fid = fopen('result.cube');
% skip the two comment lines
fgetl(fid) ; fgetl(fid) ;
f = fscanf(fid, '%d %f %f %f', [4 4]) ;
s = fscanf(fid, '%d %f %f %f %f', [5 192]) ;
n = fscanf(fid, '%f %f %f %f %f %f', [6 209953]) ;
fid_new = fopen('new','w') ;
fprintf(fid_new, '%d %.6f %.6f %.6f\n', f) ;
fprintf(fid_new, '%d %.6f %.6f %.6f %.6f\n', s) ;
fprintf(fid_new, '%f %f %f %f %f %f\n', n) ;
fclose(fid_new) ;
fclose(fid) ;
t = toc

But my problem here is: s is not aligned in the individual file the way it is in the big file, and n is written as plain decimals instead of scientific notation (e.g. E-02). I also tried copying it line by line, but that takes forever. Any suggestions on how to improve this? I want it to look like this:

[screenshot of the desired cube-file column layout]


Solution

  • Do you need to do this in Matlab? Can you not just use the split command-line tool?
    https://man7.org/linux/man-pages/man1/split.1.html

    This should do the job:

    split input_file.txt --lines=210151
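
    A quick way to see the chunking behaviour is to run it on a toy file (the file name and chunk size below are purely illustrative). One caveat for the asker's machine: the stock BSD split on macOS does not accept `--lines=`; the short form `-l`, used below, works with both BSD and GNU split.

```shell
# split a 10-line toy file into chunks of 3 lines each;
# -l is the portable spelling of --lines=
seq 1 10 > toy.txt
split -l 3 toy.txt toy_chunk_
# toy_chunk_aa, _ab, _ac hold 3 lines each; toy_chunk_ad holds the last line
wc -l toy_chunk_*
# concatenating the chunks reproduces the original byte for byte
cat toy_chunk_* | cmp - toy.txt && echo "chunks are lossless"
```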
    

    If you additionally want to skip (discard) the first 16596786 bytes of the input file (note that `tail -c +K` starts output at byte K, so skipping N bytes needs K = N+1):

    tail -c +16596787 input_file.txt | split --lines=210151
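
    One subtlety worth knowing: `tail -c +K` starts printing *at* byte K (bytes are 1-indexed), so `+K` discards K-1 bytes, not K. A quick sanity check:

```shell
# tail -c +K outputs from byte K onward (1-indexed),
# so +3 drops only the first two bytes of "abcdef"
printf 'abcdef' | tail -c +3   # prints "cdef"
```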
    

    To first split the input file and then remove the leading 16596786 bytes from each chunk (again using +N+1, since `tail -c +K` keeps byte K itself):

    split --lines=210151 input_file output_
    for i in output_*; do tail -c +16596787 "${i}" > "${i}.chopped"; done
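
    The split-then-trim loop can be exercised end to end on a toy file (all names and sizes here are made up for illustration; each chunk's first line is 4 bytes long, so `tail -c +5` drops exactly that line):

```shell
# split a 6-line toy file into 2-line chunks, then strip the
# first 4 bytes (the leading "xxx\n" line) from every chunk
printf 'aaa\nbbb\nccc\nddd\neee\nfff\n' > in.txt
split -l 2 in.txt out_
for i in out_*; do tail -c +5 "${i}" > "${i}.chopped"; done
cat out_aa.chopped   # prints "bbb"
```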