Accommodating blank entries in .txt files using textscan - MATLAB


I have a 9-column tab-delimited .txt file containing mixed data types. Some entries in the 'type' column are empty.

id  id_2 s1      s2      st1     st2          type         desig  num
1   1   51371   51434   52858   52939   5:3_4:4_6:2_4:4_2:6 CO     1
2   1   108814  108928  109735  110856  5:3_4:4_6:2_4:4_2:7 CO     2
3   1   130975  131303  131303  132066  5:3_4:4_6:2_4:4_2:8 NCO    3
4   1   191704  191755  194625  194803                      NCO    4
5   2   69355   69616   69901   70006                       CO     5
6   2   202580  202724  204536  205151  5:3_4:4_6:2_4:4     CO     6

Because of the mixed format types, I've been using textscan to import this data:

data = textscan(fid1, '%*f %f %f %f %f %f %*s %s %*[^\r\n]','HeaderLines',1);

This takes columns 2-6, skips 'type', and takes the 8th column.

This approach fails on rows where 'type' is empty: textscan does not treat the empty entry as a column, so the remaining fields shift left and it reads '4' or '5' instead of 'NCO' or 'CO'.

Is there a way to prevent this? I know I could alter the original .txt files to include something like 'NA' for empty entries but this is less desirable than a more robust way to read such files.

EDIT:

In addition to the answer below, simply specifying the delimiter used appears to fix the issue:

data = textscan(fid1, '%*f %f %f %f %f %f %*s %s %*[^\r\n]','HeaderLines',1,'delimiter','\t');
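
As a minimal sketch of the fix in the edit above, the snippet below writes a small tab-delimited sample to a temporary file and reads it back with the delimiter set explicitly. The temporary file and the choice of three sample rows are assumptions made for illustration. The format string is the one from the question. With 'Delimiter','\t', two consecutive tabs should be read as an empty field instead of being collapsed, which keeps 'desig' in the expected column.

```matlab
% Write a small tab-delimited sample (hypothetical temp file); two rows have an empty 'type' field
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'id\tid_2\ts1\ts2\tst1\tst2\ttype\tdesig\tnum\n');
fprintf(fid, '3\t1\t130975\t131303\t131303\t132066\t5:3_4:4_6:2_4:4_2:8\tNCO\t3\n');
fprintf(fid, '4\t1\t191704\t191755\t194625\t194803\t\tNCO\t4\n');
fprintf(fid, '5\t2\t69355\t69616\t69901\t70006\t\tCO\t5\n');
fclose(fid);

% Read it back with the tab delimiter given explicitly
fid1 = fopen(fname, 'r');
data = textscan(fid1, '%*f %f %f %f %f %f %*s %s %*[^\r\n]', ...
    'HeaderLines', 1, 'Delimiter', '\t');
fclose(fid1);
delete(fname);

id_2  = data{1};   % numeric column 2 of the file
desig = data{6};   % cell array of strings from the 'desig' column
disp(desig)
```

Here data is a 1-by-6 cell array: the first five cells hold the numeric columns 2-6 and the last holds the 'desig' strings.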

Solution

  • Here's one approach with importdata and strsplit -

    %// Read the file line by line with importdata
    data = importdata('data1.txt'); %// 'data1.txt' is the input text file
    
    %// Split each line on spaces (consecutive spaces collapse into one delimiter)
    split_data = cellfun(@(x) strsplit(x,' '),data,'UniformOutput',false);
    
    N = numel(split_data); %// number of rows in the input text file
    
    %// Set up the output cell and mask arrays (one column per row of the file)
    out_cell = cell(9,N);
    mask = true(9,N);
    
    %// Rows that do not split into 9 tokens are missing their "type" entry,
    %// so clear row 7 of the mask for those rows
    mask(7,cellfun(@length,split_data)~=9) = false;
    
    %// Fill the masked positions of out_cell with the split entries
    out_cell(mask) = [split_data{:}];
    out = out_cell';
    

    Sample run -

    >> type data1.txt
    
    id  id_2 s1      s2      st1     st2          type         desig  num
    1   1   51371   51434   52858   52939   5:3_4:4_6:2_4:4_2:6 CO     1
    2   1   108814  108928  109735  110856  5:3_4:4_6:2_4:4_2:7 CO     2
    3   1   130975  131303  131303  132066  5:3_4:4_6:2_4:4_2:8 NCO    3
    4   1   191704  191755  194625  194803                      NCO    4
    5   2   69355   69616   69901   70006                       CO     5
    6   2   202580  202724  204536  205151  5:3_4:4_6:2_4:4     CO     6
    >> out
    out = 
        'id'    'id_2'    's1'        's2'        'st1'       'st2'       'type'                   'desig'    'num'
        '1'     '1'       '51371'     '51434'     '52858'     '52939'     '5:3_4:4_6:2_4:4_2:6'    'CO'       '1'  
        '2'     '1'       '108814'    '108928'    '109735'    '110856'    '5:3_4:4_6:2_4:4_2:7'    'CO'       '2'  
        '3'     '1'       '130975'    '131303'    '131303'    '132066'    '5:3_4:4_6:2_4:4_2:8'    'NCO'      '3'  
        '4'     '1'       '191704'    '191755'    '194625'    '194803'                       []    'NCO'      '4'  
        '5'     '2'       '69355'     '69616'     '69901'     '70006'                        []    'CO'       '5'  
        '6'     '2'       '202580'    '202724'    '204536'    '205151'    '5:3_4:4_6:2_4:4'        'CO'       '6'
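
Every entry in out is a string, so getting the numeric columns 2-6 and the 'desig' column that the question asks for takes one more step. The following is only a sketch: the small out literal stands in for the result above (values copied from the sample run), and the names nums, desig and hasType are made up for illustration.

```matlab
% Stand-in for the 'out' cell array from the sample run (header row plus two data rows)
out = { 'id','id_2','s1','s2','st1','st2','type','desig','num';
        '4','1','191704','191755','194625','194803',[],'NCO','4';
        '6','2','202580','202724','204536','205151','5:3_4:4_6:2_4:4','CO','6' };

% Convert columns 2-6 (skipping the header row) from strings to a numeric matrix
nums = str2double(out(2:end, 2:6));

% Column 8 stays a cell array of strings ('CO' / 'NCO')
desig = out(2:end, 8);

% Logical flag per row: false where the 'type' entry was blank in the file
hasType = ~cellfun(@isempty, out(2:end, 7));
```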