
Read file and make every character in a separate column


I have a huge file (a square data file containing a sequence alignment) and want to put every position into a separate column. However, readr::read_delim, for instance, can't take an empty delimiter, and readr::read_fwf seems to require specifying every position. I have more than 35000 positions.

Example input file:

EIGMEYRTVSGVAGPLVILDKVKGPKYQEI.....
EIGMEYRTVSGVAGPLVILDKVKGPKYQEI.....
EIGMEYRTVSGVAGPLVILDKVKGPKYQEI.....

Output:

col1 col2 col3 col4 col5 col6....
E    I    G    M    E    Y.....
E    I    G    M    E    Y.....
E    I    G    M    E    Y.....


Solution

  • readr::read_fwf has a few different ways you can specify the field widths using the col_positions argument. Here's a test file, test.txt:

    Hdvsmf
    Dfhjds
    Dfhjkd
    Dfklds
    Dkjffd
    Dsfjkd
    fkldsf
    

    Assuming you know how many fields there are, you can specify a vector of field widths (1 character wide, repeated 6 times because each line of this test file has six characters):

    read_fwf('test.txt', col_positions = fwf_widths(rep(1, 6)))
    

    This is probably easier than specifying start and end positions for each field. You can also provide a character vector of column names to fwf_widths, like:

    fwf_widths(rep(1, 6), paste0('col', 1:6))
    

    If you don't know how many fields you have, you can also bring it in as a single column and then use tidyr::separate to extract your columns (the sep argument can take a vector of numeric positions, not just delimiters):

    library(readr)
    library(tidyr) # also provides the %>% pipe

    # a data frame with everything in one column named blah
    df1 = read_csv('test.txt', col_names = 'blah')
    # nchar() counts the characters in the first line; length() would just return 1
    field_count = nchar(df1$blah[1]) # assuming all lines are the same length!
    
    # nb: parentheses for field_count - 1 are super important! without them, 1:field_count - 1 evaluates to 0:(field_count - 1)
    df1 = df1 %>% separate(blah, into = paste0('col', 1:field_count), sep = 1:(field_count - 1))
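
If you don't know the field count up front but still want to use read_fwf, one option is to measure the first line and build the widths from it. The following is a minimal sketch, not a definitive implementation: it assumes every line has the same length, it writes the test data from above to a temporary file so it can run on its own, and the names path, n_fields and aln are made up for illustration. It also forces every column to character, because readr's type guessing may otherwise read a column containing only "T" or "F" as logical.

```r
library(readr)

# Sample data (same content as test.txt above), written to a temporary file
# so the sketch is self-contained
path <- tempfile(fileext = ".txt")
writeLines(c("Hdvsmf", "Dfhjds", "Dfhjkd", "Dfklds", "Dkjffd", "Dsfjkd", "fkldsf"), path)

# Count the characters in the first line
# (assumption: every line in the file has the same length)
n_fields <- nchar(read_lines(path, n_max = 1))

# One single-character field per position; read everything as character
# so that residue letters are never guessed as another type
aln <- read_fwf(
  path,
  col_positions = fwf_widths(rep(1, n_fields), paste0("col", seq_len(n_fields))),
  col_types = cols(.default = col_character())
)

dim(aln)
```

For the real alignment, replace path with the path to your file; the number of columns then follows from the first line, so the 35000 positions never have to be typed out.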