Search code examples
rdelimiterreadr

readr: best practice for reading where one needs to merge delimiters?


Sometimes people duplicate spaces in files for ease of human reading. However, it seems that readr'sread_delim cannot handle this use case.

Example output from PLINK:

 FID          IID  PHENO    CNT   CNT2    SCORE
   0   ERR1136327     -9   2000    417 -0.000263553
   0   ERR1136328     -9   2808    755 -0.000119435
   0   ERR1136329     -9   1026    242 8.63494e-05
   0   ERR1136330     -9   2688    880 0.000517726
   0   ERR1136331     -9   1868    567 0.000264016
   0   ERR1136332     -9   3522   1368 0.000144985

(initial few lines)

Try reading with read_delim:

> d = read_delim("data/no_vcf_filtering/plink.profile", delim = " ")
Missing column names filled in: 'X1' [1], 'X3' [3], 'X4' [4], 'X5' [5], 'X6' [6], 'X7' [7], 'X8' [8], 'X9' [9], 'X10' [10], 'X11' [11], 'X13' [13], 'X15' [15], 'X16' [16], 'X17' [17], 'X19' [19], 'X20' [20], 'X22' [22], 'X23' [23], 'X24' [24]Parsed with column specification:
cols(
  .default = col_character(),
  X4 = col_integer(),
  IID = col_integer(),
  X15 = col_integer(),
  X16 = col_integer(),
  X17 = col_integer(),
  CNT = col_integer(),
  X19 = col_double(),
  X20 = col_double(),
  CNT2 = col_double(),
  X22 = col_double(),
  X23 = col_double(),
  X24 = col_double(),
  SCORE = col_double()
)
See spec(...) for full column specifications.
number of columns of result is not a multiple of vector length (arg 1)215 parsing failures.
row # A tibble: 5 x 5 col     row   col   expected     actual                                  file expected   <int> <chr>      <chr>      <chr>                                 <chr> actual 1     1  <NA> 25 columns 20 columns 'data/no_vcf_filtering/plink.profile' file 2     2  <NA> 25 columns 20 columns 'data/no_vcf_filtering/plink.profile' row 3     3  <NA> 25 columns 20 columns 'data/no_vcf_filtering/plink.profile' col 4     4  <NA> 25 columns 20 columns 'data/no_vcf_filtering/plink.profile' expected 5     5  <NA> 25 columns 20 columns 'data/no_vcf_filtering/plink.profile'
... ................. ... ......................................................................... ........ ......................................................................... ...... ......................................................................... .... ......................................................................... ... ......................................................................... ... ......................................................................... ........ .........................................................................
See problems(...) for more details.

The obvious solution here doesn't work:

d = read_delim("data/no_vcf_filtering/plink.profile", delim = " +")
Parsed with column specification:
cols(
  .default = col_character(),
  X4 = col_integer(),
  IID = col_integer(),
  X15 = col_integer(),
  X16 = col_integer(),
  X17 = col_integer(),
#etc.

I found a round-about solution by converting the spaces into tabs, joining the lines with newlines, and then reading as tsv (and getting rid of the first empty column in this case), but it shouldn't be this difficult. Am I missing something obvious?

> read_lines("data/no_vcf_filtering/plink.profile") %>% str_replace_all(" +", "\t") %>% str_c(collapse = "\n") %>% read_tsv() %>% .[, -1]
# A tibble: 230 x 6
     FID        IID PHENO   CNT  CNT2        SCORE
   <int>      <chr> <int> <int> <int>        <dbl>
 1     0 ERR1136327    -9  2000   417 -2.63553e-04
 2     0 ERR1136328    -9  2808   755 -1.19435e-04
 3     0 ERR1136329    -9  1026   242  8.63494e-05
 4     0 ERR1136330    -9  2688   880  5.17726e-04
 5     0 ERR1136331    -9  1868   567  2.64016e-04
 6     0 ERR1136332    -9  3522  1368  1.44985e-04
 7     0 ERR1136333    -9   870   110 -1.25087e-04
 8     0 ERR1136334    -9  2936   877 -6.35191e-04
 9     0 ERR1136335    -9  3048   914 -2.22427e-06
10     0 ERR1136336    -9  3184   814  2.77346e-04
# ... with 220 more rows
Warning message:
Missing column names filled in: 'X1' [1]

Solution

  • readr::read_table is the function suited to this format.

    > read_table("test.txt")
    Parsed with column specification:
    cols(
      FID = col_integer(),
      IID = col_character(),
      PHENO = col_integer(),
      CNT = col_integer(),
      CNT2 = col_integer(),
      SCORE = col_double()
    )
    # A tibble: 6 x 6
        FID        IID PHENO   CNT  CNT2        SCORE
      <int>      <chr> <int> <int> <int>        <dbl>
    1     0 ERR1136327    -9  2000   417 -2.63553e-04
    2     0 ERR1136328    -9  2808   755 -1.19435e-04
    3     0 ERR1136329    -9  1026   242  8.63494e-05
    4     0 ERR1136330    -9  2688   880  5.17726e-04
    5     0 ERR1136331    -9  1868   567  2.64016e-04
    6     0 ERR1136332    -9  3522  1368  1.44985e-04
    

    This is also true of the base:: functions - read.table vs read.delim.