Sometimes people duplicate spaces in files for ease of human reading. However, it seems that readr'sread_delim
cannot handle this use case.
Example output from PLINK:
FID IID PHENO CNT CNT2 SCORE
0 ERR1136327 -9 2000 417 -0.000263553
0 ERR1136328 -9 2808 755 -0.000119435
0 ERR1136329 -9 1026 242 8.63494e-05
0 ERR1136330 -9 2688 880 0.000517726
0 ERR1136331 -9 1868 567 0.000264016
0 ERR1136332 -9 3522 1368 0.000144985
(initial few lines)
Try reading with read_delim
:
> d = read_delim("data/no_vcf_filtering/plink.profile", delim = " ")
Missing column names filled in: 'X1' [1], 'X3' [3], 'X4' [4], 'X5' [5], 'X6' [6], 'X7' [7], 'X8' [8], 'X9' [9], 'X10' [10], 'X11' [11], 'X13' [13], 'X15' [15], 'X16' [16], 'X17' [17], 'X19' [19], 'X20' [20], 'X22' [22], 'X23' [23], 'X24' [24]Parsed with column specification:
cols(
.default = col_character(),
X4 = col_integer(),
IID = col_integer(),
X15 = col_integer(),
X16 = col_integer(),
X17 = col_integer(),
CNT = col_integer(),
X19 = col_double(),
X20 = col_double(),
CNT2 = col_double(),
X22 = col_double(),
X23 = col_double(),
X24 = col_double(),
SCORE = col_double()
)
See spec(...) for full column specifications.
number of columns of result is not a multiple of vector length (arg 1)215 parsing failures.
row # A tibble: 5 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 1 <NA> 25 columns 20 columns 'data/no_vcf_filtering/plink.profile' file 2 2 <NA> 25 columns 20 columns 'data/no_vcf_filtering/plink.profile' row 3 3 <NA> 25 columns 20 columns 'data/no_vcf_filtering/plink.profile' col 4 4 <NA> 25 columns 20 columns 'data/no_vcf_filtering/plink.profile' expected 5 5 <NA> 25 columns 20 columns 'data/no_vcf_filtering/plink.profile'
... ................. ... ......................................................................... ........ ......................................................................... ...... ......................................................................... .... ......................................................................... ... ......................................................................... ... ......................................................................... ........ .........................................................................
See problems(...) for more details.
The obvious solution here doesn't work:
d = read_delim("data/no_vcf_filtering/plink.profile", delim = " +")
Parsed with column specification:
cols(
.default = col_character(),
X4 = col_integer(),
IID = col_integer(),
X15 = col_integer(),
X16 = col_integer(),
X17 = col_integer(),
#etc.
I found a round-about solution by converting the spaces into tabs, joining the lines with newlines, and then reading as tsv (and getting rid of the first empty column in this case), but it shouldn't be this difficult. Am I missing something obvious?
> read_lines("data/no_vcf_filtering/plink.profile") %>% str_replace_all(" +", "\t") %>% str_c(collapse = "\n") %>% read_tsv() %>% .[, -1]
# A tibble: 230 x 6
FID IID PHENO CNT CNT2 SCORE
<int> <chr> <int> <int> <int> <dbl>
1 0 ERR1136327 -9 2000 417 -2.63553e-04
2 0 ERR1136328 -9 2808 755 -1.19435e-04
3 0 ERR1136329 -9 1026 242 8.63494e-05
4 0 ERR1136330 -9 2688 880 5.17726e-04
5 0 ERR1136331 -9 1868 567 2.64016e-04
6 0 ERR1136332 -9 3522 1368 1.44985e-04
7 0 ERR1136333 -9 870 110 -1.25087e-04
8 0 ERR1136334 -9 2936 877 -6.35191e-04
9 0 ERR1136335 -9 3048 914 -2.22427e-06
10 0 ERR1136336 -9 3184 814 2.77346e-04
# ... with 220 more rows
Warning message:
Missing column names filled in: 'X1' [1]
readr::read_table
is the function suited to this format.
> read_table("test.txt")
Parsed with column specification:
cols(
FID = col_integer(),
IID = col_character(),
PHENO = col_integer(),
CNT = col_integer(),
CNT2 = col_integer(),
SCORE = col_double()
)
# A tibble: 6 x 6
FID IID PHENO CNT CNT2 SCORE
<int> <chr> <int> <int> <int> <dbl>
1 0 ERR1136327 -9 2000 417 -2.63553e-04
2 0 ERR1136328 -9 2808 755 -1.19435e-04
3 0 ERR1136329 -9 1026 242 8.63494e-05
4 0 ERR1136330 -9 2688 880 5.17726e-04
5 0 ERR1136331 -9 1868 567 2.64016e-04
6 0 ERR1136332 -9 3522 1368 1.44985e-04
This is also true of the base::
functions - read.table
vs read.delim
.