
How can I import messy text files in R?


Does anybody have some advice on how to import a text file that looks like this:

"X1"II"X2"II"X3"II"X4"II"X5"""1"II4II"123-23"II01-03-2006II"209"II"1"II5II"124-23"II02-03-2006II"208"II....(etc.)?

into R and convert it into a data frame? I would like to achieve something like this:

| X1 | X2 | X3 | X4 | X5 |
| -- | -- | ------- | ---------- | --- |
| 1 | 4 | 123-23 | 01-03-2006 | 209 |
| 1 | 5 | 124-23 | 02-03-2006 | 208 |
.....

I managed to use read.file to import it as a long string but got stuck after that. I'm grateful for any help.


Solution

  • I copied your text into a text file:

    "X1"II"X2"II"X3"II"X4"II"X5"" "1"II4II"123-23"II01-03-2006II"209"II "1"II5II"124-23"II02-03-2006II"208"

    From inspection, it seems that:

    • The header row is X1 X2 X3 X4 X5.
    • Columns are separated by II.
    • Rows are separated by the character that displays as a rectangle, which becomes \v (a vertical tab) once the file is read in with readr::read_file.

    Based on that, you are looking for a data.frame with 5 columns. NOTE: some rows end with a trailing II before the row separator (as in "209"II), which is odd because II should only separate columns, not close a row. The code below includes a fix for that.
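    If it is unclear which character the rectangle really is, one way to check is to list the characters in the text that are neither letters, digits, punctuation nor plain spaces, and look at their code points. This is only a sketch: sample_text is a hypothetical stand-in for the file contents, written with \v where the rectangle appears.

    ```r
    # Hypothetical sample standing in for the file; "\v" marks the rectangle character.
    sample_text <- "\"X1\"II\"X2\"\v\"1\"II4II\v\"1\"II5"

    # Split into single characters and keep the ones that are not
    # letters, digits, punctuation or a plain space.
    chars <- strsplit(sample_text, "")[[1]]
    odd <- chars[!grepl("[[:alnum:][:punct:] ]", chars)]

    # Code points of the unusual characters; 11 is the vertical tab (\v).
    codes <- unique(utf8ToInt(paste(odd, collapse = "")))
    print(codes)
    ```

    On a real file, the same check can be run on the string returned by readr::read_file.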

    Since functions like read.table require sep to be a single byte, you cannot use something like read.table(file = 'text.txt', sep = 'II'). One working solution is:

    library(magrittr)
    library(stringr)
    library(readr)
    
    text <- readr::read_file(file = 'C:/Users/lcroote/my_data/read_test.txt')
    
    text %>%
      str_replace_all('"', '') %>%      # strip the double quotes around the values
      str_replace_all('II', ',') %>%    # columns are separated by II
      str_replace_all(',\v', '\n') %>%  # some rows end with an extra separator
      str_replace_all('\v', '\n') %>%   # turn \v into \n so read.table sees rows
      read.table(text = ., sep = ',', header = TRUE, fill = TRUE, row.names = NULL)
    >
       X1 X2     X3         X4  X5
    1  1  4 123-23 01-03-2006 209
    2  1  5 124-23 02-03-2006 208
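
    The same cleanup can also be sketched in base R without the pipe, by splitting on the row separator first and then on II. This is an alternative under the same assumptions about the file (II between columns, \v between rows), not the method above: raw_text is a hypothetical stand-in for what read_file returns, and the type.convert call on a data frame assumes a reasonably recent R version.

    ```r
    # Hypothetical sample standing in for the file contents; "\v" is the row separator.
    raw_text <- paste0(
      "\"X1\"II\"X2\"II\"X3\"II\"X4\"II\"X5\"\"\v",
      "\"1\"II4II\"123-23\"II01-03-2006II\"209\"II\v",
      "\"1\"II5II\"124-23\"II02-03-2006II\"208\""
    )

    rows <- strsplit(raw_text, "\v", fixed = TRUE)[[1]]  # one element per row
    rows <- gsub('"', "", rows, fixed = TRUE)            # drop the double quotes
    rows <- sub("II$", "", rows)                         # drop a trailing separator
    fields <- strsplit(rows, "II", fixed = TRUE)         # split each row into fields

    # First row holds the column names, the remaining rows hold the data.
    df <- as.data.frame(do.call(rbind, fields[-1]), stringsAsFactors = FALSE)
    names(df) <- fields[[1]]
    df <- type.convert(df, as.is = TRUE)                 # numbers become numeric, the rest stay text
    print(df)
    ```

    Splitting with fixed = TRUE treats II and \v as literal strings, so no regular-expression escaping is needed. The sketch assumes every data row has the same number of fields as the header; ragged rows would need extra handling before rbind.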