Tags: r, regex, text-parsing

Extract and organise a text file into a data frame


I have a huge text file with the following structure:

AA<-tibble::tribble(
  ~`-------------------------------------------------`,
  "ABCD 2002201234 09-06-2015 10:34",
  "-------------------------------------------------",
  "Lorem ipsum",
  "Lorem ipsum",
  "Lorem ipsum Lorem ipsum",
  "Lorem ipsum: Lorem ipsum",
  "123456",
  "AB",
  "AB",
  "Lorem ipsum",
  "-------------------------------------------------",
  "ABCDEF 1001101234 05-03-2011 09:15",
  "-------------------------------------------------",
  "TEST",
  "TEST"
)

I want to organise the above into a data frame with the variables ID, DATE and TEXT. ID should be the 10-digit number (in the example, 2002201234 and 1001101234), DATE is self-explanatory, and TEXT should be all the text between the lower dashed line ("-------------") of a post and the upper dashed line of the next post.

What is the easiest way to do this?

Regards, H


Solution

  • In base R (gregexec() requires R 4.1.0 or later):

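    # Join all lines into one string so the regex can span line breaks
    x <- paste(AA[[1]], collapse = '\n')
    # Capture: 10-digit ID, rest of the header line (date), then the text after the dashed line
    y <- regmatches(x, gregexec("(\\d{10}) *(.*?)\n-+([^-]+)", x, perl = TRUE))[[1]]
    # Rows 2:4 hold the three capture groups; transpose so each match becomes a row
    setNames(data.frame(t(y[2:4,])), c('ID', 'Date', 'Text'))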
    
      ID         Date             Text                                        
      <chr>      <chr>            <chr>                                       
    1 2002201234 09-06-2015 10:34 "\nLorem ipsum\nLorem ipsum\nLorem ipsum Lo…
    2 1001101234 05-03-2011 09:15 "\nTEST\nTEST"
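
Note that the `[^-]+` part of the pattern above stops at the first hyphen, so a post whose text contains a "-" would be cut short. If that matters for the real file, a line-based variant may be more robust. The following is only a sketch, not part of the original answer: the names `header_re`, `is_header`, `is_rule`, `post` and `result` are made up for illustration, the sample `lines` vector is a shortened copy of the question's data, and the date is assumed to be day-month-year.

```r
# Sample lines mirroring the question's structure (shortened).
lines <- c(
  "-------------------------------------------------",
  "ABCD 2002201234 09-06-2015 10:34",
  "-------------------------------------------------",
  "Lorem ipsum",
  "123456",
  "AB",
  "-------------------------------------------------",
  "ABCDEF 1001101234 05-03-2011 09:15",
  "-------------------------------------------------",
  "TEST",
  "TEST"
)

# Assumed header shape: "<label> <10 digits> <dd-mm-yyyy> <hh:mm>".
header_re <- "^\\S+ (\\d{10}) (\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2})$"
is_header <- grepl(header_re, lines)
is_rule   <- grepl("^-+$", lines)   # lines consisting only of dashes

# Number each line by the most recent header seen, then keep only body lines.
post <- cumsum(is_header)
keep <- !is_header & !is_rule & post > 0

# Using a factor with all post numbers as levels keeps posts with no text
# (they end up with an empty string instead of being dropped).
groups <- split(lines[keep],
                factor(post[keep], levels = seq_len(sum(is_header))))

result <- data.frame(
  ID   = sub(header_re, "\\1", lines[is_header]),
  # Assumption: the date is day-month-year; adjust `format` if it is not.
  DATE = as.POSIXct(sub(header_re, "\\2", lines[is_header]),
                    format = "%d-%m-%Y %H:%M", tz = "UTC"),
  TEXT = unname(vapply(groups, paste, character(1), collapse = "\n")),
  stringsAsFactors = FALSE
)

print(result)
```

With the question's tibble, `lines <- AA[[1]]` should work as the input; for the real file, `readLines()` on its path would give the same kind of character vector. Unlike the regex version, TEXT here has no leading newline, and hyphens inside the text are preserved.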